Analyzing the Relationship

Pearson Correlation Coefficient

If the relationship between variables were always perfectly linear, the equation y = mx + b (y = predicted value, m = slope of the line of best fit, x = observed or measured value, and b = intercept) would be all that is required to predict the value of one variable from the value of the other. But typical relationships in business and many other content domains are not perfectly linear. As in the example above, observations are scattered around the line of best fit. So we need an additional statistic.
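To make the idea concrete, here is a minimal Python sketch of predicting with a line of best fit. The slope and intercept values are hypothetical, chosen only for illustration:

```python
# Predicting y from x with a line of best fit: y = m*x + b.
# The default slope (m) and intercept (b) below are made-up values,
# not estimates from any real data set.
def predict(x, m=2.5, b=10.0):
    """Return the predicted y for an observed x."""
    return m * x + b

print(predict(4))  # 2.5 * 4 + 10.0 → 20.0
```

In practice, m and b come from fitting the line to your data, which Excel does for you when you add a trendline to a scatter plot.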

The correlation coefficient is one of the most common and useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The Pearson correlation, denoted by the symbol r, describes the degree of linear relationship between two arrays of numbers.

Now we're ready to compute the correlation value. The formula for the correlation is shown below. Even though computers typically calculate correlation for us, the details of the calculation are included so you can see how it works. As with all statistics, understanding how a statistic is calculated helps you know what it actually means.
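If you'd like to see the arithmetic behind the formula, here is a short Python sketch that computes Pearson's r from scratch: the sum of the products of the deviations from each mean, divided by the product of the sums of squared deviations. The data values are made up for illustration only:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: co-variation of x and y divided by the product
    of their individual variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4))  # → 0.7746
```

Notice that the result is a strong positive correlation: as x increases, y tends to increase too, but not perfectly.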

Freaking out a bit? Don't worry, you don't have to memorize that formula. Rather, it's easier to understand correlation coefficients by examining the scatter plot chart. Correlations vary from -1 to 1. "Stronger" correlations are those which are closer to -1 and 1. Values closer to 0 are "weaker." In other words, a correlation coefficient of -0.5 is just as strong as 0.5. The only difference is that -0.5 means that as one variable increases, the other decreases about 50% of the time. On the other hand, 0.5 means that as one variable increases, the other also increases about 50% of the time. That's not a perfect interpretation, but it's close enough for the depth of this discussion. Take a look at the scatter plots and correlation coefficients below:

These first three are examples of perfect correlation. Notice that the actual slope of the line doesn't matter. It can be positive, negative, or even a non-integer. What's important is that, as x changes, y changes by a perfectly predictable amount. However, in real life, if you have two variables with a correlation of -1 or 1, then you are basically making a scatter plot of a variable against another version of itself (e.g. birth year and age). In other words, -1 and 1 are hypothetical limits. If we actually find a perfect correlation in practice, then we need to "throw out" one of the variables for future analysis. This is because there is a rule (see Regression later in this chapter) that you can't include variables in a model that are too highly correlated.

These next correlations are strong, but not perfect. However, you can interpret them roughly as "a unit change in x is 91 or 87 percent likely to result in a unit change in y," or vice versa. Also, remember that just because one variable is on the Y axis and the other is on the X axis doesn't imply that X causes Y.

These correlations are weak relative to those in the scatter plots above. However, they may still be statistically significant. Other tests are required to say conclusively whether these relationships are due to chance or due to an actual--albeit weak--relationship between the two variables. However, notice that no matter how strong or weak a correlation coefficient is, you can still draw a line of best fit representing the slope.

This is an example of no correlation whatsoever, where r = 0.0. Notice that each of these scatter plots also includes another statistic (R²). We will discuss this later.

Now it's time to calculate a correlation table in Excel based on the Bike Buyers data set. Watch and follow along with the video to learn how this is done.

Finally, sometimes you just need to calculate the correlation between two variables and you don't need an entire matrix. If so, try the CORREL function in Excel. All you need to do is input two continuous columns of values. For example: =CORREL(A1:A10, B1:B10). Try it out on the Bike Buyers data set you downloaded above to calculate the correlation between income and age. Did you get 0.17 like we did in the correlation matrix table? If not, watch the video below and try again:

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="pz2st">

Analysis of Variance (ANOVA)

As you've hopefully noticed, the Pearson correlation coefficient can only be calculated between pairs of numeric variables. Some categorical variables can be converted to numbers (e.g. education) and some cannot. For example, education has an ordering to it: partial high school is less than high school, high school is less than partial college, etc. Therefore, you could convert education to numbers: 1 = partial high school, 2 = high school, 3 = partial college, 4 = bachelors, 5 = graduate, etc. However, that's not a perfect conversion. For example, is the progress required to move from partial high school to high school the same as the progress required to move from bachelors to graduate? Maybe, maybe not. We don't know. However, by converting education to numbers, we are assuming that it must be the same difference because each type of education increases by an integer of 1.

Sometimes it's better to leave an ordered variable, like education, as a categorical variable with no ordering assumed among values. Other times, we have no choice. For example, the "region" and "occupation" fields in the bike buyers data set cannot be ordered at all. If you try to convert North America = 1, Europe = 2, and Pacific = 3, then the statistical algorithm thinks you mean that Europe is 1 unit more "of a region" than North America (because 2 - 1 = 1) and the Pacific is 2 units more "of a region" than North America. That makes no sense. We need a formula that will treat all categorical values as equal, but different from each other.
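One common way to treat categorical values as "equal, but different" is one-hot (dummy) encoding: each category gets its own 0/1 indicator, so no category is numerically "more" than another. A minimal Python sketch using the region values from the passage above:

```python
# Ordinal encoding imposes an ordering AND equal spacing between levels:
ordinal = {"partial high school": 1, "high school": 2,
           "partial college": 3, "bachelors": 4, "graduate": 5}

# One-hot (dummy) encoding treats each category as equal but distinct:
regions = ["North America", "Europe", "Pacific"]

def one_hot(value, categories):
    """Return a 0/1 indicator vector with a 1 in the value's slot."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Europe", regions))  # → [0, 1, 0]
```

With one-hot encoding, "Europe" is simply a different column than "North America"--no region is 1 unit more "of a region" than any other.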

A one-way Analysis of Variance (ANOVA) is a test that will not only measure the relationship between a categorical variable and a numeric variable, but also estimate whether that numeric variable is statistically different across categories.

Follow along with the example in the video below to determine if the bike store customers' income is related to their education.

To review, an ANOVA tells you whether there is a statistically significant difference among all of the categorical values. In the case above, different levels of education are statistically related to variations in income. In other words, more education means more money. You know this because the ANOVA produces an F-statistic with an associated p-value. To keep things simple, we won't get into the details of what an F-statistic is in this class. However, suffice it to say that an F-statistic (labeled "F" in the Excel output) is a measure of the difference in some numerical variable (e.g. income) across all possible values of a categorical variable (e.g. education) that is due to actual differences between the categories and not to just random chance. Larger F-statistics indicate that the differences among categorical variable values are more likely due to true differences among categories. For example, a larger F-statistic in our example above means that there truly is something about getting more education that will lead to more income.
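For the curious, the F-statistic can be computed by hand as the ratio of between-group variation to within-group variation. Here is a Python sketch of a one-way ANOVA F-statistic; the income values (in $000s) for three education levels are made up for illustration, not taken from the Bike Buyers data:

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)                   # number of categories
    n = sum(len(g) for g in groups)   # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical incomes ($000s) for three education levels:
high_school = [30, 35, 32, 38]
college     = [45, 50, 48, 52]
graduate    = [60, 65, 58, 70]
print(round(one_way_anova_f([high_school, college, graduate]), 1))  # → 52.1
```

A large F like this one says the spread between the group means is far bigger than the spread within each group--exactly the pattern you'd expect if education really does relate to income.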

F-statistics have associated p-values, which are the probability of finding that same--or even stronger--value assuming that there is no actual relationship between the variables. In this case, "value" refers to the F-statistic produced by the ANOVA. However, p-values can be generated for Pearson correlation coefficients and many other statistical values as well (e.g. see the regression coefficients you'll generate later in the chapter).

To test your understanding, run another ANOVA to determine whether there is a significant difference in income across occupations. You should get a p-value so small that Excel can't show enough decimal places to get to the non-zero digits (6.33E-195). So what does that mean? It means that there is almost no possibility that the differences found in income across various types of occupations are due to chance. In other words, different occupations have different levels of pay. Wow! Brand new information! But now, at least, you can prove it with statistical support.

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="ud6m4">

TTEST

As you've just learned, ANOVAs are a great way to find out if there is a general difference in a numeric variable across a set of categorical groups. However, the ANOVA results do not tell you whether one particular group is significantly different from another. For example, we know that, in the bike buyers data, there is a significant effect of education on income. However, we don't know, for example, whether the income earned by a partial college consumer is significantly more than that of a consumer with just a high school degree--we only know that there is a general, overall effect. One way to compare specific pairs of groups is by performing a t-test. As with the F-statistic produced by an ANOVA, the t-test also produces a p-value indicating whether the difference between a pair of groups is due to chance (p > 0.05; i.e., partial high school and high school are basically the same) or due to the nature of the difference between the two categories (p < 0.05; i.e., a full high school degree actually makes you smarter--and, therefore, helps you earn more money--than a partial high school degree).
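To see what a t-test actually computes, here is a Python sketch of Welch's two-sample t-statistic, which compares the difference between two group means to the variability within each group. The income values ($000s) for the two education groups are made up for illustration (Excel's T.TEST function goes one step further and converts a statistic like this into a p-value):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t-statistic comparing the means of two groups."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    # Sample variances (divide by n - 1):
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

# Hypothetical incomes ($000s) for two education groups:
high_school     = [30, 35, 32, 38, 33]
partial_college = [41, 46, 44, 50, 45]
print(round(welch_t(partial_college, high_school), 2))  # → 5.8
```

The bigger the t-statistic (in absolute value), the less plausible it is that the two group means differ only by chance.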

See the video example below:

Still struggling with what a p-value is? Watch this video for more info on p-values (but only if you need to):

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="wn4gv">