Linear Regression

The analyses we have performed so far only estimate the relationship between pairs of variables. [Side note: even though there are several categories of education, we were still only examining two variables: education and income] However, that is only a start. Variables don't exist "in a vacuum", or in isolation from other variables. There are typically MANY variables involved in explaining a phenomenon. In other words, we need to move beyond bivariate statistics to multivariate statistics.

As we move beyond pairs of variables, it's also time to designate variables as dependent (y) versus independent (x). In particular, we want to designate one (and only one) variable as the dependent (a.k.a. "y") variable--the variable that we want to explain or predict. Next, we need to designate one or more variables as the independent (a.k.a. "x") variable(s)--the variables we will use to explain or predict the dependent variable.

For example, the dependent Y variable is typically something that is valuable to predict, like whether or not someone will purchase one of our bikes (see the variable PurchaseBike: 0 = no, 1 = yes in the data set provided). There will only be one dependent variable in our examples. The independent X variable(s) are those which would theoretically predict the dependent variable. That is the entire purpose of all of the other variables in the bike buyers data set. For example, people with more income are more likely to purchase bikes. People who live closer to work are more likely to purchase bikes (for commuting). Rather than examine how one independent variable at a time relates to the dependent variable (e.g. a Pearson correlation coefficient), we want to know the combined effect of all independent variables together on the dependent variable. To accomplish this, we need to move beyond the Pearson correlation coefficient (r) to the coefficient of determination (R²).

Coefficient of Determination

You may have noticed in the scatter plots above that we also calculated a statistic called R² in addition to r. This is called the coefficient of determination, which is a key output of fitting a line of best fit to a scatter plot. It is interpreted as the proportion of the variance in the dependent (y) variable that is predictable from the independent (x) variable(s). Importantly, R² can be calculated not only for a single pair of variables, but also between a Y variable and an entire set of X variables.
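To make the relationship between r and R² concrete, here is a small Python sketch. The data values are made-up stand-ins for Education (as ordinal codes) and PurchaseBike (0/1), not the actual Bike Buyers data set; the point is simply that squaring Pearson's r yields the proportion of variance explained.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical stand-ins for Education (ordinal codes) and PurchaseBike (0/1)
education = [1, 2, 2, 3, 3, 4, 4, 5]
purchase  = [0, 0, 1, 0, 1, 0, 1, 1]

r = pearson_r(education, purchase)
r_squared = r ** 2   # proportion of variance in purchase explained by education

# The chapter's example: r = .14 implies R-squared = .14 squared, about 0.02 (2%)
print(round(0.14 ** 2, 4))   # 0.0196
```

Note that squaring discards the sign of r, which matches the idea above: R² ignores whether the relationship is positive or negative and keeps only the amount of overlap.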

For example, let's conceptualize the correlation coefficient as the amount of variance in one variable that overlaps with another variable. However, we want to ignore whether the relationship is positive or negative. Rather, we simply want to know how much of a Y variable can be explained, or predicted, by an X variable. See the diagram below:

If we square the correlation coefficient r = .14 for that relationship (see the correlation table created earlier), we get an R² value of about 0.02. In other words, 2% of the variance in PurchaseBike can be explained by variance in Education, which is represented (although a bit exaggerated) by the overlap between the two circles above. However, we have collected many variables which might overlap with, or explain, PurchaseBike. We have also included CommuteDistance in the figure below:

There are two things to learn from this image. First, R² is a representation, not only of the effect of a single X variable on a Y variable, but also the total summed overlap of all X variables on a Y variable. Imagine adding a circle to that diagram for every factor we have measured in the Bike Buyers data set. The total overlap of all variables with PurchaseBike is the R² value we are interested in.

Second, notice that CommuteDistance is correlated not only with PurchaseBike, but also with the other independent variable, Education. In addition, part of the overlap between Education and PurchaseBike is also overlapped with CommuteDistance. So is the true relationship between Education and PurchaseBike best represented by the correlation coefficient between those two variables? No. It's better to analyze the effects of a set of X variables at once in order to see what individual effect each independent variable has on a dependent variable after removing, or "controlling for," the effects of all other variables.

In the figure below, the true effect of Education is represented by only the portion that doesn't overlap with all other independent variables. So how do we measure just that portion that is due only to Education? That is one of the purposes of multiple regression.
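The idea of "controlling for" overlapping predictors can be illustrated with a small two-predictor regression solved by hand via the normal equations. The numbers below are made up (not the Bike Buyers data set); the point is that the slope for x1 shrinks once the correlated predictor x2 is included, because x1 no longer gets credit for the variance it shares with x2.

```python
# Made-up data: x1 and x2 are correlated, and both relate to y
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, 1, 2, 2, 3, 3]
y  = [2, 3, 5, 6, 8, 9]

n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

# Centered sums of squares and cross-products
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
sy1 = sum((c - my) * (a - m1) for c, a in zip(y, x1))
sy2 = sum((c - my) * (b - m2) for c, b in zip(y, x2))

# Two-predictor least-squares slopes from the normal equations
det = s11 * s22 - s12 ** 2
b1 = (sy1 * s22 - sy2 * s12) / det   # effect of x1, controlling for x2
b2 = (sy2 * s11 - sy1 * s12) / det   # effect of x2, controlling for x1

# Bivariate slope of x1 alone (no control) -- larger, because x1
# absorbs the variance it shares with x2
b1_alone = sy1 / s11

print(b1, b1_alone)   # 1.0 vs. roughly 1.457
```

The drop from about 1.457 to 1.0 is exactly the "overlap" portion in the circle diagram: the part of x1's apparent effect that actually belongs to the shared variance with x2.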

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="ch2g0">

Linear Regression

Regression is a powerful statistical analysis that allows you to measure the relationship between a dependent (output) variable and, not just one, but a set of independent (input) variables. As a result, the effect of each independent variable is estimated while controlling for the effects of the other independent variables.

In linear regression, data are modeled using linear predictor functions (think of drawing a straight, or "linear," line through the data: as one variable goes up, the other goes up or down according to the linear equation Y = mx + b). This allows unknown model parameters to be estimated from the data. As a result, multiple linear regression is a great first step toward predicting unknown future "Y" values based on a set of known existing "X" values. Using the previous data set, follow along with the video tutorial to see how a basic prediction calculator can be produced in Excel.
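As a rough sketch of what such a prediction calculator does under the hood (using made-up income figures rather than the chapter's data set), here is the Y = mx + b fit and a prediction for a new X value in Python:

```python
# Hypothetical data: annual income (in $1000s) vs. bike purchase (0/1)
income   = [20, 30, 40, 50, 60, 70]
purchase = [0, 0, 0, 1, 1, 1]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(purchase) / n

# Least-squares slope (m) and intercept (b) for Y = m*X + b
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(income, purchase))
     / sum((x - mean_x) ** 2 for x in income))
b = mean_y - m * mean_x

def predict(x):
    """The 'prediction calculator': plug a new X into the fitted line."""
    return m * x + b

print(round(predict(55), 3))   # predicted purchase likelihood at $55k: 0.757
```

Excel's line of best fit uses the same least-squares formulas; the spreadsheet version simply replaces these sums with worksheet functions.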

As you can probably tell if you followed along with the video above, multiple regression-based prediction calculators are somewhat complex, but also EXTREMELY powerful tools. They are rarely used in practice by the "average employee," simply because most people don't understand them or don't realize how easy they are to create in Excel.

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="mk39f">

Including Categorical Variables

You may be wondering by now why we have variables in the Bike Buyers data set like "Region" and "Occupation" when we can't analyze them in a regression model. Well, guess what? We can. We just have to make some modifications to them. Let's also re-analyze "Education," treating it as a categorical variable, and see if we get any better results than when we converted it to an ordinal variable (i.e. partial high school = 1, high school = 2, etc.). Watch the video below to learn how to create dummy codes to analyze categorical variables in a regression model:

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="ag3mn">
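As a rough sketch of the dummy-coding idea from the video (using hypothetical region labels, and plain Python rather than Excel): each category becomes its own 0/1 column, and one category is left out as the baseline so the columns are not perfectly redundant (the "dummy variable trap").

```python
# Hypothetical Region values for five customers (not the actual data set)
regions = ["North America", "Europe", "Pacific", "Europe", "North America"]

levels = sorted(set(regions))     # ['Europe', 'North America', 'Pacific']
baseline = levels[0]              # one level is omitted as the reference group

# One 0/1 dummy column per non-baseline level
dummies = {
    level: [1 if r == level else 0 for r in regions]
    for level in levels[1:]
}

print(dummies["North America"])   # [1, 0, 0, 0, 1]
print(dummies["Pacific"])         # [0, 0, 1, 0, 0]
# A customer whose dummies are all zero belongs to the baseline ('Europe')
```

Each dummy column can then enter the regression like any numeric X variable, and its coefficient is interpreted relative to the omitted baseline category.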