Assumptions of Linear Regression

Before you go off and apply linear regression to solve all of life's problems, you should know that there are certain assumptions that must be met before you can "linearly regress." This is very important. If your data doesn't meet these assumptions, then you can end up making invalid predictions. Checking these assumptions also helps you identify the boundaries where your predictions will be strong versus weak (i.e., what range of values are valid?). We will briefly explain each assumption here and show you how to check for it below:

  • Linear relationship

  • Multivariate normality

  • No multi-collinearity

  • No auto-correlation

  • Homoscedasticity

Linear Relationship

First, there must be a linear relationship in the data to detect. Remember earlier when we talked about "lines of best fit" that are not so straight? Well, as the name implies, "linear" regression is based on drawing a straight line through the data, but based on many variables rather than just the one shown in the scatterplot tutorial above. What if the spread of "dots" in a scatterplot does not appear to follow a straight line? That is a problem. We'll show you how to fix it using either Excel or Tableau. To see the Excel solution, follow along with this video:

To see how you can use Tableau to automatically find the best polynomial trendline (and fix it in Excel), follow along with these videos:
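If you'd rather check for a non-linear pattern programmatically instead of in Excel or Tableau, here is a minimal sketch of the same idea: fit a straight trendline and a polynomial trendline and compare their R² values. The data below is synthetic and deliberately curved; swap in your own x/y columns.

```python
# A sketch (not the Excel/Tableau workflow from the videos) of comparing a
# straight trendline to a degree-2 polynomial trendline on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2 * x**2 + rng.normal(scale=5, size=x.size)   # deliberately curved data

def r_squared(y_actual, y_predicted):
    ss_res = np.sum((y_actual - y_predicted) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)   # straight line
poly_fit = np.polyval(np.polyfit(x, y, deg=2), x)     # degree-2 polynomial

print("R^2, straight line:", round(r_squared(y, linear_fit), 3))
print("R^2, polynomial   :", round(r_squared(y, poly_fit), 3))
```

If the polynomial R² is dramatically higher than the straight-line R², that is a strong hint the linearity assumption is being violated and a transformation is in order.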

Multivariate Normality

Second, the assumption that all variables are "normal" can be a bit dicey. To detect normality, first plot the data in a histogram to see if the distribution falls along a "normal" looking bell curve. See the image below:

Notice how the distribution falls nicely along a perfect bell curve. However, not all data will be normally distributed. Sometimes it suffers from "skewness" and/or "kurtosis" problems. The image below visually depicts skewness and kurtosis problems:

So how do you know for sure if your data is suffering from these problems? There's an Excel function for that. Follow along with the video below:
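If you want a quick check outside of Excel, here is a small sketch using SciPy on a hypothetical, right-skewed "income" column (the data below is synthetic). As a rough rule of thumb, skewness or excess kurtosis far from 0 suggests the variable is not normally distributed.

```python
# Sketch: measure skewness and excess kurtosis of a synthetic income column.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.6, size=1000)   # right-skewed stand-in

print("skewness       :", round(stats.skew(income), 2))
print("excess kurtosis:", round(stats.kurtosis(income), 2))  # roughly 0 for a normal curve
```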

Multi-Collinearity

The next issue we need to tackle is multi-collinearity. This means that two or more of the independent variables in the regression model are too highly correlated with each other. When variables are too highly correlated, it makes their coefficients less "interpretable" and you can lose the true statistical significance of a variable. There are several methods for testing for multi-collinearity. Calculating the Variance Inflation Factor (VIF) is the best technique. Follow along with the video below to calculate VIF in the bike buyers data.

So, what is an acceptable level of VIF? There are many research papers that will tell you different things. A very conservative guideline flags anything above 3.0. However, others will say that up to 5.0 is okay, and some have even said that anything under 10.0 is acceptable. I recommend making a note of any variable with a VIF between 3 and 5 and removing any variable that is over 5.
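As an alternative to the Excel workflow in the video, here is a minimal sketch of the VIF calculation using statsmodels. The column names ("Income", "Age", "Cars") and the data are hypothetical stand-ins; "Cars" is built to be correlated with "Income" on purpose so you can see a high VIF.

```python
# Sketch: Variance Inflation Factor for each predictor via statsmodels.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
income = rng.normal(60000, 15000, 500)
age = rng.normal(40, 10, 500)
cars = 0.00004 * income + rng.normal(0, 0.5, 500)   # correlated with income on purpose

X = add_constant(pd.DataFrame({"Income": income, "Age": age, "Cars": cars}))

# Skip the constant column; report VIF for each real predictor.
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(f"{name:7s} VIF = {variance_inflation_factor(X.values, i):.2f}")
```

Using the thresholds above, you would note any predictor whose VIF lands between 3 and 5 and drop any predictor over 5.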

Autocorrelation

Next, linear regression analysis requires that there is little or no autocorrelation (a.k.a. "serial correlation") in the data. Autocorrelation occurs when the residuals are not independent from each other. This typically occurs in stock prices, where each price is not independent from the previous price, yet both prices are included in a model. In other words, this would have occurred if one of our independent variables was a "later" version of another independent variable. For example, we already have each customer's current income as a variable. What if we also collected their historical income from the date they graduated from college because we believe that the increase in income from graduation until now is an indicator of whether or not they'd purchase a bike? Would that matter? I have no idea. But if we included both fields (current income and historical income), then we would be violating the assumption of no autocorrelation. To solve this, either 1) remove one of the fields, or 2) calculate a difference or percent increase that represents both variables and include only that one variable in the regression model. We haven't violated this assumption in the existing bike buyers data, so let's move on.
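If you ever want to double-check your own data, one common programmatic check (not covered in the videos above) is the Durbin-Watson statistic on the residuals: values near 2 suggest little or no autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation. The variables below are synthetic, hypothetical stand-ins for your model's columns.

```python
# Sketch: fit a simple OLS model and compute the Durbin-Watson statistic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
income = rng.normal(60000, 15000, 500)
age = rng.normal(40, 10, 500)
spend = 0.01 * income + 20 * age + rng.normal(0, 200, 500)

X = sm.add_constant(np.column_stack([income, age]))
model = sm.OLS(spend, X).fit()

print("Durbin-Watson:", round(durbin_watson(model.resid), 2))
```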

Homoscedasticity

Finally, the data should exhibit homoscedasticity, which means the variance of the residuals is the same along the entire regression line. Heteroscedasticity occurs when it is not. See the example below:

So looking at a scatterplot of the data is helpful, but how do we know for sure if there is heteroscedasticity present (i.e., that we don't meet the assumption of homoscedasticity)? Follow along with the video below that demonstrates the Breusch-Pagan Test and the Abridged White's Test:
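For a programmatic version of the Breusch-Pagan test (as an alternative to the Excel workflow in the video), here is a minimal sketch using statsmodels. The data is synthetic and the variable names are hypothetical; the noise is built to grow with income so the test has something to find. A small p-value (say, below 0.05) suggests heteroscedasticity is present.

```python
# Sketch: Breusch-Pagan test on the residuals of a simple OLS model.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
income = rng.normal(60000, 15000, 500)
# Make the noise grow with income so the variance is NOT constant.
spend = 0.01 * income + rng.normal(0, 0.004 * income)

X = sm.add_constant(income)
model = sm.OLS(spend, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", round(lm_pvalue, 4))
```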

So let's say you have heteroscedasticity. What do you do next? The nice thing is, some of the fixes we took care of previously (e.g., for non-linear relationships) will also fix your heteroscedasticity problem (e.g., a logarithmic transformation). So fix those issues first and test again. We don't have room for a complete discussion of all possible fixes here, but you can find a lot of good info online (https://en.wikipedia.org/wiki/Heteroscedasticity#Fixes).
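Here is a small sketch of that log-transformation idea on a synthetic, hypothetical "income" column: transform the skewed variable, confirm the skewness drops, and then re-run the Breusch-Pagan test above on the refit model. Whether it actually cures the heteroscedasticity depends on your data.

```python
# Sketch: log-transform a right-skewed variable and compare skewness.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1000)

print("skewness before log:", round(stats.skew(income), 2))
print("skewness after  log:", round(stats.skew(np.log1p(income)), 2))
```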
