Overfitting and Dimension Reduction

Now that you've had a taste of how predictive modeling works in Azure ML Studio, let's dive into greater detail on how to select variables for your model. You may have noticed that the prior chapter included no examination of coefficient p-values. Recall that when we created the Excel prediction calculator, we relied on the p-values of linear regression coefficients to determine which variables should be included in our model. Why doesn't Azure ML Studio's linear regression evaluation give us those results? The p-value is falling out of favor among "big data" scientists because it is so strongly affected by sample size: you can find a "statistically significant" p-value for almost any relationship as long as you have enough data. Therefore, in this chapter you'll learn how to reduce your independent variables to a smaller, more relevant set using metrics other than the p-value, which will help you reduce the threat of "overfitting" your models.
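To see concretely why sample size dominates the p-value, consider the following minimal sketch in Python (it assumes NumPy and SciPy are installed and has nothing to do with Azure ML Studio itself). It fits an almost-nonexistent relationship at three sample sizes:

```python
# A tiny simulation: y is almost pure noise, with x explaining only a
# sliver of its variance. Watch what happens to the p-value as n grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = 0.01 * x + rng.normal(size=n)  # a trivially weak relationship
    r, p = stats.pearsonr(x, y)
    print(f"n={n:>9,}  r={r:+.4f}  p={p:.4g}")

# The correlation hovers near 0.01 at every sample size, yet by the
# largest n the p-value falls far below the conventional 0.05 cutoff,
# declaring a practically meaningless relationship "significant."
```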

One of the dangers of having many independent variables is that you may end up "overfitting" a predictive model. Overfitting occurs when a model is excessively complex: it has too many variables measured for the number of cases observed. The negative outcome of overfitting is that your predictions are too "custom-fit" to the cases you have and won't generalize well to the entire population (or to future predictions). So, how do you prevent overfitting? You minimize the number of independent variables in your model (a.k.a. "dimension reduction"). You can accomplish this in two ways: 1) select a smaller set of "orthogonal" (i.e., fully independent) variables out of those available, or 2) transform your existing variables into a smaller set of orthogonal variables. The sketch below illustrates both ideas in code, and we'll cover examples of both in the chapter, beginning with the former.
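Here is a minimal sketch in Python (scikit-learn, pandas, and NumPy are assumed, and the feature table is hypothetical) of what each strategy looks like. Approach 1 keeps only features that are not highly correlated with anything already kept; approach 2 uses principal components analysis, which by construction produces orthogonal components:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: five features, one of which (f5) nearly duplicates f1.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)),
                 columns=["f1", "f2", "f3", "f4", "f5"])
X["f5"] = 0.95 * X["f1"] + rng.normal(scale=0.1, size=200)

# Approach 1: select fewer, nearly orthogonal variables by dropping any
# feature that is highly correlated with one we have already kept.
corr = X.corr().abs()
keep = []
for col in X.columns:
    if all(corr.loc[col, k] < 0.9 for k in keep):
        keep.append(col)
print("kept:", keep)  # f5 is dropped because it tracks f1

# Approach 2: transform all the features into a smaller set of
# orthogonal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print("explained variance:", pca.explained_variance_ratio_.round(2))
```

Note the trade-off: approach 1 keeps your original, interpretable variables, while approach 2 mixes every original variable into each component, which can make the reduced model harder to explain.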

<{http://www.bookeducator.com/Textbook}learningobjectivelink target="o674j">