7.4 Step-By-Step Walkthrough
Okay, that was a lot to take in. Let's help you summarize everything into a set of "easy" steps. Keep in mind that these are the steps specifically for creating a prediction calculator in Excel--not a comprehensive summary of generic data mining steps for every type of tool. Other tools will allow for more advanced options for handling missing data and other issues. But we'll get to that in later chapters. This list below is a nice introduction, but you will learn to integrate other useful steps throughout the process as your skills develop.
-
Inspect the data
- What fields are there? What do they all mean?
- Are there binary (i.e. only two values) fields that are categorical that can be converted to a 0/1 scale?
- Are there categorical fields that can need to be turned into "dummy codes?" However, if there is a categorical variable that has greater than 16 values (e.g. "State" has 50 values), you will have to ignore that variable because Excel can't handle more than 16 variables.
- Are there missing data fields?
Are there cagetorical fields that can be converted to ordinal fields? (e.g. 1, 2, 3, 4, etc) -
Clean and prepare the data
- Make the changes to the data you planned
- Be sure to keep the original field values in case you mis-translate
- Cut-and-paste to move numeric fields next to each other so that they can be analyzed using the Excel Regression tool. Knowing that you can only analyze 16 independent variables at a time in Excel, you may want to group variables into "logical" sets. For example, if you turn a field into 10 dummy codes, keep those together.
-
Test for assumptions
- Multi-collinearity: You will need to calculate the variance inflation factor (VIF) for every variable and remove those that are too high (e.g. > 5.0). However, since you don't yet know which variables are going to be significant, you may want to save this step until you have decided on your final model.
- Normality: 1) Calculate skewness and kurtosis for all numeric variables, 2) Identify any values greater than 3.0 or less than -3.0; these are mildly concerning, 3) Identify any values greater than 5.0 or less than -5.0; these are very concerning. In your final calculator, add a note that these variables should be interpreted with caution if fall outside of normal ranges. What are the normal ranges? Calculate a 90% confidence intervale to identify the normal range.
- Linearity: 1) Import the new dataset (with newly created numeric fields) into Tableau, 2) For each continuous numeric variable (i.e. ignore binary variables), create a scatterplot of that variable against the dependent variable. In other words place the variable on the "columns" (i.e. X axis) and the dependent variable on the "rows" (i.e. Y axis). Next, add a linear, logarithm, and polynomial trendline to the scatterplot (don't forget to "de"-aggregate the measures (i.e. Tableau -> Analysis -> Aggregate Measures). 3) If either the logarithmic or polynomial trendlines are significant (i.e. p-value less than 0.05) and have an R squared value greater than the linear trendline, then create a new field(s) in the Excel data file for that non-linear transformation (see video above for example).
-
Test regression models
- Identify 16 variables to begin with. For the most part, this can be any 16 variables*. However, if you have categorical variables that have been broken down into dummy codes, make sure to keep them together in a single analysis. However, you can group multiple sets of dummy codes into a single analysis under 16 variables. For example, Region: US, Pacific, Europe gives you three variables. You can group that with Occupation: Management, Clerical, Professional, etc. as long as the total number of variables is under 16. *This is not ideal. However, it's the best that Excel will allow you to do.
- Run a regression analysis on those 16 variables. Keep track of the R squared value. Identify the two variables with the highest p-values (i.e. the two least significant values). Run another regresssion analysis without those two variables and include two new variables (as long as they are not part of a larger set of dummy codes) if you have some. Repeat this process until only significant variables are left. In other words, until only those with p-values below 0.05 are left. DO NOT worry about the coefficient values; only the p-values.
- Other notes: As you iterate through this process, your R squared value should increase as you add new variables and decrease when you remove variables--even when they are not significant. However, it will increase more when you include variables with significant p-values and decrease less when you remove variables with insignificant p-values. Therefore, your final group of variables is going to be those which give you the highest possible R squared value without including any variables that are not significant. Also, do not forget to test out any new variables you created to handle non-linear transformations (e.g. age^2, log(income)). They will not always be significant. However, you have to try them out if you got significant p-values when you tested them alone in Tableau.
-
Build the calculator interface in Excel (see video above for example)
- Create a list of the variables included in your "best" and final regression model down a column on a new sheet.
- The cell to the right of each variable name will act as the form field for the user to enter a potential value into.
- For each numeric variable (including binary variables like Gender, but NOT including dummy codes for categorical variables with 3 or more values), multiply the entered form value by the coefficient for that variable produced in your favorite regression model
- For each categorical variable, build a table to use for a vlookup function. The first column in the table will be each of the possible values represented (e.g. Region: US, Pacific, Europe). The second column in the table will have the coefficient for each of the dummy code fields you included in your model. If the data entered into the field doesn't fit into any of those category values, but you want to allow other values, then add a value "Other" with a coefficient of 0 in your table.
- At the bottom of your form fields, place a formula to total up each of the entered values multiplied by their associated coefficients In addition, add the intercept value from your favorite regression model into this "SUMPRODUCT" of coefficients * values. This is the calculator's "prediction." If you converted an ordinal dependent variable into a number in order to run this regression analysis (e.g. Education: 1 = partial high school, 2 = high school, etc), then make one more vlookup table that includes each of the possible values of the dependent variable. Your prediction formula will include a vlookup to this table to return the categorical value of the prediction.