Introduction and Data Types

From "Descriptive" to "Predictive"

Until now, we have been primarily concerned with describing, explaining, and visualizing what has happened in the past based on recorded data. That is an appropriate first step toward understanding what your data says about your business or organization. Descriptive visualizations (like those you made in Tableau) help you to identify potential explanations for what you are observing in your organization. However, the greatest forms of business value come from the next step in the process: predicting what will happen in the future. As we talked about in the very first chapter, it's time to establish "cause" and "effect." The overall "effect" that we want to explain is our organization's success, performance, and ability to beat competitors. If we understand what "causes" that effect, then we can focus our time, efforts, and resources to maximize those causes in order to maximize that effect. But "success" is a pretty high-level variable; let's break that effect down into something smaller and more "bite-size."

For example, let's assume that we own a newly started bike shop. Our success is best represented by our sales volume, so we want to find out what causes a customer to buy a bike from our store. To do that, you begin by theorizing about those causes. According to Dictionary.com, a theory is "a coherent group of tested general propositions, commonly regarded as correct, that can be used as principles of explanation and prediction for a class of phenomena." Wikipedia offers another definition: "a theory is a contemplative and rational type of abstract or generalizing thinking, or the results of such thinking. Depending on the context, the results might, for example, include generalized explanations of how nature works." But let's substitute a couple of words toward the end:

Theory

Depending on the context, the results of a theory might, for example, include generalized explanations of how organizations operate efficiently and effectively, why employees perform at high and low levels, or why customers make (repeat) purchase decisions.

Therefore, a bike shop owner may theorize that people will buy her or his bikes because they live or work near the bike store, because they have lots of expendable income, because they live a "commutable" distance from their jobs, because they are older or younger, because they have more or less education, or because they have children. Notice that in each of these examples there are cause variables (e.g. address, income, commute distance, age, education, children) that are associated with the effect variable of interest: whether or not a customer will purchase a bike.

However, identifying the variables is only half of the job of a theory. There must also be a logical explanation concerning why those variables will cause bike purchases. For example, if the bike store specializes in road bikes, then commute distance is relevant. If the bikes are expensive, higher-end models, then income is relevant. If the bikes are primarily for enjoyment (rather than exercise), then children is a relevant variable.

So how do you come up with a valid theory to begin with? Well, this is an expertise of academic researchers. They are literally paid to 1) review all of the relevant research results on a topic (often over many decades of work), 2) develop a theory to explain all of those results, 3) collect and analyze new data to confirm their theories, and then 4) clearly explain these theories in a digestible form. They do a pretty darn good job of 1-3, but a pretty lousy job of 4 (full disclosure, I'm an academic researcher). However, you can find their theories and results all in one place at Google Scholar. If you don't want to read through the original academic papers, you can often find books and news articles that are much easier to follow.

Now, let's assume that you, the bike shop owner, have done your research and now you have a theory about what causes (or will cause) customers to buy bikes from you. The next step is to collect and measure data that represents the factors/variables involved. Since this course is not about survey techniques or other data collection, let's assume that you've been able to acquire the data you're looking for and it's stored in the Excel workbook below (go ahead and download and open it).

Once you've opened and examined the data file, notice that not all factors (a.k.a. "variables") are the same. Some are numeric and continuous, like income or age. Others are numeric but ordinal, like "commute distance," meaning that the numbers are broken into ordered "chunks" (e.g. 0-1, 1-2, 2-5, 5-10). Others are ordinal but categorical rather than numeric, like education. In other words, "high school" is greater than "partial high school," but the data is text rather than numeric. Also, the difference between categories is not necessarily the same. For example, can you say for certain that the difference between "partial high school" and "high school" is the same as the difference between "high school" and "partial college" (answer: no)? Lastly, some categorical data has no order whatsoever. We can divide these types into dichotomous categorical (e.g. "gender," "home owner," and "purchased bike") versus nominal, or those with more than two categories (e.g. "region" and "occupation"). One of the most common nominal factors that is not included in this particular data set is consumer ethnicity. As you'll find out below, not all analyses can be performed with every data type.
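To make the point about unequal spacing concrete, here is a minimal Python sketch that computes the width of each commute-distance chunk listed above. The function name `bin_width` and the "low-high" label format are assumptions for illustration; the workbook may store these values differently.

```python
# Ordered commute-distance "chunks" as listed in the text.
# The "low-high" string format is an assumption for this sketch.
commute_bins = ["0-1", "1-2", "2-5", "5-10"]

def bin_width(label):
    """Return the width of a 'low-high' chunk, e.g. '2-5' -> 3.0."""
    low, high = label.split("-")
    return float(high) - float(low)

widths = [bin_width(label) for label in commute_bins]

for label, width in zip(commute_bins, widths):
    print(label, "covers", width, "units")
```

The chunks are in a clear order, but their widths are 1, 1, 3, and 5. That is why an ordinal variable should not automatically be treated as if each step were the same size.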

Some variables can be converted between numeric and categorical. For example, dichotomous variables like "home owner" can be turned into 0 and 1, and ordinal variables like "education" can be turned into integers (e.g. 1 = partial high school, 2 = high school, 3 = partial college, 4 = bachelors, 5 = graduate). However, other variables, like region, cannot. Be aware of these data types as you perform the analyses below. You'll notice in the Excel workbook that you downloaded that we've created multiple worksheets. The "Original" worksheet contains the data as it was originally collected. The "Numeric" worksheet contains that same data along with newly created columns that convert as many of the categorical fields (those containing text) to numeric fields as possible.
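The conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the "Numeric" worksheet was actually built. The sample rows, the "Yes"/"No" home owner values, the region values, and the new column names are all made up for the example; only the education coding (1 through 5) comes from the text.

```python
# Hypothetical rows standing in for the "Original" worksheet (values are made up).
original_rows = [
    {"home owner": "Yes", "education": "Bachelors", "region": "Europe"},
    {"home owner": "No", "education": "Partial High School", "region": "Pacific"},
]

# Ordinal coding for education, as given in the text.
EDUCATION_CODE = {
    "partial high school": 1,
    "high school": 2,
    "partial college": 3,
    "bachelors": 4,
    "graduate": 5,
}

# Dichotomous coding; the "Yes"/"No" labels are an assumption.
HOME_OWNER_CODE = {"no": 0, "yes": 1}

def to_numeric(row):
    """Copy a row and add numeric versions of the convertible fields."""
    numeric = dict(row)
    numeric["home owner (numeric)"] = HOME_OWNER_CODE[row["home owner"].strip().lower()]
    numeric["education (numeric)"] = EDUCATION_CODE[row["education"].strip().lower()]
    # "region" is nominal with no natural order, so it is left as text.
    return numeric

numeric_rows = [to_numeric(row) for row in original_rows]

for row in numeric_rows:
    print(row)
```

Notice that the original text columns are kept and the numeric versions are added alongside them, which mirrors the idea of having both an "Original" and a "Numeric" view of the same data.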

Before we can jump right into analyzing the effect of each variable on whether or not a customer purchased a bike, we need to understand the data and the "bivariate" relationships between each pair of variables. This will help guide our analyses more efficiently.
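To get a feel for how many "bivariate" relationships that means, here is a small Python sketch that lists every pair of variables. The variable names are taken from the examples in this chapter; the actual column headers in the workbook are an assumption and may differ.

```python
from itertools import combinations

# Variable names drawn from the examples in this chapter (assumed, not the
# workbook's exact column headers).
variables = [
    "income", "age", "commute distance", "education", "gender",
    "home owner", "region", "occupation", "children", "purchased bike",
]

# Every unordered pair of two different variables.
pairs = list(combinations(variables, 2))

# The pairs that include the effect variable we care about most.
pairs_with_outcome = [pair for pair in pairs if "purchased bike" in pair]

print(len(pairs), "pairs in total")
print(len(pairs_with_outcome), "pairs involve 'purchased bike'")
```

With ten variables there are 45 possible pairs, 9 of which involve "purchased bike." The number of pairs grows quickly as variables are added, which is one reason it helps to explore these relationships systematically.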
