Descriptive versus Predictive

Typically, the term data mining broadly encompasses the variety of techniques referred to at the top of the BI stack (presentation to user). However, there are important differences among the tools and techniques listed at the top, and each has its own capabilities. What is "BI" to one person may not be "BI" to another. So, let's try to organize those tools and techniques by classifying them as being either descriptive or predictive. Descriptive data mining refers to the set of tools and procedures designed to analyze data in ways that describe the past and immediate present state of the business processes that produce the data. Predictive data mining refers to the set of tools and procedures designed to predict, based on historical data, the most likely future outcomes, including performance, states, preferences, and much more.

Descriptive Tools and Techniques

Descriptive tools and techniques comprise most of those that people refer to when they speak of "data mining" or "BI". The idea here is to take large amounts of data (hence the term "big data") and summarize it into either a set of key performance indicators (KPIs) or ad-hoc measures.

Key performance indicators are pre-planned measures that have been carefully determined to indicate the organization's performance on a particular business process. The reason for the descriptor "key" is that these performance indicators have been determined to be those which are most crucial to a business's success. Because KPIs are so important, management typically wants to see them in real time and have them available for review at any time. Therefore, KPIs are summarized into a user interface known as a dashboard. Good dashboards (i.e., more expensive dashboards) also give managers tools for exploring the KPIs in more detail and breaking them down into more specific measures. See the dashboard examples below and try to determine the KPI(s) they are depicting.

The dashboard above (found at http://www.brainsins.com/en/blog/dashboard-for-e-commerce/2147) is likely used by a sales manager who is tracking the performance of their product line. There are several key components. First, notice the most important KPI to a sales manager: revenue. Revenue is even depicted in an "automobile-like" dashboard form (which is where the term "dashboard" came from) in that it looks like a speedometer. Second, notice the OLAP cube. OLAP stands for "online analytical processing." OLAP cube is a common term used to describe the data table in the lower-left corner of the image above. Typically, the user can control the cube and change the dimensions that the data is summarized into. You'll learn to do something very similar (but on a much smaller scale) when you create Excel Pivot Tables in the next chapter. Third, notice that there is a prediction of future sales in the upper-right corner. Although we are currently talking about descriptive tools and analyses, this chart of future sales is an example of a more advanced type of predictive analysis that will be discussed in the next section. However, as you can see, dashboards can include both descriptive and limited predictive elements. Yet, their primary purpose is to focus on KPIs like the revenue visualization in the top center of the image above.

The image above depicts an Oracle-based dashboard for a call center manager (found at https://github.com/keen/dashboards). In this dashboard, the OLAP cube is replaced by an OLAP chart: the pie chart in the upper-right corner that divides the calls into their region of origin. Notice the two KPIs in the lower-right corner: current call center usage and the 30-day average. Next, notice that there is a depiction of call center usage over time in the upper-left corner. Time-series analyses (e.g., showing trends in financial data over time) are a common element of most dashboards. Lastly, notice the alerts in the lower-left corner. Alerts are very useful automated messages that notify users when KPIs reach important levels. The reason for these alerts is that when KPIs reach critical levels, some type of action is required in order to smooth out business operations or maximize their efficiency.

This third (and last) dashboard example above is one that would be used by a website manager. What is the role of the website manager? It is partly to ensure that the company website is being used as much as possible in ways that will help the company. In this example, the dashboard indicates the number of page-views over time, as well as several OLAP charts that divide page-views by region, device, browser, and possibly referring website. In summary, dashboards are a popular way to summarize a series of KPIs, data descriptions, and even data predictions in a single user interface. Dashboard design is an important area of research and a profitable business for skilled user-interface designers because dashboards have such a strong effect on the successful consumption of big data by business users.
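The roll-ups and alerts described in the dashboard examples above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call records, the `roll_up` and `check_alert` helpers, and the threshold of 450 are all hypothetical and do not come from any of the dashboards shown.

```python
from collections import defaultdict

# Hypothetical call records: (region, device, calls). All values are made up.
records = [
    ("East", "mobile", 120),
    ("East", "desktop", 80),
    ("West", "mobile", 200),
    ("West", "desktop", 50),
    ("North", "mobile", 30),
]
DIMENSIONS = {"region": 0, "device": 1}


def roll_up(rows, dimension):
    """Total the calls measure for each value of the chosen dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[DIMENSIONS[dimension]]] += row[2]
    return dict(totals)


def check_alert(name, value, threshold):
    """Return an alert message when a KPI reaches its threshold, else None."""
    if value >= threshold:
        return f"ALERT: {name} is {value}, at or above {threshold}"
    return None


by_region = roll_up(records, "region")  # same data, summarized by region
by_device = roll_up(records, "device")  # same data, summarized by device
total_calls = sum(by_region.values())   # a single KPI for the whole table
alert = check_alert("total calls", total_calls, threshold=450)
```

Changing the `dimension` argument is the small-scale equivalent of a user changing the dimensions of an OLAP cube: the underlying records stay the same and only the summary changes.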

Often, managers will face new problems or strategy sessions where they need to create new data analyses and ad-hoc measures (those created "on the fly" to help in decision making; typically associated with an ad-hoc database query). This is a primary reason that Microsoft Excel is such a mainstream tool at every level in an organization. It has many features for data cleaning and analysis on the fly. Excel provides a nice platform for downloading relatively small amounts of data and creating (or experimenting with) new measures. Often, KPIs are created in Excel before they become true KPIs and are shifted into a more permanent online, web-based dashboard. Later, you will learn several very useful Excel features for data analysis: PivotTables (descriptive), Solver (predictive), and the Analysis ToolPak (both descriptive and predictive features).

Predictive Tools and Techniques

As it turns out, it's not all that difficult to calculate measures of our past performance. However, what if we could predict the future? Then, we could outsmart the competition by making the products that consumers really want (perhaps before they know they want them), recognizing viruses before they become "known" viruses, knowing if a consumer is going to buy our product before we give them the sales pitch, and much more. This is the purpose of predictive tools and techniques. The graphic below (from www.slideteam.net) depicts the conceptual relationship between the value of various forms of analyses and the complexity of performing those analyses. In summary, the analyses progress from basic database querying and reporting (descriptive), to analyses of those results using OLAP cubes and visualizations, to more complex dashboards, and finally to predictive analyses, which help us "prescribe" future business actions.

Predictive data analysis requires relatively complex statistical formulas that use historical data to make predictions about the relationships between sets of variables. There are many forms of, and formulas for, these analyses. We'll review the main ones here. However, you will learn a couple of them in greater detail in later chapters.

Detecting Categories

Detecting categories (a.k.a. "cluster analysis") is the process of "clustering" records (remember records = rows = instances) in a database into groups of related records. For example, consider the number set: 1,1,1,1,2,2,5,5,5,5,6,6,6,6,9,9,9,9,10. How many clusters of related numbers are there? Hopefully, you said three. Let's assume those numbers refer to the number of products that each of a set of customers purchased during their last visit to our website. What if we have more data on each customer like the number of days since they made those purchases? That would require us to plot the values of those two variables for each customer like the image below:

What if we have three variables to cluster? See image below:

What if we have 35 variables for each customer? This is quite possible. However, there's no way to visualize clusters based on 35 dimensions. Yet, statistical algorithms can represent 35 dimensions in your computer's memory and summarize your customers into anywhere from two to many basic clusters. Clustering analyses will not only tell you how many clusters were found, but also the primary characteristics (attribute:value pairs) of each cluster. This will allow you to group your customers into segments and create unique strategies for each segment.
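To make the idea concrete, here is a minimal sketch of one well-known clustering procedure, k-means, applied to the number set from earlier in this section. It is not necessarily the algorithm a given BI tool uses. The function name `k_means`, the choice of three clusters, and the starting centers are assumptions made for illustration; note that basic k-means must be told how many clusters to look for.

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to the
    mean of its members; repeat until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Products purchased per customer (the number set from the text).
purchases = [1, 1, 1, 1, 2, 2, 5, 5, 5, 5, 6, 6, 6, 6, 9, 9, 9, 9, 10]
points = [(float(n),) for n in purchases]  # one variable per customer

# Assumed starting centers: the smallest, a middle, and the largest value.
centers, clusters = k_means(points, centers=[(1.0,), (5.0,), (10.0,)])
sizes = [len(c) for c in clusters]
```

Because each point is a tuple, the same function works unchanged when every customer has 2, 3, or 35 variables; only the length of the tuples changes.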

Analyzing Key Influencers

Analyzing key influencers, or "key influencer analysis," is one of the most common techniques and involves measuring the correlation (i.e., the predictive ability) of a set of independent (a.k.a. "x") variables on typically one dependent (a.k.a. "y") variable. This is a great way to find out, for example, what characteristics of potential customers are related to their level of repeat purchases in your store. For example, as customers have more education, they may be more likely to make repeat purchases (represented by the line of best fit through the scatterplot below). However, it is very important to be wary of assuming causality even though you may find statistically significant results. For example, it may not be a customer's education that causes them to make purchases; rather, their education may have led to greater income, which in turn led to purchases.

There are several statistical formulas used to make these kinds of predictions. The image above depicts a regression analysis. Other formulas include Naive Bayes, Decision Trees, Neural Networks, and more.
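As a rough illustration of the regression analysis mentioned above, the sketch below fits a line of best fit by ordinary least squares and computes the correlation for one x variable and one y variable. The data are hypothetical (invented for this example), and `fit_line` is an illustrative helper, not part of any BI product.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares slope, intercept, and Pearson correlation for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical customers: years of education (x) and repeat purchases per year (y).
education_years = [10, 12, 14, 16, 18]
repeat_purchases = [1, 2, 2, 4, 5]

slope, intercept, r = fit_line(education_years, repeat_purchases)
# Predicted repeat purchases for a customer with 15 years of education.
predicted = slope * 15 + intercept
```

A positive slope and a correlation close to 1 indicate that the two variables move together in this sample. As the text cautions, that alone says nothing about whether education causes the purchases.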

Forecasting

Forecasting is the process of predicting future values over interval time periods based on known, measured values of the same interval periods. As a result, forecasting always has a standard time period and is charted over time. Sales revenues, profit, costs, and market demand are among the most common measures forecasted over time (i.e., time-series). The ARMA (autoregressive moving average) and ARIMA (autoregressive integrated moving average) formulas are among the most common statistical formulas used for forecasting.
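A full ARMA or ARIMA model is beyond a short example, but its autoregressive part can be sketched on its own. The code below fits the simplest such model, in which each period's value is a constant plus a multiple of the previous period's value, and then projects it forward. The sales figures are hypothetical, and `fit_ar1` and `forecast` are illustrative names; in practice a statistical package would be used instead.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = const + phi * y[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    const = mean_curr - phi * mean_prev
    return const, phi


def forecast(series, steps):
    """Project the fitted model forward, one interval at a time."""
    const, phi = fit_ar1(series)
    last = series[-1]
    projected = []
    for _ in range(steps):
        last = const + phi * last
        projected.append(last)
    return projected


# Hypothetical monthly sales revenue, in $1,000s, at a fixed monthly interval.
monthly_sales = [100.0, 110.0, 119.0, 127.1, 134.39]
next_two_months = forecast(monthly_sales, steps=2)
```

Each forecast feeds the next one, which is why the measurements must share the same standard time period.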

Market Basket Analysis

Market basket analysis is a popular analysis for predicting consumer shopping patterns. In particular, it involves examining the products that have been grouped in the past by consumers (e.g., in a "shopping basket") and using that information to predict related items that each customer may want to purchase based on the shopping baskets of other customers who bought similar products (see image below). The statistical technique used to perform market basket analysis is called "association analysis." If you've ever visited amazon.com, then you've seen market basket analysis as new products are always suggested based on the product you are viewing.
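One common measure in association analysis is confidence: of the baskets that contain item A, what share also contain item B? The sketch below computes it for every ordered pair of items and uses it to rank suggestions. The baskets are hypothetical, and `confidence` and `recommend` are illustrative helpers; this is not a description of how any particular retailer generates its suggestions.

```python
from collections import Counter
from itertools import permutations

# Hypothetical past shopping baskets, one set of items per visit.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# How many baskets contain each item, and each ordered pair of items.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in permutations(sorted(basket), 2)
)


def confidence(antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    return pair_counts[(antecedent, consequent)] / item_counts[antecedent]


def recommend(item):
    """Other items ranked by confidence given `item`, highest first."""
    scored = [
        (other, confidence(item, other)) for other in item_counts if other != item
    ]
    scored = [(other, score) for other, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


suggestions = recommend("butter")
```

In this sample every basket with butter also contains bread, so bread is the first suggestion for a customer viewing butter. The reverse rule is weaker, because one of the four bread baskets has no butter.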

Other Tools

The statistical formulas used in the analyses above can also be used to improve other important steps in data analysis. For example, if you have missing consumer data in your analytical database, many statistical formulas will automatically ignore all of that consumer's data because they require complete data to work at all. Let's say that only 8 percent of your customers have completed their entire online profile. It would be very sad to have to ignore the other 92 percent of your customers. So what options do you have? Well, you can start paying for external databases which might be able to fill in the gaps. However, those databases often have the exact same info you have. Another option is to fill in the missing values of each customer with the average of all other customers. While this will allow you to use more of your data, it will likely reduce the strength of your relationships. More recently, a popular technique has been to use the same statistical analysis used in key influencer analysis (e.g., regression) to predict the most likely value of the missing data based on the actual values of all other attributes of the record. For example, if you know that a customer is male, age 20, not a home owner, and works part time, your statistical regression model is likely to also predict that this person has a partial college education. While it is definitely possible that this prediction is wrong, it is much more likely to be accurate than using simply the most common value for education found in the data.
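The difference between the two fill-in strategies can be sketched with a deliberately simplified case: one numeric attribute (income) predicted from one other attribute (age). The customer records are hypothetical, and a real model would use many attributes at once, including categorical ones such as the education example above.

```python
# Hypothetical records: (age, income in $1,000s); None marks a missing income.
customers = [(20, 30.0), (30, 50.0), (40, 70.0), (50, 90.0), (25, None)]

# Fit the regression only on records that are complete.
complete = [(age, income) for age, income in customers if income is not None]
ages = [age for age, _ in complete]
incomes = [income for _, income in complete]

mean_age = sum(ages) / len(ages)
mean_income = sum(incomes) / len(incomes)  # what average-based filling would use
slope = sum((a - mean_age) * (i - mean_income) for a, i in zip(ages, incomes)) / sum(
    (a - mean_age) ** 2 for a in ages
)
intercept = mean_income - slope * mean_age


def impute_income(age):
    """Predict a missing income from the customer's known age."""
    return intercept + slope * age


# Replace each missing income with the regression prediction.
filled = [
    (age, income if income is not None else impute_income(age))
    for age, income in customers
]
```

For the 25-year-old, filling in the overall average would assign the income of a typical 35-year-old in this sample, while the regression estimate takes the customer's known age into account.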

Another useful technique is to use the clustering algorithms found in category detection tools to identify records that are outliers. For example, your customers may have an average income of $75,000 per year. Therefore, a customer making $200,000 per year may appear to be an outlier. Removing outliers from the data is a great way to improve your predictive power. However, a clustering algorithm would examine all of the other attributes of this seeming outlier in concert and find that because that customer has a graduate degree and 30 years of full-time work experience, they are well within the normal range of customers. Similarly, another customer who earns the exact average of $75,000 would be identified as an outlier if they are 16 years old.
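The contrast between judging one attribute in isolation and judging all attributes together can be sketched as follows. Instead of a full clustering algorithm, this example uses a simpler stand-in: after putting age and income on a common scale, it measures how far each customer is from their nearest neighbor. The customer data and helper names are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical customers: (age, income in $1,000s).
customers = [
    (25, 40), (30, 50), (35, 60), (40, 75), (45, 90),
    (50, 110), (55, 140), (58, 170), (60, 200), (16, 75),
]


def standardize(values):
    """Rescale values to z-scores so that attributes are comparable."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]


z_age = standardize([age for age, _ in customers])
z_income = standardize([income for _, income in customers])
points = list(zip(z_age, z_income))


def nearest_neighbor_distance(i):
    """Distance from customer i to the most similar other customer."""
    return min(dist(points[i], q) for j, q in enumerate(points) if j != i)


indices = range(len(customers))
# Judged on income alone: the customer whose income is farthest from average.
by_income_alone = customers[max(indices, key=lambda i: abs(z_income[i]))]
# Judged on all attributes: the customer with no similar customers nearby.
by_all_attributes = customers[max(indices, key=nearest_neighbor_distance)]
```

On income alone, the $200,000 earner looks most unusual. Taking age into account, that customer sits close to other older, high-income customers, while the 16-year-old with an ordinary income has no close neighbors and is the one flagged.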

In summary, statistical formulas and techniques have become mainstream in today's BI stack. Creating new and useful ways to integrate statistical prediction into your business processes is a great way to save costs, increase revenues, and get noticed by your managers.
