CRISP-DM: Data Mining Process

In the last chapter, you covered a fairly in-depth set of techniques in Excel for turning a dataset into a prediction calculator. You may also remember that we summarized a structured, step-by-step process. You may be wondering if there are industry standards for this process. Yes, there are. So before we go any further, it's time we "put it all together" in a data science methodology. Hopefully, this methodology will be more understandable now that you've worked through most of it in a simple example in the previous chapter.

Methodologies are simply frameworks for performing tasks that help us cover every important step. You may not have realized it, but everything we've learned so far fits into just such a methodology: CRISP-DM, the Cross-Industry Standard Process for Data Mining. CRISP-DM is currently the leading framework used and taught for data mining. Although the framework is standard, the specific tools, software, and statistical techniques vary greatly. See the figure below for a depiction of the phases involved:

Figure attribution: Kenneth Jensen, ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf , CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610

Phases of CRISP-DM

We will briefly explain each phase below (with portions drawn from https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining).

Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

This phase is where most of the course has focused so far. Visualizing data in Tableau, creating clusters, and testing for non-linearities, normality, and heteroskedasticity are all techniques used to understand the data. The vast majority of your time will, and should, be spent on understanding the data, because if you don't understand the data, you are taking large risks when it comes time to perform modeling and make business decisions based on those models.
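To make this concrete, here is a minimal sketch of a few data-understanding checks in Python. The course has performed these checks in Excel and Tableau; pandas and scipy are used here purely for illustration, and the columns (Income, Age, PurchasedBike) are synthetic, hypothetical stand-ins for the bike buyers data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data (hypothetical columns, not the course dataset)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Income": rng.lognormal(mean=10.8, sigma=0.5, size=500),  # right-skewed
    "Age": rng.normal(loc=40, scale=10, size=500),
    "PurchasedBike": rng.integers(0, 2, size=500),
})

# Summary statistics: a first pass at spotting quality problems and outliers
print(df.describe())

# Normality check on Income (Shapiro-Wilk): a small p-value suggests the
# variable is not normally distributed and may benefit from a log transform
stat, p = stats.shapiro(df["Income"])
print(f"Shapiro-Wilk on Income: W={stat:.3f}, p={p:.4f}")

# Simple correlation scan for first insights into relationships
print(df.corr(numeric_only=True))
```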

Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Recall the tasks we've performed on the Bike Buyers and Lending Tree data: changing variables from categorical to ordinal, making logarithmic transformations, and calculating exponential variants. These are all examples of data preparation. Typically, data scientists will initially perform these transformations manually until they are sure of their results. Then, the cleaning and preparation will be automated into an ETL (extract, transform, load) process.
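As an illustration, here is a minimal sketch of those same kinds of transformations in Python rather than Excel. The column names and commute-distance bands are hypothetical, and the squared term stands in for the non-linear variants mentioned above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "CommuteDistance": ["0-1 Miles", "2-5 Miles", "5-10 Miles", "0-1 Miles"],
    "Income": [40000, 85000, 120000, 60000],
})

# Categorical to ordinal: map each commute band to an ordered integer code
distance_order = {"0-1 Miles": 1, "2-5 Miles": 2, "5-10 Miles": 3}
df["CommuteOrdinal"] = df["CommuteDistance"].map(distance_order)

# Logarithmic transformation: compresses a right-skewed variable like income
df["LogIncome"] = np.log(df["Income"])

# Squared variant: lets a linear model pick up a non-linear effect
df["IncomeSquared"] = df["Income"] ** 2

print(df)
```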

Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

This phase is where much of the remainder of this course will focus. We will learn many different forms of modeling and alternative calculations for each modeling type. In the previous chapter, the "modeling" phase was the part where you analyzed regression models, which produced coefficients (a.k.a. "weights") that indicated how important each variable was.
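For a concrete picture, here is a minimal sketch of fitting such a regression in Python rather than Excel. The data and effect sizes are entirely synthetic; the sketch simply shows where the coefficients come from:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data: hypothetical income and age predictors
rng = np.random.default_rng(0)
n = 500
income = rng.normal(60000, 15000, n)
age = rng.normal(40, 10, n)
# Made-up target: purchase propensity driven by income and age plus noise
purchased = 0.00001 * income - 0.01 * age + rng.normal(0, 0.3, n)

X = sm.add_constant(np.column_stack([income, age]))  # prepend intercept column
model = sm.OLS(purchased, X).fit()

# The fitted coefficients ("weights") show each variable's contribution
print(model.params)   # [intercept, income, age]
print(model.summary())
```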

Evaluation

At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

How was this step performed in the last chapter? Well, we started some preliminary evaluation when we examined R-squared, which indicated how well the regression model variables (age, income, commute distance, etc.) explained the dependent variable (purchased bike).
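R-squared can also be computed directly from a model's predictions, which makes its meaning concrete: it is the share of variance in the dependent variable that the model explains. A minimal sketch with made-up actual and predicted values:

```python
import numpy as np

# Hypothetical actual values (1 = bought a bike) and model predictions
y_actual = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0.8, 0.2, 0.6, 0.9, 0.3, 0.7, 0.4, 0.1])

ss_res = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")  # share of variance explained (0.700 here)
```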

Deployment

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.

In the bike shop example, we deployed the model by building the Excel-based calculator that made use of the model results. However, that is a very limited deployment because we would then have to make that Excel file available to others and have a mechanism to update the model occasionally as more data is collected. In practice, we integrate model results into information systems (e.g., think of Amazon's predictions of which products you'd like based on what you've already clicked on). The information system not only uses the model results to guide customer/user behavior, but also records the new behaviors (e.g., purchase decisions), which are fed back into future iterations of the model, thus completing the loop required for machine learning to occur.
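As a sketch of what a (still very simple) programmatic deployment could look like, here is a hypothetical scoring function that an information system might call for each new customer. The coefficient values and field names are invented for illustration:

```python
# Hypothetical coefficients exported from a fitted regression model
COEFFICIENTS = {"intercept": -0.35, "income": 0.00001, "age": -0.01}

def score_customer(income: float, age: float) -> float:
    """Return a predicted purchase propensity for one customer."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["income"] * income
            + COEFFICIENTS["age"] * age)

# A real system would also log the customer's eventual decision, feeding it
# back into the next training run and closing the machine-learning loop
print(score_customer(income=75000, age=35))
```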

Perhaps the most important take-away from CRISP-DM is that data mining is an iterative process, often with no clear "finish" point. Rather, data scientists have to continually evaluate and re-evaluate their models as new data is gathered and new algorithms emerge. However, at some point, a decision needs to be made concerning the best model so that deployment can begin. Data scientists must learn to "satisfice" so that the quest for perfection doesn't end up costing more than the benefits of incremental improvements to the models.

Summary

Prior to this chapter, you've been learning techniques for Phases 2 and 3: Data Understanding and Data Preparation. However, you've also had a "taste" in the prior chapter of Phases 4, 5, and 6: Modeling, Evaluation, and Deployment when you built the bike buyers prediction calculator. Now that you've learned the high-level phases of the overall data mining process, let's learn some "high-end" tools that will help us do a better job of Modeling and Evaluation. However, to be clear, we will not be able to go further into the Deployment phase in this course beyond a basic prediction calculator.
