Cross Industry Standard Process for Data Mining (CRISP-DM) is a process methodology for developing data mining applications. It was created before the term data science became popular, and it is reliable and time-tested, having been used by several generations of analysts. These practices are still useful nowadays and describe the high-level steps of any analytical project quite well.
The CRISP-DM methodology breaks down a project into the following steps:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
The methodology itself defines much more than just these steps, but typically knowing what the steps are and what happens at each step is enough for a successful data science project. Let's look at each of these steps separately.
The first step is Business Understanding. This step aims at learning what kinds of problems the business has and what it wants to achieve by solving them. To be successful, a data science application must be useful for the business. The result of this step is the formulation of the problem we want to solve and the desired outcome of the project.
The second step is Data Understanding. In this step, we try to find out what data can be used to solve the problem. We also need to find out if we already have the data; if not, we need to think about how we can get it. Depending on what data we find (or do not find), we may want to alter the original goal.
When the data is collected, we need to explore it. The process of reviewing the data is often called Exploratory Data Analysis and it is an integral part of any data science project. It helps us to understand the processes that created the data and can already suggest approaches for tackling the problem. The result of this step is the knowledge of which data sources are needed to solve the problem. We will talk more about this step in Chapter 3, Exploratory Data Analysis.
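As a small illustration of what this exploration can look like in code, here is a minimal sketch that computes basic summary statistics for one numeric column. It assumes the values have already been read into an array and that Apache Commons Math is on the classpath; the column itself is made up for the example.

```java
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public class SummaryStats {
    public static void main(String[] args) {
        // Hypothetical values of a single numeric column, already read from the data source
        double[] prices = {120.0, 135.5, 99.9, 210.0, 187.3, 140.0};

        DescriptiveStatistics stats = new DescriptiveStatistics();
        for (double value : prices) {
            stats.addValue(value);
        }

        // Basic statistics often reviewed during Exploratory Data Analysis
        System.out.println("count  = " + stats.getN());
        System.out.println("mean   = " + stats.getMean());
        System.out.println("stddev = " + stats.getStandardDeviation());
        System.out.println("min    = " + stats.getMin());
        System.out.println("max    = " + stats.getMax());
    }
}
```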
The third step of CRISP-DM is Data Preparation. For a dataset to be useful, it needs to be cleaned and transformed to a tabular form. The tabular form means that each row corresponds to exactly one observation. If our data is not in this shape, most machine learning algorithms cannot use it. Thus, we need to prepare the data so that it can eventually be converted to a matrix form and fed to a model.
Also, there could be different datasets that contain the needed information, and they may not be homogeneous. This means that we need to convert these datasets to some common format, which can be read by the model.
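To make the idea of a common tabular form more concrete, the sketch below turns a list of observations into a plain double[][] matrix that a model could later consume. The Purchase record and its fields are purely hypothetical and stand in for whatever the real data looks like.

```java
import java.util.Arrays;
import java.util.List;

public class ToMatrix {
    // Hypothetical record: one observation per object
    record Purchase(double amount, int items, boolean returned) {}

    public static void main(String[] args) {
        List<Purchase> purchases = List.of(
                new Purchase(35.0, 2, false),
                new Purchase(120.5, 5, true));

        // Each row is one observation, each column one feature of that observation
        double[][] matrix = purchases.stream()
                .map(p -> new double[]{p.amount(), p.items(), p.returned() ? 1.0 : 0.0})
                .toArray(double[][]::new);

        System.out.println(Arrays.deepToString(matrix));
    }
}
```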
This step also includes Feature Engineering--the process of creating features that are most informative for the problem and describe the data in the best way.
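As a minimal sketch of feature engineering, the snippet below derives a few new features from two hypothetical raw fields of a single observation (an order timestamp and an amount); the particular features chosen are only an illustration, not a recommendation for any specific problem.

```java
import java.time.LocalDateTime;

public class FeatureEngineering {
    public static void main(String[] args) {
        // Hypothetical raw fields of one observation
        LocalDateTime orderTime = LocalDateTime.of(2017, 3, 18, 22, 45);
        double orderAmount = 249.99;

        // Derived features that may describe the data better than the raw values
        double isWeekend = orderTime.getDayOfWeek().getValue() >= 6 ? 1.0 : 0.0;
        double hourOfDay = orderTime.getHour();
        double logAmount = Math.log1p(orderAmount);

        System.out.printf("isWeekend=%.1f, hourOfDay=%.1f, logAmount=%.3f%n",
                isWeekend, hourOfDay, logAmount);
    }
}
```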
Many data scientists say that they spend most of their time on this step when building data science applications. We will talk about this step in Chapter 2, Data Processing Toolbox, and throughout the book.
The fourth step is Modeling. In this step, the data is already in the right shape and we feed it to different machine learning algorithms. This step also includes parameter tuning, feature selection, and selecting the best model.
Evaluation of the quality of the models from the machine learning point of view happens during this step. The most important thing to check is the model's ability to generalize, and this is typically done via cross-validation. In this step, we may also want to go back to the previous step and do extra cleaning and feature engineering. The outcome is a model that is potentially useful for solving the problem defined in Step 1.
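To illustrate the cross-validation idea, here is a minimal k-fold split sketch. The trainAndEvaluate method is only a stand-in for whatever model training and scoring is actually used; it is not a real library API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidation {
    public static void main(String[] args) {
        int n = 100;   // number of observations
        int k = 5;     // number of folds

        // Shuffle the row indexes so each fold is a random sample
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < n; i++) indexes.add(i);
        Collections.shuffle(indexes, new Random(42));

        double totalScore = 0.0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> validation = new ArrayList<>();
            List<Integer> train = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if (i % k == fold) validation.add(indexes.get(i));
                else train.add(indexes.get(i));
            }
            // Train on the train rows, score on the held-out rows
            totalScore += trainAndEvaluate(train, validation);
        }
        System.out.println("average validation score: " + totalScore / k);
    }

    // Hypothetical stand-in for real model training and evaluation
    static double trainAndEvaluate(List<Integer> train, List<Integer> validation) {
        return 0.5; // a model-specific metric such as accuracy or AUC would go here
    }
}
```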
The fifth step is Evaluation. It includes evaluating the model from the business perspective--not from the machine learning perspective. This means that we need to perform a critical review of the results so far and plan the next steps. Does the model achieve what we want? Additionally, some of the findings may lead to reconsidering the initial question. After this step, we can go to the deployment step or re-iterate the process.
The sixth and final step is Model Deployment. During this step, the model is put into production, so the result is the model integrated into the live system. We will cover this step in Chapter 10, Deploying Data Science Models.
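As a very rough sketch of what adding the model to a live system can mean, the snippet below exposes a scoring endpoint over HTTP using the JDK's built-in HttpServer. The /predict path and the hard-coded prediction are purely illustrative; a real deployment would load the trained model and apply it to the request data.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class ModelServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        server.createContext("/predict", exchange -> {
            // In a real system the trained model would be loaded and applied here
            double prediction = 0.42; // hypothetical score returned by the model
            byte[] body = String.valueOf(prediction).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });

        server.start();
        System.out.println("model serving predictions at http://localhost:8080/predict");
    }
}
```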
Often, evaluation is hard because it is not always possible to say whether the model achieves the desired result or not. In these cases, the evaluation and deployment steps can be combined into one: the model is deployed and applied only to a portion of the users, and then the data for evaluating it is collected. We will also briefly cover the ways of doing this, such as A/B testing and multi-armed bandits, in the last chapter of the book.
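One simple way to apply the model to only part of the users is to assign each user deterministically to a treatment or a control group. The sketch below does this by hashing a hypothetical user ID into a bucket; the 50% split and the ID format are assumptions for the example.

```java
public class AbSplit {
    // Deterministically assigns a user to the treatment group with roughly 50% probability
    static boolean inTreatment(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < 50; // users in buckets 0..49 see the new model
    }

    public static void main(String[] args) {
        String[] users = {"user-1", "user-2", "user-3"};
        for (String user : users) {
            String group = inTreatment(user) ? "treatment (new model)" : "control (old behavior)";
            System.out.println(user + " -> " + group);
        }
    }
}
```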