How do we do data mining?
Data mining is traditionally seen as one step in the overall KDD process, and increasingly as one step in the data science process, so in this section we get acquainted with the steps involved. There are several popular methodologies for doing the work of data mining. Here we highlight four: two taken from textbook introductions to the theory of data mining, one taken from a very practical process used in industry, and one designed for teaching beginners.
The Fayyad et al. KDD process
One early version of the knowledge discovery and data mining process was defined by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth in a 1996 article (The KDD Process for Extracting Useful Knowledge from Volumes of Data). This article was important at the time for refining the rapidly changing KDD methodology into a concrete set of steps. The following steps lead from raw data at the beginning to knowledge at the end:
- Data selection: The input to this step is raw data, and the output of this selection step is a smaller subset of the data, called the target data.
- Data pre-processing: The target data is cleaned, oddities and outliers are removed, and missing data is accounted for. The output of this step is pre-processed data, or cleaned data.
- Data transformation: The cleaned data is organized into a format appropriate for the mining step, and the number of features or variables is reduced if need be. The output of this step is transformed data.
- Data mining: The transformed data is mined for patterns using one or more data mining algorithms appropriate to the problem at hand. The output of this step is the discovered patterns.
- Data interpretation/evaluation: The discovered patterns are evaluated for their ability to solve the problem at hand. The output of this step is knowledge.
Since this process leads from raw data to knowledge, it is fitting that these authors were the ones most committed to the term knowledge discovery in databases rather than simply data mining.
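To make the flow from raw data to knowledge concrete, here is a minimal sketch of the five steps as plain Python functions. The records, the function names, and the thresholds are all invented for illustration; a frequency count stands in for a real data mining algorithm.

```python
# Toy illustration of the five Fayyad et al. steps. The data, function names,
# and thresholds are assumptions made up for this sketch.
from collections import Counter

raw_data = [
    {"customer": "a", "item": "bread", "qty": 2, "store": "north"},
    {"customer": "b", "item": "milk", "qty": None, "store": "north"},
    {"customer": "c", "item": "bread", "qty": 1, "store": "south"},
    {"customer": "d", "item": "bread", "qty": 500, "store": "north"},
    {"customer": "e", "item": "eggs", "qty": 3, "store": "north"},
    {"customer": "f", "item": "bread", "qty": 1, "store": "north"},
]

def select(records):
    """Data selection: keep only the target data (here, the north store)."""
    return [r for r in records if r["store"] == "north"]

def preprocess(records, max_qty=100):
    """Pre-processing: drop rows with missing quantities and obvious outliers."""
    return [r for r in records if r["qty"] is not None and r["qty"] <= max_qty]

def transform(records):
    """Transformation: reduce each record to the one feature the mining step needs."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining: a trivial stand-in algorithm that counts item frequencies."""
    return Counter(items)

def evaluate(patterns, min_count=2):
    """Interpretation/evaluation: keep only patterns frequent enough to matter."""
    return {item: n for item, n in patterns.items() if n >= min_count}

target_data = select(raw_data)            # raw data -> target data
cleaned_data = preprocess(target_data)    # target data -> cleaned data
transformed_data = transform(cleaned_data)  # cleaned data -> transformed data
patterns = mine(transformed_data)         # transformed data -> patterns
knowledge = evaluate(patterns)            # patterns -> knowledge
print(knowledge)  # {'bread': 2}
```

Each function takes the output of the previous step as its input, which mirrors the way every step in the list names both what it receives and what it produces.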
The Han et al. KDD process
Another version of the knowledge discovery process is described in the popular data mining textbook Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei as the following steps, which also lead from raw data to knowledge at the end:
- Data cleaning: The input to this step is raw data, and the output is cleaned data.
- Data integration: In this step, the cleaned data is integrated (if it came from multiple sources). The output of this step is integrated data.
- Data selection: The data set is reduced to only the data needed for the problem at hand. The output of this step is a smaller data set.
- Data transformation: The smaller data set is consolidated into a form that will work with the upcoming data mining step. This is called transformed data.
- Data mining: The transformed data is processed by intelligent algorithms that are designed to discover patterns in that data. The output of this step is one or more patterns.
- Pattern evaluation: The discovered patterns are evaluated for their interestingness and their ability to solve the problem at hand. The output of this step is an interestingness measure applied to each pattern, representing knowledge.
- Knowledge representation: In this step, the knowledge is communicated to users through various means, including visualization.
In both the Fayyad and Han methodologies, the process is expected to iterate over the steps as many times as needed. For example, if during the transformation step the analyst realizes that another data cleaning or pre-processing pass is needed, both methodologies specify that the analyst should double back and complete a second iteration of the incomplete earlier step.
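The doubling-back behavior can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the `clean` and `transform` functions, the `NeedsMoreCleaning` exception, and the sample values are hypothetical and do not come from either methodology.

```python
import math

class NeedsMoreCleaning(Exception):
    """Hypothetical signal that a later step found an incomplete earlier step."""

def clean(values, drop_negative=False):
    """Cleaning step: remove missing values, and optionally negative ones."""
    kept = [v for v in values if v is not None]
    if drop_negative:
        kept = [v for v in kept if v >= 0]
    return kept

def transform(values):
    """Transformation step: take square roots, which is undefined for negatives."""
    if any(v < 0 for v in values):
        raise NeedsMoreCleaning("negative values reached the transformation step")
    return [math.sqrt(v) for v in values]

raw = [4, None, -1, 9, 16]
passes = 0
drop_negative = False
while True:
    passes += 1
    cleaned = clean(raw, drop_negative=drop_negative)
    try:
        transformed = transform(cleaned)
        break  # transformation succeeded, move on to the next step
    except NeedsMoreCleaning:
        # Double back: repeat the cleaning step with a stricter rule.
        drop_negative = True

print(passes, transformed)  # 2 [2.0, 3.0, 4.0]
```

The first pass removes only the missing value, the transformation step discovers the leftover negative number, and the second pass through cleaning fixes it.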
The CRISP-DM process
A third popular version of the KDD process that is used in many business and applied domains is called CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining. It consists of the following steps:
- Business understanding: In this step, the analyst spends time understanding the reasons for the data mining project from a business perspective.
- Data understanding: In this step, the analyst becomes familiar with the data and its potential promises and shortcomings, and begins to generate hypotheses. The analyst is expected to reassess the business understanding (step 1) if needed.
- Data preparation: This step includes all the data selection, integration, transformation, and pre-processing steps that are enumerated as separate steps in the other models. The CRISP-DM model does not prescribe the order in which these tasks are done.
- Modeling: This is the step in which the algorithms are applied to the data to discover the patterns. This step is closest to the actual data mining steps in the other KDD models. The analyst is expected to reassess the data preparation step (step 3) if the modeling and mining step requires it.
- Evaluation: The model and discovered patterns are evaluated for their value in answering the business problem at hand. The analyst is tasked with revisiting the business understanding (step 1) if necessary.
- Deployment: The discovered knowledge and models are presented and put into production to solve the original problem at hand.
One of the strengths of this methodology is that iteration is built in. Between specific steps, it is expected that the analyst will check that the current step is still in agreement with certain previous steps. Another strength of this method is that the analyst is explicitly reminded to keep the business problem front and center in the project, even down in the evaluation steps.
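One way to picture this built-in iteration is as a small lookup table of back-references. The sketch below encodes only the three reassessment links mentioned in the step descriptions above; the names `CRISP_DM_STEPS`, `REASSESS`, and `next_step` are invented for illustration and are not part of CRISP-DM itself.

```python
CRISP_DM_STEPS = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Back-references as described in the step list: step -> earlier step to reassess.
# This is an assumption-laden simplification, not the full CRISP-DM diagram.
REASSESS = {
    "Data understanding": "Business understanding",
    "Modeling": "Data preparation",
    "Evaluation": "Business understanding",
}

def next_step(current, needs_rework=False):
    """Return the step to work on next.

    If the current step revealed a problem and has a back-reference, go back
    to that earlier step. Otherwise move forward, or return None once
    Deployment is finished.
    """
    if needs_rework and current in REASSESS:
        return REASSESS[current]
    i = CRISP_DM_STEPS.index(current)
    return CRISP_DM_STEPS[i + 1] if i + 1 < len(CRISP_DM_STEPS) else None

print(next_step("Modeling"))                     # Evaluation
print(next_step("Modeling", needs_rework=True))  # Data preparation
print(next_step("Deployment"))                   # None
```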
The Six Steps process
When I teach the introductory data science course at my university, I use a hybrid methodology of my own creation. This methodology is called the Six Steps, and I designed it to be especially friendly for teaching. My Six Steps methodology removes some of the ambiguity that inexperienced students may encounter with open-ended CRISP-DM tasks such as Business Understanding, or with corporate-focused tasks such as Deployment. In addition, the Six Steps method keeps the focus on developing students' critical thinking skills by requiring them to answer "Why are we doing this?" and "What does it mean?" at the beginning and end of the process. My Six Steps method looks like this:
- Problem statement: In this step, the students identify what the problem is that they are trying to solve. Ideally, they motivate the case for why they are doing all this work.
- Data collection and storage: In this step, students locate the data needed for this problem and plan how to store it. They also document where the data came from, what format it is in, and what all the fields mean.
- Data cleaning: In this step, students carefully select only the data they really need, and pre-process the data into the format required for the mining step.
- Data mining: In this step, students formalize their chosen data mining methodology. They describe what algorithms they used and why. The output of this step is a model and discovered patterns.
- Representation and visualization: In this step, the students show the results of their work visually. The outputs of this step can be tables, drawings, graphs, charts, network diagrams, maps, and so on.
- Problem resolution: This is an important step for beginner data miners. This step explicitly encourages the student to evaluate whether the patterns they showed in step 5 are really an answer to the question or problem they posed in step 1. Students are asked to state the limitations of their model or results, and to identify parts of the motivating question that they could not answer with this method.
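Because the Six Steps are meant to be followed in order and written up, they lend themselves to a simple checklist. The following is a minimal sketch, assuming a project report is kept as a dictionary keyed by step name; the `missing_steps` helper and the sample draft are hypothetical.

```python
SIX_STEPS = [
    "Problem statement",
    "Data collection and storage",
    "Data cleaning",
    "Data mining",
    "Representation and visualization",
    "Problem resolution",
]

def missing_steps(report):
    """Return, in order, the Six Steps that have no non-empty write-up in `report`."""
    return [step for step in SIX_STEPS if not report.get(step, "").strip()]

# A made-up draft in which the last two steps have not been written yet.
draft = {
    "Problem statement": "Which items sell most often at one store?",
    "Data collection and storage": "Sales export, CSV, one row per purchase.",
    "Data cleaning": "Dropped rows with missing quantities.",
    "Data mining": "Frequency counts per item.",
    "Representation and visualization": "",
}

print(missing_steps(draft))
# ['Representation and visualization', 'Problem resolution']
```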
Which data mining methodology is the best?
A 2014 survey of the subscribers of Gregory Piatetsky-Shapiro's very popular data mining email newsletter KDnuggets included the question "What main methodology are you using for your analytics, data mining, or data science projects?" The results were as follows:
- 43% of the poll respondents indicated that they were using the CRISP-DM methodology
- 27% of the respondents were using their own methodology or a hybrid
- 7% were using the traditional KDD methodology
- The remaining respondents chose another KDD method
These results are generally similar to the 2007 results from the same newsletter asking the same question.
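The list does not say how large the remaining group is. A quick check, assuming the three percentages above are rounded and that each respondent chose exactly one answer (neither assumption is stated in the figures themselves), puts it at roughly 23%:

```python
# Percentages as listed above; treated as rounded, single-choice figures.
reported = {
    "CRISP-DM": 43,
    "Own or hybrid methodology": 27,
    "Traditional KDD": 7,
}

remaining = 100 - sum(reported.values())
print(remaining)  # 23
```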
My best advice is that it does not matter too much which methodology you use for a data mining project, as long as you just pick one. If you do not have any methodology at all, then you run the risk of forgetting important steps. Choose one of the methods that seems like it might work for your project and your needs, and then just do your best to follow the steps.
For this book, we will vary our data mining methodology depending on which technique we are looking at in a given chapter. For example, even though the focus of the book as a whole is on the data mining step, we still need to motivate each chapter-length project with a healthy dose of Business Understanding (CRISP-DM) or Problem Statement (Six Steps) so that we understand why we are doing the tasks and what the results mean. In addition, to learn a particular data mining method, we may also have to do some pre-processing, whether we call that data cleaning, integration, or transformation. But in general, we will try to keep these tasks to a minimum so that our focus on data mining remains clear. One prominent exception will be in the final chapter, where we will show specific methods for dealing with missing data and anomalies. Finally, even though data visualization is typically very important for representing the results of your data mining process to your audience, we will also keep these tasks to a minimum so that we can remain focused on the primary job at hand: data mining.