Data mining and knowledge discovery process models
Data modeling, data preparation, database design, data architecture—the question that arises is, how do these and other similar terms fit together? This is no easy question to answer! Terms may be used interchangeably in some contexts and be quite distinct in others. Also, understanding the interconnectivity of any technical jargon can be challenging.
In the data world, data mining and knowledge discovery process models attempt to define terms consistently and to position the various data sub-disciplines in context. Since the early 1990s, various models have been proposed.
Survey of the process models
The following table compares three process models, each of which serves as a blueprint for conducting a data mining project. All three are used to discover patterns and relationships in data in order to help make better business decisions.
The following table is adapted from A Survey of Knowledge Discovery and Data Mining Process Models by Lukasz A. Kurgan and Petr Musilek, published in The Knowledge Engineering Review, Volume 21, Issue 1, March 2006.
Later on, we will see how Tableau comes into play and makes this process easier and faster for us.
| | KDD | CRISP-DM | SEMMA |
|---|---|---|---|
| Phase I | Selection | Business understanding | Sample |
| Phase II | Pre-processing | Data understanding | Explore |
| Phase III | Transformation | Data preparation | Modify |
| Phase IV | Data mining | Modeling | Model |
| Phase V | Interpretation/evaluation | Evaluation | Assess |
| Phase VI | Consolidate knowledge | Deployment | - |
Since CRISP-DM is used by four to five times as many people as the closest competing model (SEMMA), it is the model we will consider in this chapter. For more information, see http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html.
The important takeaway is that each of these models grapples with the same problems, particularly concerning the understanding, preparing, modeling, and interpreting of data.
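One way to make the row-by-row correspondence concrete is to hold the table in a small data structure and look up which phases line up across models. The following Python snippet is only an illustrative sketch: the names `PHASES` and `counterparts` are hypothetical and do not come from any of the three models or from Tableau, and the phase names are copied from the table above.

```python
# Phase names copied from the comparison table, in order from Phase I to Phase VI.
# None marks a phase that a model does not have (SEMMA has no Phase VI).
PHASES = {
    "KDD": ["Selection", "Pre-processing", "Transformation", "Data mining",
            "Interpretation/evaluation", "Consolidate knowledge"],
    "CRISP-DM": ["Business understanding", "Data understanding", "Data preparation",
                 "Modeling", "Evaluation", "Deployment"],
    "SEMMA": ["Sample", "Explore", "Modify", "Model", "Assess", None],
}


def counterparts(model, phase):
    """Return the phases of the other models that share a table row with `phase`."""
    index = PHASES[model].index(phase)
    return {other: names[index] for other, names in PHASES.items() if other != model}


print(counterparts("CRISP-DM", "Data preparation"))
# → {'KDD': 'Transformation', 'SEMMA': 'Modify'}
```

The lookup relies purely on row position, so it reflects the alignment proposed in the table rather than any formal equivalence between the phases.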