CRISP-DM
Cross Industry Standard Process for Data Mining (CRISP-DM) was created between 1996 and 2000 as a result of a consortium including SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA. It divides the process of data mining into six major phases, as shown in the CRISP-DM reference model in the preceding comparison table.
This model provides a bird's-eye view of a data-mining project life cycle. The sequence of the phases are not rigid; jumping back and forth from phase to phase is allowed and expected. Data mining does not cease upon the completion of a particular project. Instead, it exists as long as the business exists, and should be constantly revisited to answer new questions as they arise.
In the next section, we will consider each of the six phases that comprise CRISP-DM and explore how Tableau can be used throughout the life cycle. We will particularly focus on the data preparation phase, as that is the phase encompassing data cleaning, the focus of this chapter. By considering the following steps, you will be able to understand in more detail what a full data mining process circle looks like under CRISP-DM. This framework can be used to make your workflow in Tableau more efficient by working according to an established model.
CRISP-DM phases
In the following sections, we will briefly define each of the six CRISP-DM phases and include high-level information on how Tableau might be used.
Phase I – business understanding:
- This phase determines the business objectives and corresponding data mining goals. It also assesses risks, costs, and contingencies, and culminates in a project plan.
- Tableau is a natural fit for presenting information to enhance business understanding.
Phase II – data understanding:
- This phase begins with an initial data collection exercise. The data is then explored to discover early insights and identify data quality issues.
- Once the data is collected into one or more relational data sources, Tableau can be used to effectively explore the data and enhance data understanding.
Phase III – data preparation:
- This phase includes data selection, cleaning, construction, merging, and formatting.
- Tableau can be effectively used to identify the preparation tasks that need to occur; that is, Tableau can be used to quickly identify the data selection, cleaning, merging, and so on, that should be addressed. Additionally, Tableau can sometimes be used to do actual data preparation. We will walk through examples in the next section.
As Tableau has evolved, functionality has been introduced to do more and more of the actual data preparation work as well as the visualization. For example, Tableau Prep Builder is a standalone product that ships with Tableau Desktop and is dedicated to data prep tasks. We will cover Tableau Prep Builder in Chapter 3, Tableau Prep Builder.
Phase IV – modeling:
- In this phase, data modeling methods and techniques are considered and implemented in one or more data sources. It is important to choose an approach that works well with Tableau; for example, as discussed in Chapter 6, All About Data – Data Densification, Cubes, and Big Data, Tableau works better with relational data sources than with cubes.
- Tableau has some limited data modeling capabilities, such as pivoting datasets through the data source page.
Phase V – evaluation:
- The evaluation phase considers the results; do they meet the business goals with which we started the data mining process? Test the model on another dataset, for example, from another day or on a production dataset, and determine whether it works as well in the workplace as it did in your tests.
- Tableau is an excellent fit for considering the results during this phase, as it is easy to change the input dataset as long as the metadata layer remains the same—for example, the column header stays the same.
Phase VI – deployment:
- This phase should begin with a carefully considered plan to ensure a smooth rollout. The plan should include ongoing monitoring and maintenance to ensure continued streamlined access to quality data. Although the phase officially ends with a final report and accompanying review, the data mining process, as stated earlier, continues for the life of the business. Therefore, this phase will always lead to the previous five phases.
- Tableau should certainly be considered a part of the deployment phase. Not only is it an excellent vehicle for delivering end-user reporting; it can also be used to report on the data mining process itself. For instance, Tableau can be used to report on the performance of the overall data delivery system and thus be an asset for ongoing monitoring and maintenance.
- Tableau Server is the best fit for Phase VI. We will discuss this separate Tableau product in Chapter 14, Interacting with Tableau Server/Online.
Now that we have learned what a full data mining circle looks like (and looked like pre-Tableau) and understood that every step can be executed in Tableau, we can see how it makes sense that data people celebrate Tableau Software products.
The phrase "data people" is especially memorable after listening to the song written for the 2019 Las Vegas Tableau Conference, at https://youtu.be/UBrH7MXf-Q4.
Tableau makes data mining so much easier and efficient, and the replication of steps is also easier than it was before, without Tableau. In the next section, we will take a look at a practical example to explore the content we've just learned with some hands-on examples.