The data mining process (as a case study)
As Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, will illustrate, there are many different types of data mining projects. For example, you may wish to create customer segments based on products purchased or service usage, so that you can develop targeted advertising campaigns. Or you may want to determine where to better position products in your store, based on customer purchase patterns. Or you may want to predict which students will drop out of school, so that you can provide additional services before this happens.
In this book, we will be using a dataset where we are trying to predict which people have incomes above or below $50,000. We may be trying to do this because we know that people with incomes above $50,000 are much more likely to purchase our products, given that previous work found that income was the most important predictor of product purchase. The point is that, regardless of the actual data we are using, the principles we will be showing apply to a virtually unlimited range of data mining problems, whether you are trying to determine which customers will purchase a product, when you will need to replace an elevator, how many hotel rooms will be booked on a given date, or what additional complications might occur during surgery.
As was mentioned previously, Modeler supports the entire data mining process. The figure shown next illustrates exactly how Modeler can be used to compartmentalize each aspect of the CRISP-DM process model:
In Chapter 2, The Basics of Using IBM SPSS Modeler, you will become familiar with the Modeler graphical user interface. In this chapter, we will be using screenshots to illustrate how Modeler represents various data mining activities. The figures that follow simply provide an overview of how different tasks look within Modeler, so for the moment do not worry about how each image was created; you will see exactly how to create each of them in later chapters.
First and foremost, every data mining project will need to begin with well-defined business objectives. This is crucial for determining what you are trying to accomplish or learn from a project, and how to translate this into data mining goals. Once this is done, you will need to assess the current business situation and develop a project plan that is reasonable given the data and time constraints.
Once business and data mining objectives are well defined, you will need to collect the appropriate data. Chapter 3, Importing Data into Modeler, will focus on how to bring data into Modeler. Remember that data mining typically uses data that was collected during the normal course of doing business, so it is crucial that the data you are using can really address the business and data mining goals:
Once you have data, it is very important to describe and assess its quality. Chapter 4, Data Quality and Exploration, will focus on how to assess data quality using the Data Audit node:
Once the Data Understanding phase has been completed, it is time to move on to the Data Preparation phase. The Data Preparation phase is by far the most time-consuming and creative part of a data mining project. This is because, as was mentioned previously, we are using data that was collected during the normal course of doing business. As a result, the data will not be clean: it will have errors, it will include information that is not relevant, it will have to be restructured into an appropriate format, and you will need to create many new variables that extract important information. Due to the importance of this phase, we have devoted several chapters to addressing these issues. Chapter 5, Cleaning and Selecting Data, will focus on how to select the appropriate cases by using the Select node, and how to clean data by using the Distinct and Reclassify nodes:
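For readers who think in code, the three operations named above (selecting cases, removing duplicates, and reclassifying categories) can be sketched in plain Python. This is only an analogy for what the Select, Distinct, and Reclassify nodes do, not Modeler's own interface or scripting API; the records, field names, and category mapping are hypothetical.

```python
# Hypothetical customer records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "education": "HS-grad"},
    {"id": 2, "age": 51, "education": "Bachelors"},
    {"id": 2, "age": 51, "education": "Bachelors"},  # duplicate record
    {"id": 3, "age": 17, "education": "11th"},
    {"id": 4, "age": 45, "education": "Masters"},
]

# Select: keep only the cases of interest (here, adults).
adults = [r for r in records if r["age"] >= 18]

# Distinct: keep the first record seen for each id and drop later duplicates.
seen_ids = set()
unique_adults = []
for r in adults:
    if r["id"] not in seen_ids:
        seen_ids.add(r["id"])
        unique_adults.append(r)

# Reclassify: collapse detailed categories into broader groups.
# Categories missing from the mapping fall back to "Other".
education_groups = {
    "HS-grad": "High school",
    "Bachelors": "College",
    "Masters": "College",
}
cleaned = [
    {**r, "education_group": education_groups.get(r["education"], "Other")}
    for r in unique_adults
]
```

In this sketch the under-18 record is removed by the selection step, and the repeated record for id 2 is removed by the distinct step, before the education categories are grouped.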
Chapter 6, Combining Data Files, will continue to focus on the Data Preparation phase by using both the Append and Merge nodes to integrate various data files:
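The difference between appending and merging can be illustrated with a minimal Python sketch: appending stacks records from files that share the same fields, while merging adds fields from another file by matching on a key. The data, the field names, and the choice of `id` as the key are assumptions made for illustration, and the code is an analogy rather than a description of how the Modeler nodes are implemented.

```python
# Hypothetical files; all names and values are invented for illustration.
q1_sales = [{"id": 1, "sales": 100}, {"id": 2, "sales": 250}]
q2_sales = [{"id": 3, "sales": 80}, {"id": 4, "sales": 120}]
regions = [
    {"id": 1, "region": "East"},
    {"id": 2, "region": "West"},
    {"id": 3, "region": "East"},
]

# Append: stack records from files that have the same fields.
all_sales = q1_sales + q2_sales

# Merge: add fields from a second file by matching on a key field.
# Sales records with no matching region keep a missing value (None).
region_by_id = {r["id"]: r["region"] for r in regions}
combined = [{**s, "region": region_by_id.get(s["id"])} for s in all_sales]
```

Here the appended file has four records, and the merged result keeps all four; the record for id 4 has no match in the region file, so its region is left missing.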
Finally, Chapter 7, Deriving New Fields, will focus on constructing additional fields by using the Derive node:
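As a rough illustration of what deriving a field means, the following Python sketch computes two new fields from existing ones. The field names, the values, and the 35-hour threshold are hypothetical; the Derive node itself is configured through the Modeler interface rather than written as code like this.

```python
# Hypothetical records; field names and values are invented for illustration.
people = [
    {"capital_gain": 0, "capital_loss": 0, "hours_per_week": 40},
    {"capital_gain": 5000, "capital_loss": 0, "hours_per_week": 60},
    {"capital_gain": 200, "capital_loss": 1500, "hours_per_week": 20},
]

# Derive: build new fields from existing ones.
# net_capital is a numeric formula; full_time is a true/false flag
# based on an assumed threshold of 35 hours per week.
derived = [
    {
        **p,
        "net_capital": p["capital_gain"] - p["capital_loss"],
        "full_time": p["hours_per_week"] >= 35,
    }
    for p in people
]
```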
At this point, we will be ready to begin exploring relationships within the data. In Chapter 8, Looking for Relationships Between Fields, we will use the Distribution, Matrix, Histogram, Means, Plot, and Statistics nodes to uncover and understand simple relationships between variables:
Once the Data Preparation phase has been completed, we will move on to the Modeling phase. Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, will introduce the various types of models available in Modeler and then provide an overview of the predictive models. It will also discuss how to select a modeling technique. Chapter 10, Decision Tree Models, will cover the theory behind decision tree models and focus specifically on how to build a CHAID model. We will also use a Partition node to generate a test design; this is extremely important because only through replication can we determine whether we have a verifiable pattern:
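The idea behind a test design can be sketched in a few lines of Python: the records are randomly divided into a training set, used to build the model, and a testing set, held back to check whether the pattern replicates. The 70/30 split, the seed value, and the record layout are assumptions chosen for illustration, not settings taken from the Partition node.

```python
import random

# Hypothetical records; in practice these would be the prepared cases.
records = [{"id": i} for i in range(10)]

# A fixed seed makes the random assignment repeatable, so the same
# cases land in the same partition each time the sketch is run.
rng = random.Random(42)
shuffled = records[:]  # copy so the original order is left untouched
rng.shuffle(shuffled)

# Assumed 70/30 split, computed with integer arithmetic.
cut = len(shuffled) * 7 // 10
training = shuffled[:cut]
testing = shuffled[cut:]
```

A model would then be built on `training` only and its accuracy checked on `testing`, since cases the model has never seen give a fairer picture of how it will behave on new data.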
Chapter 11, Model Assessment and Scoring, is the final chapter in this book, and it will provide readers with the opportunity to assess and compare models using the Analysis node. The Evaluation node will also be introduced as a way to evaluate model results:
Finally, we will spend some time discussing how to score new data and export those results to another application using the Flat File node: