What this book covers
Chapter 1, Data Understanding, provides recipes related to the second phase of CRISP-DM, with a focus on data exploration and data quality. You can apply these recipes as soon as you acquire your data. Naturally, some of them are among the more basic recipes in the book, but as always, we seek out the nonobvious tips and tricks that will make this initial assessment of your data efficient.
Chapter 2, Data Preparation – Select, covers just the first task of the data preparation phase. Data preparation is notoriously time-consuming and is incredibly rich in its potential for time-saving recipes. The cookbook will have a total of four chapters on data preparation. The selection of which data rows and which data columns to analyze can be tricky, but it sets the stage for everything that follows.
Chapter 3, Data Preparation – Clean, is dedicated to just the second generic task of the data preparation phase and covers the cleaning challenges that data miners face. New data miners sometimes assume that if a data warehouse is being used, data cleaning has largely been done up front. Veteran data miners know that there is usually a great deal left to do, since data has to be prepared for a particular use to answer a specific business question. A couple of the recipes will be basic, but the rest will be quite complex, designed to tackle some of data miners' more difficult cleaning challenges.
Chapter 4, Data Preparation – Construct, covers the third generic task of the data preparation phase. Many data miners find that the final model contains far more constructed variables than variables used in their original form, as found in the original data source. Common methods can be as straightforward as ratios of part to whole, or deltas of last month from average month, and so on. However, the chapter won't stop there. It will also provide examples of larger-scale variable construction.
Chapter 5, Data Preparation – Integrate and Format, covers the fourth and fifth generic tasks of the data preparation phase. Integrating covers actions performed in Modeler with the Merge, Append, and Aggregate nodes. Formatting is often simply defined as reconfiguring data to meet the needs of the software, in this instance, Modeler.
Chapter 6, Selecting and Building a Model, addresses what many novice data miners see as their greatest challenge, that is, mastering data mining algorithms. Data mining, however, is not really all about that, and neither is this chapter. A discussion of algorithms can easily fill a book, and a quick search will reveal that it has done so many times. Here we'll address nonobvious tricks to make your modeling time more effective and efficient.
Chapter 7, Modeling – Assessment, Evaluation, Deployment, and Monitoring, covers terribly important topics that don't get as much attention as they deserve, deployment especially. Here too, deployment deserves more attention, but this cookbook's attention is clearly and fully focused on IBM SPSS Modeler and not on its sibling products such as IBM Decision Management or IBM Collaboration and Deployment Services. Their proper use, or some alternative, is part of the complete narrative but beyond the scope of this book. So, ultimately, two CRISP-DM phases and a portion of a third are addressed in one chapter, albeit with a large number of powerful recipes.
Chapter 8, CLEM Scripting, departs from the CRISP-DM format and focuses instead on a particular aspect of the interface: scripting. This final chapter introduces advanced concepts, but it is still written with the intermediate user in mind.
Appendix, Business Understanding, is a special section: an essay-format discussion of the first, and arguably the most critical, phase of CRISP-DM. Tom Khabaza, Meta Brown, Dean Abbott, and Keith McCormick each contribute an essay, collectively discussing all four subtasks.