Chapter 5. Cleaning and Selecting Data
The previous chapter focused on the data understanding phase of data mining. We explored our data and assessed its quality, and noted that the Data Audit node is usually the first node used for this purpose. You were introduced to this node's options and learned how to interpret its results. You were also introduced to the concept of missing data and shown ways to address it. In this chapter, you will learn how to:
- Select cases
- Sort cases
- Identify and remove duplicate cases
- Reclassify categorical values
Having finished the initial data understanding phase, we are ready to move on to the data preparation phase. Data preparation is typically the most time-consuming aspect of data mining. In fact, even in this very brief introductory book, we are devoting several chapters to this topic because it will be an integral part of every data mining project. However, every data mining project will require different...