Summary
In this last chapter, we looked at several types of data anomalies, including missing data, data errors, and outliers. We found many real-world examples of each, and determined that locating anomalies is important, no matter how we choose to do it. Some data anomalies must be located and fixed by hand using queries and domain knowledge, while others invite more sophisticated data mining approaches such as statistical methods and machine learning techniques.
The interesting thing about detecting outliers with machine learning is that we are using data mining techniques in order to do better data mining. The author Douglas Adams once said that a computer nerd is someone who uses a computer in order to use a computer. I draw the line at calling us nerds when we use data mining in order to improve our data mining, but perhaps – as befits the title of the book – we can say with pride that we are getting better at Mastering Data...