What are data anomalies?
An anomaly is something unexpected, a deviation from the norm. The classic example of an anomaly in data is an outlier: a data point that is distant in some way from the other data points in the collection. Other types of anomalies include data that is unexpectedly missing and data that contains errors. In the data mining process that we outlined in Chapter 1, Expanding Your Data Mining Toolbox, detecting data anomalies could be considered part of the data cleaning step, although in this chapter we will find that data analysis techniques can sometimes help us with this cleaning task. In the next few pages, we will take a tour through these different types of anomalies, show what they might look like with real data examples, discuss why they happen, and outline a few simple ways to detect them.
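To make the first two kinds of anomaly concrete, here is a minimal sketch in Python. It assumes a small, hypothetical list of numeric readings in which None marks a missing value. The function name find_anomalies, the sample data, and the use of a modified z-score (based on the median absolute deviation) with a cutoff of 3.5 are all illustrative choices, not necessarily the techniques used later in this chapter.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Return (missing_indexes, outlier_indexes) for a list of readings.

    Missing values are assumed to be represented by None.
    Outliers are flagged with a modified z-score, one common rule of thumb.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]
    nums = [v for _, v in present]
    if not nums:
        return missing, []

    med = median(nums)
    # Median absolute deviation: a spread measure that the outlier
    # itself cannot easily distort, unlike the standard deviation.
    mad = median(abs(v - med) for v in nums)

    outliers = []
    if mad > 0:
        for i, v in present:
            score = 0.6745 * (v - med) / mad
            if abs(score) > threshold:
                outliers.append(i)
    return missing, outliers


# Hypothetical sample: one missing reading and one distant value.
readings = [10.1, 9.8, None, 10.3, 9.9, 55.0, 10.0]
missing_idx, outlier_idx = find_anomalies(readings)
print("missing at:", missing_idx)
print("outliers at:", outlier_idx)
```

The median-based score is used here because a single extreme value can inflate the mean and standard deviation enough to hide itself. Errors in the data, the third kind of anomaly, usually need domain-specific checks and are not covered by this sketch.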
Missing data
Even though missing data is not always the first thing people think...