Cleaning up data from outliers
This recipe describes how to deal with real-world datasets and how to clean them before doing any visualization.
We will present a few techniques that differ in approach but share the same goal: removing outliers from the data.
However, cleaning should not be fully automatic. Before we apply any of the robust modern algorithms made to clean data, we need to understand the data as given: what the data points represent and which of them are actually outliers. This cannot be captured in a recipe because it draws on broad areas such as statistics and domain knowledge, and it takes a good eye (and some luck).
Getting ready
We will use the standard Python modules we already know about, so no additional installation is required.
In this recipe, we will introduce a new term: median absolute deviation (MAD). In statistics, MAD is a measure of the variability of a univariate (possessing one variable) sample of quantitative...
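To make the term concrete, here is a minimal sketch of MAD computed as the median of the absolute deviations from the sample median, using only the standard library. The function names `mad` and `find_outliers` are hypothetical, and the outlier rule (the modified z-score with the commonly cited constant 0.6745 and cutoff 3.5) is an assumption chosen for illustration, not necessarily the method this recipe goes on to use.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    center = median(values)
    return median([abs(x - center) for x in values])


def find_outliers(values, threshold=3.5):
    """Return the points whose modified z-score exceeds the threshold.

    The 0.6745 scaling and the 3.5 cutoff are conventional choices,
    assumed here for illustration only.
    """
    center = median(values)
    spread = mad(values)
    if spread == 0:
        # All deviations collapse to zero; the score is undefined,
        # so this sketch flags nothing.
        return []
    return [x for x in values
            if abs(0.6745 * (x - center) / spread) > threshold]


sample = [1, 2, 3, 4, 100]
print(mad(sample))            # median is 3, deviations are 2, 1, 0, 1, 97
print(find_outliers(sample))
```

Because both the center and the spread are medians, a single extreme value such as 100 barely moves them, which is why MAD is often preferred over the standard deviation for this kind of check. Whether a flagged point should actually be dropped is still a judgment call about the data.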