Dealing with common data issues
Make no mistake: you will spend a large chunk of your time cleaning messy data, whether that means mislabeled records, wrong formats, or missing values, among other issues. In this section, we will go through the most common problems that will affect your modeling efforts. Let’s start with outliers and missing values.
Bill Gates walks into a bar
The classical example of outlier effects is as follows:
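In the usual telling, a billionaire walks into a bar and the average patron suddenly looks fabulously wealthy, while the typical patron is no better off. Here is a minimal sketch of that effect in Python. The net worths, the size of the outlier, and the summarize helper are all made up for illustration.

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of nine bar patrons.
# These figures are invented for illustration only.
patrons = [40_000, 55_000, 60_000, 75_000, 80_000,
           90_000, 110_000, 120_000, 150_000]

# Hypothetical outlier: one extremely wealthy newcomer (assumed figure).
outlier = 100_000_000_000


def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {"mean": mean(values), "median": median(values)}


before = summarize(patrons)
after = summarize(patrons + [outlier])

print(f"Before: mean={before['mean']:,.0f}  median={before['median']:,.0f}")
print(f"After:  mean={after['mean']:,.0f}  median={after['median']:,.0f}")
```

With these assumed numbers, the mean jumps from under one hundred thousand dollars to roughly ten billion, while the median only moves from 80,000 to 85,000. A single extreme value is enough to make the mean a poor description of the typical patron.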
We touched upon this issue when we discussed the difference between the mean and the median. But how do you deal with it? The easiest way is simply to remove the data point. You essentially ignore it and assume it does not exist. If you are dealing with a lot of data points, this might seem reasonable. But as an analyst...