Identifying and Cleaning Outliers
When working with real-world data, we often find that some records do not fit with the rest: they contain values that are too big, too small, or completely missing. Such records are called outliers.
Statistics offers formal definitions of what an outlier is, and you often need deep domain expertise to decide when to call a particular record an outlier. Here, however, we will look at some basic, commonplace techniques for flagging and filtering outliers in real-world data for day-to-day work.
Exercise 6.07: Outliers in Numerical Data
In this exercise, we will construct a notion of an outlier based on numerical data. Imagine a cosine curve. If you remember the math from high school, a cosine curve is a very smooth curve whose values stay within the range [-1, 1]. We will plot this cosine curve using the plot...
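As a minimal sketch of the cosine curve described above, the snippet below samples cos(x) on a grid and confirms that every value stays within [-1, 1]. The grid (200 points, step 0.1), the names xs, data_cos, and in_bounds, and the helper plot_curve are assumptions made for illustration, not the exercise's own code. The plotting helper assumes matplotlib is installed.

```python
import math

# Hypothetical sampling grid: 200 points spaced 0.1 apart (an assumption).
xs = [i * 0.1 for i in range(200)]
data_cos = [math.cos(x) for x in xs]

# A clean cosine curve never leaves the range [-1, 1].
in_bounds = all(-1.0 <= y <= 1.0 for y in data_cos)


def plot_curve(x_values, y_values):
    """Draw the curve as a line plot.

    matplotlib is imported inside the function so that the data
    generation above still runs where matplotlib is not available.
    """
    import matplotlib.pyplot as plt

    plt.plot(x_values, y_values)
    plt.xlabel("x")
    plt.ylabel("cos(x)")
    plt.show()
```

Calling plot_curve(xs, data_cos) draws the curve. Because a clean cosine never leaves [-1, 1], any point far outside that range would stand out as a candidate outlier.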