Outliers are observations that differ greatly from the rest of the data; that is, they lie in the long tail(s) of the data distribution. In this recipe, we will learn how to locate and handle outliers.
Handling outliers
Getting ready
To execute this recipe, you need a working Spark environment. We will also be working with the imputed DataFrame we created in the previous recipe, so we assume you have followed the steps to handle missing observations.
No other prerequisites are required.
How to do it...
Let's start with a popular...