Removing outliers
Trimming, or truncating, is the process of removing observations that contain outliers in one or more variables of the dataset. There are three commonly used methods to set the boundaries beyond which a value is considered an outlier. If the variable is normally distributed, the boundaries are given by the mean plus or minus three times the standard deviation, as approximately 99.7% of the data lies within those limits. For both normally and non-normally distributed variables, we can determine the limits using the IQR proximity rule or by directly setting them to the 5th and 95th quantiles. In this recipe, we are going to use the IQR proximity rule to identify and then remove outliers using pandas, and then we will automate this process for multiple variables with Feature-engine.
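To make the IQR proximity rule concrete before starting the recipe, here is a minimal sketch with pandas. It places the limits at the 25th quantile minus, and the 75th quantile plus, a multiple of the inter-quartile range. The helper name `find_limits`, the toy column `x`, and the multiplier of 1.5 (a common convention) are assumptions made for illustration and are not taken from the recipe's own code.

```python
import pandas as pd


def find_limits(df, variable, fold=1.5):
    """Return the (lower, upper) IQR proximity-rule limits for one column.

    `fold` is the multiple of the IQR used to place the limits; 1.5 is a
    common convention and is an assumption here.
    """
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Hypothetical toy data with one obvious outlier in column "x".
data = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})

lower, upper = find_limits(data, "x")

# Keep only the rows whose value lies within the limits (inclusive).
trimmed = data[data["x"].between(lower, upper)]

print(lower, upper)
print(len(data), "->", len(trimmed), "rows")
```

Because trimming drops whole rows, removing outliers in one variable also discards that row's values in every other variable, so it is worth checking how many observations remain afterwards.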
How to do it...
Let’s first import the Python libraries and load the data:
- Import the Python libraries, functions, and classes:
import...