Handling outliers
Outliers are observations that differ markedly from the rest of our data. Sometimes, outliers are exactly what we are looking for, such as when we want to detect anomalies in a running engine or flag fraudulent transactions. Other times, outliers are mistakes in data collection and can make a model less accurate. It is important to know whether your dataset contains outliers, understand what they represent, and remove them if necessary.
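To make "very different from the rest" concrete, one widely used convention is the 1.5 × IQR rule: a value is flagged when it falls more than 1.5 interquartile ranges below the first quartile or above the third. The following is a minimal Python sketch of that rule, not a method prescribed by this chapter. The function name `iqr_outlier_mask`, the multiplier `k`, and the sample scores are all illustrative assumptions; the scores are made up and do not come from any dataset discussed here.

```python
import statistics


def iqr_outlier_mask(values, k=1.5):
    """Return a list of booleans, True where a value is an outlier under the k * IQR rule.

    Hypothetical helper for illustration; k=1.5 is a common convention, not a fixed rule.
    """
    # Quartiles computed with the "inclusive" method (the data is treated as the full population).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Made-up scores with one suspiciously large value.
scores = [5.1, 5.4, 5.6, 5.8, 6.0, 6.2, 6.3, 6.5, 12.0]
mask = iqr_outlier_mask(scores)

outliers = [v for v, is_out in zip(scores, mask) if is_out]
cleaned = [v for v, is_out in zip(scores, mask) if not is_out]

print("Outliers:", outliers)
print("Remaining values:", cleaned)
```

Flagging and removing are kept as separate steps on purpose: the mask only tells you which values are unusual, and whether to drop them still depends on what they represent.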
A common approach to finding outliers is to use a box plot. In Chapter 2, Exploring Data in Power BI, we created one for the 2019 Life Ladder score in the World Happiness Report dataset, as seen in the following figure:
In this figure, the box plot shows the distribution of the Life Ladder scores for all countries. At first glance, it seems to be normally distributed,...