Using k-nearest neighbor to find outliers
Unsupervised machine learning tools can help us identify observations that are unlike others when we have unlabeled data; that is, when there is no target or dependent variable. (In the previous recipe, we used total cases per million as the dependent variable.) Even when selecting targets and factors is relatively straightforward, it can be helpful to identify outliers without making any assumptions about the relationships between variables. We can use k-nearest neighbor to find the observations that are most unlike the others: those with the greatest difference between their values and their nearest neighbors' values.
Getting ready
You will need PyOD (Python outlier detection) and scikit-learn to run the code in this recipe. You can install both by entering the following commands in a terminal, or in PowerShell on Windows:
pip install pyod
pip install scikit-learn
How to do it…
We will use k-nearest neighbor to identify countries whose attributes...
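Before working through the steps, the following is a minimal sketch of the approach, assuming a DataFrame of per-country numeric features. The file name and column names here are placeholders rather than the recipe's actual data. PyOD's KNN detector scores each observation by its distance to its nearest neighbors, so the highest-scoring rows are the most anomalous:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# load per-country data (hypothetical file and column names; substitute your own)
covidtotals = pd.read_csv("data/covidtotals.csv")
features = ["total_cases_pm", "total_deaths_pm", "pop_density"]
analysis = covidtotals[features].dropna()

# standardize the features so no single variable dominates the distance calculation
standardized = StandardScaler().fit_transform(analysis)

# fit the k-nearest neighbor detector; each observation is scored by its
# distance to its nearest neighbors
clf = KNN(n_neighbors=5)
clf.fit(standardized)

# labels_ flags predicted outliers (1) versus inliers (0);
# decision_scores_ holds the raw distance-based outlier scores
analysis = analysis.assign(outlier=clf.labels_,
                           score=clf.decision_scores_)

# the observations most unlike their neighbors have the highest scores
print(analysis.sort_values("score", ascending=False).head())
Standardizing before fitting matters because k-nearest neighbor relies on distances; without it, a variable measured on a large scale would dominate the outlier scores.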