k-Nearest Neighbors
This algorithm belongs to a particular family called instance-based (the methodology is called instance-based learning). It differs from other approaches because it doesn't work with an actual mathematical model. On the contrary, the inference is performed by direct comparison of new samples with existing ones (which are defined as instances). KNN is an approach that can be easily employed to solve clustering, classification, and regression problems (even if, in this case, we are going to consider only the first technique). The main idea behind the clustering algorithm is very simple. Let's consider a data generating process pdata and a finite a dataset drawn from this distribution:
Â
Each sample has a dimensionality equal to N. We can now introduce a distance function d(x1, x2), which in the majority of cases can be generalized with the Minkowski distance:
When p = 2, dp represents the classical Euclidean distance, that is normally the default choice. In particular cases...