With k-means, we will need to specify the exact number of clusters that we want. The algorithm will then iterate until each observation belongs to just one of the k clusters. The algorithm's goal is to minimize the within-cluster variation as defined by squared Euclidean distances: the variation of the kth cluster is the sum of the squared Euclidean distances between all pairs of observations in that cluster, divided by the number of observations in the cluster.
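In symbols (introducing $C_k$ for the set of observations in cluster $k$, $|C_k|$ for its size, and $p$ for the number of features, notation added here rather than taken from the text), the within-cluster variation just described can be written as:

$$
W(C_k) \;=\; \frac{1}{|C_k|} \sum_{i,\, i' \in C_k} \; \sum_{j=1}^{p} \bigl(x_{ij} - x_{i'j}\bigr)^2
$$

The algorithm then looks for the assignment of observations to clusters that minimizes the sum of $W(C_k)$ over all k clusters.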
Because the iterations start from randomly selected initial means, one k-means result can differ greatly from another even if you specify the same number of clusters. Let's see how this algorithm plays out:
- Specify the exact number of clusters you desire (k)
- Initialize: k observations are randomly selected as the initial means
- Iterate:
  - k clusters are created by assigning each observation to its closest cluster center...
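To make these steps concrete, here is a minimal sketch of the assign-and-update loop in Python/NumPy (often called Lloyd's algorithm). The function name `kmeans`, the toy data, and the seeds are assumptions for illustration, not code from the text; the second part of the sketch shows why two runs with the same k can produce different partitions when the initial means are drawn at random.

```python
# A minimal NumPy sketch of the steps above (Lloyd's algorithm).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: k observations are randomly selected as the initial means.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each observation to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned observations
        # (empty clusters are not handled in this sketch).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once the means stabilize
            break
        centers = new_centers
    return labels, centers

# Because the initial means are chosen at random, two runs with the same k
# but different seeds can yield different cluster assignments.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(size=(50, 2)),
               data_rng.normal(size=(50, 2)) + 5])
labels_a, _ = kmeans(X, k=3, seed=1)
labels_b, _ = kmeans(X, k=3, seed=2)
```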