The EM algorithm by example of k-means clustering
Probably the most famous algorithm for clustering observations to groups is the k-means algorithm. We will see that this algorithm is just a variant of the EM algorithm.
Given n objects, characterized by p variables, we like to partition them into clusters
such that cluster
has
members and each observation is in one cluster. The mean vector (center, prototype), Vk, of a cluster
is defined as the centroid of the cluster and the components of the mean vector can be calculated by
where
is the number of observations in cluster
and
is the i-th observation belonging to cluster
. For each cluster
the corresponding cluster means
are calculated.
We also need to determine the number of clusters in the output partition. Starting from the given initial locations of the cluster centroids, the algorithm uses the data points to iteratively relocate the centroids and reallocate points to the closest centroid. The process is composed of these steps...