In this section, we will discuss the distribution-based clustering technique and its computational challenges. An example of using Gaussian mixture models (GMMs) with Spark MLlib will be shown for a better understanding of distribution-based clustering.
Distribution-based clustering (DC)
Challenges in the DC algorithm
A distribution-based clustering algorithm such as GMM is typically fitted with the expectation-maximization (EM) algorithm. To avoid overfitting, a GMM usually models the dataset with a fixed number of Gaussian distributions. The distributions are initialized randomly, and their parameters are then optimized iteratively to fit the model better to the training dataset. This is the most robust feature of GMM and helps the model to...
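The loop described above (a fixed number of Gaussians, random initialization, iterative parameter updates) can be sketched in a few lines of plain Python. This is a minimal one-dimensional illustration under stated assumptions, not the Spark MLlib implementation: the names `gaussian_pdf` and `fit_gmm_1d`, the choice of random data points as starting means, the small variance floor, and the synthetic two-cluster data are all hypothetical and chosen only for this example.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k, iterations=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative sketch).

    Returns (weights, means, variances, log_likelihood_history).
    """
    rng = random.Random(seed)
    n = len(data)

    # Random initialization (an assumption of this sketch): k distinct data
    # points as means, the overall data variance for every component, and
    # equal mixing weights.
    means = rng.sample(data, k)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    history = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        log_likelihood = 0.0
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            log_likelihood += math.log(total)
            resp.append([d / total for d in dens])
        history.append(log_likelihood)

        # M-step: re-estimate weights, means, and variances from the
        # responsibility-weighted data.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps a component from collapsing

    return weights, means, variances, history


# Usage: synthetic data drawn from two Gaussians (hypothetical example data).
data_rng = random.Random(42)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
sample += [data_rng.gauss(10.0, 1.0) for _ in range(200)]

weights, means, variances, history = fit_gmm_1d(sample, k=2)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

Two things are worth noticing in the sketch. The number of components `k` is fixed up front and never changes during fitting, and the result depends on the random starting point, so different seeds can lead to different local optima. The log-likelihood recorded in `history` should not decrease from one iteration to the next, which is a convenient sanity check for any EM implementation.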