Generative Gaussian mixtures is an inductive algorithm for semi-supervised clustering. Let's suppose we have a labeled dataset (Xl, Yl) containing N samples (drawn from pdata)Â and an unlabeled dataset Xu containing M >> N samples (drawn from the marginal distribution p(x)). It's not necessary that M >> N, but we want to create a real semi-supervised scenario, with only a few labeled samples. Moreover, we are assuming that all unlabeled samples are consistent with pdata. This can seem like a vicious cycle, but without this assumption, the procedure does not have a strong mathematical foundation. Our goal is to determine a complete p(x|y) distribution using a generative model. In general, it's possible to use different priors, but we are now employing multivariate Gaussians to model our data:
Thus, our model parameters...