Semi-supervised scenario
A typical semi-supervised scenario is not very different from a supervised one. Suppose we have a data-generating process $p_{data}(x, y)$, that is, a joint distribution over inputs $x$ and labels $y$.
However, unlike the supervised case, where we can rely on a completely labeled dataset, only a limited number $N$ of data points drawn from $p_{data}$ come with a label: $X_L = \{x_1, x_2, \ldots, x_N\}$, with corresponding labels $Y_L = \{y_1, y_2, \ldots, y_N\}$.
As with other methods, the training sample is assumed to be drawn uniformly, so that no region of $p_{data}$ is excluded. When this condition is met, we can also consider a larger number $M$ of unlabeled samples drawn from the marginal distribution $p(x)$: $X_U = \{x_{N+1}, x_{N+2}, \ldots, x_{N+M}\}$.
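The construction of the labeled and unlabeled sets can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy $p_{data}$ (two Gaussian classes in two dimensions), the function name make_semi_supervised, and the sizes 20 and 200 are hypothetical choices, not part of the original text. Unlabeled points are obtained by sampling from the joint distribution and discarding the label, which is equivalent to sampling from the marginal $p(x)$.

```python
import numpy as np


def make_semi_supervised(n_labeled, n_unlabeled, seed=0):
    """Return (X_L, Y_L, X_U) sampled from a toy p_data (an assumption)."""
    rng = np.random.default_rng(seed)
    n_total = n_labeled + n_unlabeled

    # Toy p_data(x, y): y is a fair coin flip, x | y is a unit-variance
    # Gaussian centered on a class-specific mean.
    y = rng.integers(0, 2, size=n_total)
    means = np.array([[-2.0, 0.0], [2.0, 0.0]])
    x = means[y] + rng.normal(size=(n_total, 2))

    # Pick the labeled points uniformly at random, so that no region of
    # p_data is systematically excluded from the labeled set.
    idx = rng.permutation(n_total)
    labeled, unlabeled = idx[:n_labeled], idx[n_labeled:]

    # The labels of the remaining points are simply discarded.
    return x[labeled], y[labeled], x[unlabeled]


X_L, Y_L, X_U = make_semi_supervised(n_labeled=20, n_unlabeled=200)
print(X_L.shape, Y_L.shape, X_U.shape)  # (20, 2) (20,) (200, 2)
```

Because the labeled indices come from a uniform random permutation, whether a point keeps its label does not depend on its position or its class, which matches the uniform-sampling assumption stated above.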
The context of semi-supervised learning is then defined by the union of the two sets $\{X_L, Y_L\}$ and $X_U$. An important assumption about the unlabeled samples is that their labels are missing at random, without any correlation with the actual label distribution. The unlabeled dataset is assumed to have a distribution that doesn...