Unsupervised learning – clustering and dimensionality reduction
A lot of existing data is not labeled. Unsupervised models still make it possible to learn from data that has no labels. A typical task during exploratory data analysis is to find related items, or clusters. We can imagine the Iris dataset, but without the labels.
Although the task seems much harder without labels, one group of measurements (in the lower left) clearly stands apart. The goal of clustering algorithms is to identify these groups.
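As a rough illustration (this plotting snippet is our own sketch, not from the original text), we might scatter two of the four Iris measurements without using the species labels at all:
>>> from sklearn.datasets import load_iris
>>> import matplotlib.pyplot as plt
>>> iris = load_iris()
>>> X = iris.data                     # measurements only; species labels are ignored
>>> plt.scatter(X[:, 2], X[:, 3])     # petal length vs. petal width
>>> plt.xlabel(iris.feature_names[2])
>>> plt.ylabel(iris.feature_names[3])
>>> plt.show()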
We will use K-Means clustering on the Iris dataset (without the labels). This algorithm expects the number of clusters to be specified in advance, which can be a disadvantage. K-Means will try to partition the dataset into groups by minimizing the within-cluster sum of squares.
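To make the objective concrete, here is a rough sketch (the function name wcss and the variable names are ours, not from the text) of the quantity K-Means minimizes, given the data points, their cluster assignments, and the cluster centers:
>>> import numpy as np
>>> def wcss(X, labels, centers):
...     # Sum of squared distances from each point to its assigned cluster center
...     return sum(
...         np.sum((X[labels == i] - c) ** 2)
...         for i, c in enumerate(centers)
...     )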
For example, we instantiate the KMeans model with n_clusters equal to 3:
>>> from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=3)
Similar to supervised algorithms, we can use the fit method...
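For example, a minimal sketch of fitting the model and inspecting the result (loading the Iris measurements into X is our own setup, not from the text):
>>> from sklearn.datasets import load_iris
>>> X = load_iris().data           # four measurements per flower, no labels used
>>> km.fit(X)                      # learn three cluster centers from the data
>>> labels = km.labels_            # cluster index assigned to each sample
>>> centers = km.cluster_centers_  # coordinates of the three centers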