Chapter 7. Unsupervised Learning at Scale
In the previous chapters, the focus was on predicting a target variable, which could be a number, a class, or a category. In this chapter, we will change the approach and instead create new features and variables at scale, ideally ones that serve our prediction purposes better than those already included in the observation matrix. We will first introduce unsupervised methods in general and then illustrate three of them that are able to scale to big data (a short code preview follows the list):
- Principal Component Analysis (PCA), an effective way to reduce the number of features
- K-means, a scalable algorithm for clustering
- Latent Dirichlet Allocation (LDA), an effective algorithm for extracting topics from a collection of text documents
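As a quick preview, here is a minimal sketch that instantiates all three techniques. It assumes scikit-learn is available and uses small synthetic and toy data purely for illustration (the data, parameter choices, and the mini-batch K-means variant are assumptions of this sketch, not the chapter's actual pipeline); the scalable setups are covered in the sections that follow.

```python
# Minimal preview of the three techniques (illustrative sketch, not the
# chapter's scalable pipeline); assumes scikit-learn is installed.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import CountVectorizer

# PCA: project 10-dimensional observations onto 2 principal components
X, _ = make_blobs(n_samples=1000, n_features=10, centers=3, random_state=0)
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (1000, 2)

# K-means (mini-batch variant, chosen here because it scales to larger data)
labels = MiniBatchKMeans(n_clusters=3, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))  # number of points assigned to each cluster

# LDA: extract 2 topics from a toy corpus of text documents
docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets closed", "investors sold shares today"]
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))  # per-document topic proportions
```

Each of the three estimators follows the same fit/transform (or fit/predict) pattern; the rest of the chapter focuses on how to make these steps work when the data no longer fits comfortably in memory.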