MLlib is the original machine learning library provided with Apache Spark, the in-memory, cluster-based, open source data processing system. This library is still based on the RDD API. In a later chapter, we'll also learn how machine learning works on the newer DataFrame and Dataset APIs. In this chapter, we will examine the functionality that MLlib provides in areas such as regression, classification, and neural network processing. We will examine the theory behind each algorithm before providing working examples that tackle real problems. Working examples matter here because the example code and documentation available on the web can be sparse and confusing.
We will take a step-by-step approach to the following topics, describing how each algorithm can be used and what it is capable of doing:
- Architecture
- Classification with Naive Bayes
- Clustering with K-Means
- Image classification...