Introduction to Spark MLlib
MLlib stands for Machine Learning Library. It is Spark's machine learning component, designed to make machine learning scalable, approachable, and easy for data scientists and engineers. It was created at the UC Berkeley AMPLab and first shipped with Spark 0.8.
Spark MLlib is a very active project with substantial contributions from the community. Its coverage of machine learning algorithms keeps growing and spans classification, regression, clustering, and recommendation, along with utilities such as feature extraction, feature selection, summary statistics, linear algebra, and frequent pattern mining.
Version 0.8 started small, introducing a limited set of algorithms, such as:
- KMeans
- Alternating Least Squares (ALS)
- Gradient Descent (Optimization Technique)
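To make the last item in the list above concrete, here is a minimal sketch of gradient descent in plain Python. It is an illustration of the technique only, not MLlib's implementation or API. The function name fit_slope, the toy data, and the step size are assumptions chosen for this example.

```python
def fit_slope(xs, ys, learning_rate=0.01, num_iterations=200):
    """Fit y = w * x by gradient descent on the mean squared error.

    Illustrative sketch only: the name and parameters are hypothetical
    and do not correspond to any MLlib API.
    """
    n = len(xs)
    w = 0.0  # initial guess for the slope
    for _ in range(num_iterations):
        # Gradient of (1/n) * sum((w*x - y)^2) with respect to w
        gradient = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error
        w -= learning_rate * gradient
    return w


# Toy data generated from y = 2x, so the fitted slope should approach 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope = fit_slope(xs, ys)
```

Each iteration moves the parameter a small step in the direction that lowers the error. A distributed library such as MLlib applies the same idea, with the gradient computed across partitions of a large dataset.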
From an API perspective, support for these algorithms was made available in the following programming languages:
- Java
- Scala
The pace of MLlib's development can be gauged from the fact that version 0.9 was launched within 3 months, adding the following...