MLlib is the machine learning library of the Apache Spark project. One reason to choose MLlib is that it is built on Apache Spark, a fast, general-purpose engine for large-scale data processing. Extensive documentation on MLlib is available at http://spark.apache.org/docs/latest/ml-guide.html. Out of the box, MLlib provides machine learning algorithms such as the following:
- Classification: This assigns an item to one of a set of predefined categories. Gmail, for example, uses it to decide whether an email is spam or not.
- Clustering: This groups similar items together without labels defined in advance. Google uses this to group news articles into categories such as sports, politics, weather, and so on, based on the title and content.
- Collaborative Filtering: This is used by recommendation engines. YouTube and Amazon are classic examples, as they recommend items based on likes and ratings from users. ...