Machine learning on Spark and Hadoop
MLlib is a machine learning library on top of Spark that provides major machine learning algorithms and utilities. It is divided into two separate packages:
spark.mllib
: This is the original machine learning API built on top of Resilient Distributed Datasets (RDD). As of Spark 2.0, this RDD-based API is in maintenance mode and is expected to be deprecated and removed in upcoming releases of Spark.
spark.ml
: This is the primary machine learning API built on top of DataFrames to construct machine learning pipelines and optimizations.
spark.ml is preferred over spark.mllib because it is based on the DataFrames API, which provides higher performance and flexibility.
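To make the pipeline idea more concrete, here is a minimal pure-Python sketch of the transformer/estimator pattern that spark.ml pipelines follow: a transformer maps a dataset to a new dataset, an estimator is fitted on a dataset to produce a transformer, and a pipeline chains such stages in order. This is not the Spark API. The classes `Tokenizer`, `VocabularyIndexer`, `VocabularyIndexerModel`, `Pipeline`, and `PipelineModel` below are hypothetical stand-ins that work on plain lists of dictionaries instead of DataFrames, and the sample data is made up for illustration.

```python
# Toy illustration of the transformer/estimator pipeline pattern.
# All classes here are hypothetical stand-ins, NOT the pyspark.ml API,
# and rows are plain dicts rather than DataFrame rows.


class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class VocabularyIndexerModel:
    """Fitted transformer: maps known words to their learned integer ids."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            {
                **row,
                # Words not seen during fitting are silently dropped.
                "ids": [self.vocabulary[w] for w in row["words"] if w in self.vocabulary],
            }
            for row in rows
        ]


class VocabularyIndexer:
    """Estimator: learns a vocabulary from the data and returns a fitted model."""

    def fit(self, rows):
        words = sorted({w for row in rows for w in row["words"]})
        return VocabularyIndexerModel({w: i for i, w in enumerate(words)})


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Estimator that fits its stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimator: fit it first to get a transformer
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "spark ml pipelines"}, {"text": "hadoop mapreduce jobs"}]
model = Pipeline([Tokenizer(), VocabularyIndexer()]).fit(training)

result = model.transform([{"text": "Spark pipelines on Hadoop"}])
print(result[0]["ids"])  # prints [5, 4, 0]
```

The point of the design is that the fitted pipeline can be reused as a single unit: the same sequence of steps applied during training is replayed on new data, which is the convenience the DataFrame-based API is built around.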
Apache Mahout was a general machine learning library on top of Hadoop. Mahout started out primarily as a Java MapReduce package to run machine learning algorithms. Because machine learning algorithms are iterative in nature, and each iteration ran as a separate MapReduce job that read its input from disk and wrote its results back to disk, MapReduce had major performance and scalability issues. So, Mahout stopped...