Introducing MLlib
If you're doing any real data science, data mining, or machine learning with Spark, you're going to find the MLlib library very helpful. MLlib (the machine learning library) is built on top of Spark and ships as part of the Spark package. It contains useful routines for machine learning and data mining, along with some supporting functions that you might find helpful. Let's review what some of those are. When we're done, we'll actually use MLlib to generate movie recommendations for users using the MovieLens dataset again.
MLlib capabilities
The following is a list of the techniques that MLlib has built-in support for:
- Feature extraction
  - Term Frequency / Inverse Document Frequency (TF-IDF), useful for search
- Basic statistics
  - Chi-squared test, Pearson or Spearman correlation, min, max, mean, and variance
- Linear regression and logistic regression
- Support Vector Machines
- Naïve Bayes classifier
- Decision trees
- K-Means clustering
- Principal...