Overview of the package
At a high level, MLlib exposes three core areas of machine learning functionality:
- Data preparation: Feature extraction, transformation, selection, hashing of categorical features, and some natural language processing methods
- Machine learning algorithms: Implementations of popular and more advanced regression, classification, and clustering algorithms
- Utilities: Statistical methods such as descriptive statistics, chi-square testing, linear algebra (sparse and dense matrices and vectors), and model evaluation methods
As you can see, this range of functionality lets you perform almost all of the fundamental data science tasks.
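To make the data preparation item more concrete, the idea behind hashing categorical features can be sketched in plain Python. This is only an illustration of the general hashing trick, not MLlib's implementation: the function name `hash_features`, the bucket count, the use of MD5, and the sample category strings are all assumptions made for the example.

```python
import hashlib


def hash_features(categories, num_buckets=16):
    """Map a list of categorical values to a fixed-length count vector.

    Hypothetical helper: each value is hashed to a bucket index, and the
    count in that bucket is incremented. Different values may collide.
    """
    vector = [0.0] * num_buckets
    for value in categories:
        # MD5 is used here only because it is deterministic across runs;
        # Python's built-in hash() is randomized for strings.
        digest = hashlib.md5(value.encode("utf-8")).hexdigest()
        index = int(digest, 16) % num_buckets
        vector[index] += 1.0
    return vector


# Made-up categorical values for a single record
row = ["state=WA", "sex=F", "plurality=1"]
encoded = hash_features(row, num_buckets=8)
print(encoded)
```

The output always has `num_buckets` entries regardless of how many distinct categories exist, which is what makes hashing attractive for high-cardinality features. The trade-off is that two different values can land in the same bucket.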
In this chapter, we will build two classification models: a logistic regression and a random forest. We will use a portion of the US 2014 and 2015 birth data, which we downloaded from http://www.cdc.gov/nchs/data_access/vitalstatsonline.htm; from a total of 300 variables, we selected 85 features to build our models. Also, out of...