Classification and regression
Apache Spark provides a number of classification and regression algorithms. The main ones are listed below.
Classification
In machine learning and statistics, classification is the problem of identifying which of a set of categories (sub-populations) a new observation belongs to, on the basis of a training set of observations (or instances) whose category membership is known. In classification problems, the dependent variable is typically categorical. A common example is classifying e-mail as spam or not spam. The major classification algorithms that come with Spark include the following:
- Logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- One-vs-Rest classifier
- Naïve Bayes
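To make the definition of classification concrete, the following is a minimal sketch of the spam versus not spam example in plain Python. It does not use Spark's API; it only illustrates the idea of training on labeled observations and then assigning a category to a new one, using a hand-written logistic regression (the first algorithm in the list above). The two features (suspicious word count and link count), the data values, and all function names are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, learning_rate=0.05, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this observation: predicted probability minus label
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the average gradient
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (spam) if the predicted probability is at least 0.5, else 0."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if p >= 0.5 else 0


# Hypothetical training set: [suspicious word count, link count]
# Labels are the known categories: 1 = spam, 0 = not spam
training_features = [
    [8.0, 5.0], [6.0, 7.0], [7.0, 6.0],
    [1.0, 0.0], [0.0, 1.0], [2.0, 1.0],
]
training_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(training_features, training_labels)

# Classify two new, unseen observations
new_spam_like = predict(weights, bias, [9.0, 6.0])
new_ham_like = predict(weights, bias, [1.0, 1.0])
print(new_spam_like, new_ham_like)
```

The dependent variable here is categorical (two classes), which is what distinguishes this from a regression problem. In Spark the same workflow would go through the library's own classifier classes rather than hand-written training code.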
Regression
In machine learning and statistics, regression is a process by which we estimate or predict a response using a model trained on previous data sets....