Feature preparation
In the previous section, we selected our models and prepared the dependent variable for supervised machine learning. In this section, we prepare the independent variables: the features representing the factors that affect our dependent variable, the sales team's success. Specifically, we need to reduce our roughly four hundred features to a reasonable group for final modeling. To do this, we will employ PCA, apply some subject knowledge, and then perform some feature selection tasks.
PCA
PCA is a mature and commonly used feature reduction method, often applied to find a small set of variables that accounts for most of the variance. Technically, the goal of PCA is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible.
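To make the idea of a variance-capturing subspace concrete, here is a minimal sketch in Python with NumPy. It is not the MLlib implementation: the pca helper, the synthetic data, and the choice of two components are all assumptions made purely for illustration.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components.

    Hypothetical helper for illustration only.
    """
    # Center each feature so variance is measured about the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]
    # Fraction of the total variance explained by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, components, explained[:k]

# Synthetic stand-in data (an assumption, not the sales dataset):
# 200 rows and 10 features that are driven by only 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

scores, components, explained = pca(X, k=2)
print(scores.shape)     # shape of the reduced dataset
print(explained.sum())  # share of variance kept by the two components
```

Because the ten synthetic features are generated from two underlying factors, two components retain almost all of the variance. On real data, the number of components is usually chosen by inspecting the explained-variance fractions.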
If you are using MLlib, http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca has a few...