Implementing a Spark ML classification model
The first step in implementing a machine learning model is to perform EDA on the input data. This analysis typically involves visualizing the data using tools such as Zeppelin, assessing feature types (numeric/categorical), computing basic statistics, covariances, and correlation coefficients, creating pivot tables, and so on (for more details on EDA, see Chapter 3, Using Spark SQL for Data Exploration).
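As a minimal sketch of these EDA steps, the snippet below loads a dataset and computes schema information, summary statistics, a correlation coefficient, and a simple contingency table. The file path and the column names (`age`, `income`, `label`, `category`) are hypothetical placeholders; substitute the ones from your own dataset:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("EDA").getOrCreate()

// Hypothetical input: a CSV file of labeled records (path and columns are placeholders)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/input.csv")

df.printSchema()                        // assess feature types (numeric vs. categorical)
df.describe("age", "income").show()     // basic statistics: count, mean, stddev, min, max
println(df.stat.corr("age", "income"))  // Pearson correlation between two numeric columns
df.stat.crosstab("label", "category").show() // a simple pivot/contingency table
```

The `describe`, `stat.corr`, and `stat.crosstab` methods used here are part of Spark SQL's built-in DataFrame statistics API, so these checks can run on the full dataset without collecting it to the driver.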
The next step involves executing data pre-processing and/or data munging operations. In almost all cases, real-world input data will not be high-quality data ready for use in a model. Several transformations are typically required to convert the features from the source format into the final variables; for example, categorical features may need to be transformed into a binary variable for each categorical value using the one-hot encoding technique (for more details on data munging, see Chapter 4, Using Spark SQL for Data Munging).
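One-hot encoding of a categorical feature can be sketched with Spark ML's `StringIndexer` and `OneHotEncoder` as below. The DataFrame `df` and its `colour` column are hypothetical; note also that the encoder API changed across Spark versions (in Spark 3.x `OneHotEncoder` is an estimator with a `fit` step, while in early Spark 2.x it was a transformer used via `transform` alone):

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Hypothetical DataFrame `df` with a categorical string column "colour"

// Step 1: map each category string to a numeric index
val indexer = new StringIndexer()
  .setInputCol("colour")
  .setOutputCol("colourIndex")

// Step 2: expand the index into a sparse binary vector,
// one position per categorical value (Spark 3.x estimator API)
val encoder = new OneHotEncoder()
  .setInputCol("colourIndex")
  .setOutputCol("colourVec")

val indexed = indexer.fit(df).transform(df)
val encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
```

In practice, both stages would usually be assembled into an ML `Pipeline` together with the downstream model, so the same encoding is applied consistently at training and prediction time.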
Next is the feature...