Normalizing data with Spark
In this recipe, we normalize (scale) the data prior to feeding it into an ML algorithm. A good number of ML algorithms, such as Support Vector Machines (SVMs), work better with scaled input vectors than with the raw values.
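Min-max scaling, the normalization used in this recipe, maps each feature into the range [0, 1] by subtracting the feature's minimum and dividing by its range. As a rough sketch of the arithmetic (independent of Spark), using a hypothetical helper name:

```scala
// Min-max scaling: scaled = (x - min) / (max - min)
// Maps every value of a feature into [0.0, 1.0].
def minMaxScale(xs: Seq[Double]): Seq[Double] = {
  val (lo, hi) = (xs.min, xs.max)
  // Guard against a constant feature (max == min) to avoid division by zero
  xs.map(x => if (hi == lo) 0.0 else (x - lo) / (hi - lo))
}

// minMaxScale(Seq(1.0, 2.0, 3.0)) yields Seq(0.0, 0.5, 1.0)
```

Spark's MinMaxScaler, imported later in this recipe, performs the same computation per feature column across the whole dataset.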
How to do it...
- Go to the UCI Machine Learning Repository and download the wine dataset file: http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster, and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
- Define a method to parse wine data into a tuple:
def parseWine(str: String): (Int, Vector...
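The method signature above is truncated in this text. A plausible completion, assuming the standard wine.data layout (a comma-separated line whose first column is the wine class label, 1 to 3, followed by 13 numeric attributes), might look like:

```scala
// Hypothetical completion of parseWine: split a CSV line from wine.data
// into a (label, feature-vector) tuple. Column 0 is the class label;
// columns 1 through 13 are the numeric attributes to be scaled.
def parseWine(str: String): (Int, Vector) = {
  val columns = str.split(",")
  (columns(0).toInt, Vectors.dense(columns.slice(1, 14).map(_.toDouble)))
}
```

Rows parsed this way can then be loaded into a DataFrame with a vector-valued features column, which is the input format MinMaxScaler expects.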