Splitting data for training and testing
In this recipe, you will learn to use Spark's API to split your available input data into separate datasets for the training and validation phases. An 80/20 split is common, but other ratios can be used as well, depending on your needs.
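As a rough preview of the idea, the following minimal sketch shows one way to perform an 80/20 split with the randomSplit method of a Spark DataFrame. It is an illustration under stated assumptions, not the recipe's own listing: the object name SplitSketch, the helper split, the local master setting, the seed value, and the generated stand-in data are all hypothetical and are used here in place of the downloaded dataset.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SplitSketch {

  // Hypothetical helper: splits a DataFrame into (training, test) using 80/20 weights.
  // randomSplit samples row by row, so the resulting sizes are only approximately 80/20.
  def split(data: DataFrame, seed: Long): (DataFrame, DataFrame) = {
    val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed)
    (training, test)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("SplitSketch")
      .getOrCreate()

    // Stand-in data: 1,000 numbered rows instead of the real input file.
    val data = spark.range(0, 1000).toDF("id")

    // A fixed seed makes the split repeatable across runs on the same data.
    val (training, test) = split(data, seed = 42L)

    println(s"training rows: ${training.count()}, test rows: ${test.count()}")

    spark.stop()
  }
}
```

Changing the weights array, for example to Array(0.7, 0.3), gives a different ratio. The weights are normalized by Spark if they do not sum to 1.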
How to do it...
- Go to the UCI Machine Learning Repository and download the http://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip file.
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Set up the package location where the program will reside:
package spark.ml.cookbook.chapter4
- Import the necessary packages for the Spark session to gain access to the cluster and log4j.Logger to reduce the amount of output produced by Spark:
import org.apache.spark.sql.SparkSession
import org.apache.log4j.{Level, Logger}
- Set the output level to ERROR to reduce Spark's logging output:
Logger...