Since you're here to learn how to solve a real-life problem in Scala, exploring available Scala libraries would be worthwhile. Unfortunately, we don't have many options except for the Spark MLlib and ML, which can be used for the regression analysis very easily and comfortably. Importantly, it has every regression analysis algorithm implemented as high-level interfaces. I assume that Scala, Java, and your favorite IDE such as Eclipse or IntelliJ IDEA are already configured on your machine. We will introduce some concepts of Spark without providing much detail, but we will continue learning in upcoming chapters too.
First, I'll introduce SparkSession, which is a unified entry point of a Spark application introduced from Spark 2.0. Technically, SparkSession is the gateway to interact with some of Spark's functionality with a few constructs such as SparkContext, HiveContext, and SQLContext, which are all encapsulated in a SparkSession. Previously, you have seen how to create such a session, probably without knowing it. Well, a SparkSession can be created as a builder pattern as follows:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder // the builder itself
.master("local[4]") // number of cores (i.e. 4, use * for all cores)
.config("spark.sql.warehouse.dir", "/temp") // Spark SQL Hive Warehouse location
.appName("SparkSessionExample") // name of the Spark application
.getOrCreate() // get the existing session or create a new one
The preceding builder will try to get an existing SparkSession or create a new one. Then the newly created SparkSession will be assigned as the global default.
By the way, when using spark-shell, you don't need to create a SparkSession explicitly, because it's already created and accessible with the spark variable.
Creating a DataFrame is probably the most important task in every data analytics task. Spark provides a read() method that can be used to read data from numerous sources in various formats such as CSV, JSON, Avro, and JDBC. For example, the following code snippet shows how to read a CSV file and create a Spark DataFrame:
val dataDF = spark.read
.option("header", "true") // we read the header to know the column and structure
.option("inferSchema", "true") // we infer the schema preserved in the CSV
.format("com.databricks.spark.csv") // we're using the CSV reader from DataBricks
.load("data/inputData.csv") // Path of the CSV file
.cache // [Optional] cache if necessary
Once a DataFrame is created, we can see a few samples (that is, rows) by invoking the show() method, as well as print the schema using the printSchema() method. Invoking describe().show() will show the statistics about the DataFrame:
dataDF.show() // show first 10 rows
dataDF.printSchema() // shows the schema (including column name and type)
dataDF.describe().show() // shows descriptive statistics
In many cases, we have to use the spark.implicits._ package, which is one of the most useful imports. It is handy, with a lot of implicit methods for converting Scala objects to datasets and vice versa. Once we have created a DataFrame, we can create a view (temporary or global) for performing SQL using either the ceateOrReplaceTempView() method or the createGlobalTempView() method, respectively:
dataDF.createOrReplaceTempView("myTempDataFrame") // create or replace a local temporary view with dataDF
dataDF.createGlobalTempView("myGloDataFrame") // create a global temporary view with dataframe dataDF
Now a SQL query can be issued to see the data in tabular format:
spark.sql("SELECT * FROM myTempDataFrame")// will show all the records
To drop these views, spark.catalog.dropTempView("myTempDataFrame") or spark.catalog.dropGlobalTempView("myGloDataFrame"), respectively, can be invoked. By the way, once you're done simply invoking the spark.stop() method, it will destroy the SparkSession and all the resources allocated by the Spark application. Interested readers can read detailed API documentation at https://spark.apache.org/ to get more information.