GLM example with Spark and R on the HDInsight cluster
In the first practical example of this chapter, we will use the HDInsight cluster with Spark and Hadoop to run a Generalized Linear Model (GLM) on the flight data, which you can download from the Packt Publishing website created for this book.
Preparing the Spark cluster and reading the data from HDFS
Before carrying out any analytics on the data, let's first double-check that you have all of the required resources in place. In this tutorial, we will use the same multi-node HDInsight cluster that you deployed by following the instructions in Chapter 7, Faster than Hadoop: Spark with R, specifically the section on Launching HDInsight with Spark and R/RStudio. If you don't remember how to launch the HDInsight cluster on Microsoft Azure, detailed step-by-step guidelines are provided in the HDInsight - A multi-node Hadoop cluster on Azure section in Chapter 4, Hadoop and MapReduce Framework for R. As...