Chapter 5. Data Analysis on Spark
The field of data analytics at scale has been evolving rapidly. Various libraries and tools with rich sets of algorithms have been developed for data analysis. In parallel, distributed computing techniques have evolved to process huge datasets at scale. These two lines of development had to converge, and that was the primary intention behind the development of Spark.
The previous two chapters outlined the technology aspects of data science. They covered some fundamentals of the DataFrame API, Datasets, and streaming data, and showed how the API facilitates data representation through DataFrames, which R and Python users are already familiar with. After introducing this API, we saw how operating on datasets became easier than ever. We also looked at how Spark SQL plays a background role in supporting the DataFrame API with its robust features and optimization techniques. In this chapter, we are going to cover the scientific aspect of big data analysis and...