An Introduction to Apache Spark and Its APIs – DataFrame, Dataset, and Spark SQL
Apache Spark is written in Scala and has become the dominant distributed data processing framework because it can ingest, enrich, and prepare data at scale for analytical use cases. As a data engineer, you will eventually have to work with data volumes that cannot be processed on a single machine. This chapter will teach you how to use Spark and its various APIs to process that data on a cluster of machines.
In this chapter, we’re going to cover the following main topics:
- Working with Apache Spark
- Creating a Spark application using Scala
- Understanding the Spark Dataset API
- Understanding the Spark DataFrame API