Understanding the Spark Dataset API
Spark provides several APIs for interacting with data. They are powerful tools for building data engineering pipelines in Scala because they supply common data-processing functionality you would otherwise have to write yourself. The first API we will work with is the Dataset API.
A Dataset is a collection of objects called Rows. Each Row has a defined structure and data types that hold the data we process, and the rows of a Dataset can be processed in parallel on our Spark cluster, as explained previously. Explicitly defining the structure and data types of objects is called strong typing: each column in your row data is associated with a specific data type. Because Datasets are strongly typed, they are checked for type errors at compile time, which is far better than discovering a data type problem at runtime! Strong typing means you have to put in a little work ahead of time.
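To make the idea of strong typing concrete, here is a minimal sketch of creating a typed Dataset from a Scala case class. The `Sale` case class, the sample values, and the local-mode session are all hypothetical illustrations, not part of any real pipeline; the sketch assumes the `spark-sql` dependency is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class describing the structure and data types of one row.
case class Sale(id: Long, product: String, amount: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration only; on a real cluster
    // the master is set by the deployment, not hard-coded.
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A strongly typed Dataset[Sale]: every row must match the
    // Sale case class, and that is verified at compile time.
    val sales = Seq(
      Sale(1L, "widget", 9.99),
      Sale(2L, "gadget", 19.99)
    ).toDS()

    // A typo such as sales.map(s => s.amont) would fail to compile,
    // because the compiler knows the fields and types of each row.
    val amounts = sales.map(s => s.amount)
    amounts.show()

    spark.stop()
  }
}
```

The compile-time check is the payoff for defining the case class up front: a misspelled column name or a string-where-a-double-belongs mistake surfaces when you build the job, not hours into a cluster run.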