In the previous chapters, we looked at various aspects of the data analysis life cycle using Scala and some of its associated data analysis libraries. These libraries work well on a single machine; however, most real-world data is too big to fit on a single machine and requires distributed processing across multiple machines. It is certainly possible to write distributed data processing code in Scala, but the complexity of handling failures rises significantly in a distributed environment. Fortunately, there are robust and proven open source solutions available to facilitate distributed data processing at large scale. One such solution is Apache Spark.
Apache Spark (https://spark.apache.org/) is a unified analytics engine that supports robust and reliable distributed data processing at scale.