Introduction
What makes Spark one of the most popular analytics engines? How did Spark evolve to become the parallel processing engine of choice? This chapter will help you get answers to these questions and more.
In the previous chapter, we learned about the various big data file formats, including Parquet, AVRO, and ORC, and how to use them. In this chapter, we will solve the challenge of processing large volumes of data that is dynamic, real-time, and grows exponentially in a short period of time. We will learn about systems that can read, write, and process data exponentially faster than sequential processing. This is facilitated by having parallel processing in clusters, which was the origin of Hadoop and MapReduce. Companies including eBay, Facebook, Twitter, and Google used Hadoop and MapReduce extensively. Later, they found that these systems were not designed for iterative machine learning paradigms. Machine learning models require several iterations to fine-tune the hyperparameters...