Scaling with Spark
Apache Spark came out of the work of researchers at the University of California, Berkeley in 2012, and since then it has revolutionized how we tackle problems involving large datasets. Before Spark, the dominant paradigm for big data processing was Hadoop MapReduce, which has since fallen out of favor.
Spark is a cluster computing framework: it works on the principle that several computers are linked together so that computational tasks can be shared between them and coordinated effectively. Whenever we discuss running Spark jobs, we talk about the cluster we are running on. This is the collection of computers that perform the tasks, known as the worker nodes, and the computer that hosts the coordinating workload, known as the head node.
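To make this concrete, the following is a minimal PySpark sketch, assuming the pyspark package is installed; the application name and the local[*] master URL are illustrative choices, not requirements. The SparkSession created on the head node is the entry point through which the driver process defines work that Spark then distributes across the worker nodes; using a local master simply simulates that cluster on a single machine.

```python
from pyspark.sql import SparkSession

# The SparkSession is created on the head (driver) node; it is the entry
# point for submitting work to the cluster.
spark = (
    SparkSession.builder
    .appName("cluster-basics")   # illustrative name
    .master("local[*]")          # simulate a cluster using all local cores
    .getOrCreate()
)

# The driver defines the computation; the work on each partition of the
# data is carried out by the worker nodes (executors).
df = spark.range(0, 1_000_000)            # a distributed collection of numbers
total = df.selectExpr("sum(id)").first()[0]
print(total)

spark.stop()
```

On a real cluster, the master URL would point at the cluster manager rather than local[*], but the code defining the computation itself would not need to change.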
Spark is written in Scala, a language with a strong functional flavor that compiles to bytecode running on the Java Virtual Machine (JVM). Since this is a book about ML engineering in Python, we don't...