Chapter 8: Scaling out Python Using Clusters
In the previous chapter, we discussed parallel processing on a single machine using threads and processes. In this chapter, we will extend that discussion from a single machine to multiple machines working together in a cluster. A cluster is a group of computing devices that cooperate to perform compute-intensive tasks such as data processing. In particular, we will study Python's capabilities for data-intensive computing, which typically uses clusters to process large volumes of data in parallel. Although quite a few frameworks and tools are available for data-intensive computing, we will focus on Apache Spark as the data processing engine and PySpark as the Python library for building such applications.
If Apache Spark with Python is properly configured and implemented, your application's performance can increase manyfold, outperforming competing platforms such as Hadoop MapReduce.