Chapter 14: Data Processing with Apache Spark
In the previous chapter, you learned how to add streaming data to your data pipelines. Using Python or Apache NiFi, you can extract, transform, and load streaming data. However, to perform transformations on large amounts of streaming data, data engineers turn to tools such as Apache Spark. For non-trivial transformations, Apache Spark is faster than most alternatives, such as MapReduce, and it enables distributed data processing.
In this chapter, we're going to cover the following main topics:
- Installing and running Spark
- Installing and configuring PySpark
- Processing data with PySpark