Introduction to Apache Spark
Apache Spark is an open-source analytics engine for large-scale data processing, and its most popular use case is extract, transform, and load (ETL). As an introduction to Spark, we will cover the key concepts surrounding it and some common Spark operations. Specifically, we will start by introducing resilient distributed datasets (RDDs) and DataFrames. Then, we will discuss the Spark basics you need to know for ETL tasks: how to load a set of data from data storage, apply various transformations, and store the processed data. Spark applications can be implemented in multiple programming languages: Scala, Java, Python, and R. In this book, we will use Python so that we stay aligned with the other implementations. The code snippets in this section can be found in this book’s GitHub repository: https://github.com/PacktPublishing/Production-Ready-Applied-Deep-Learning/tree/main/Chapter_5/spark. The datasets we will use in our examples include Google Scholar and the...
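As a quick preview of the load-transform-store pattern described above, the following is a minimal PySpark sketch. The file paths and the column name (input_data.csv, output_data.parquet, and citations) are hypothetical placeholders for illustration only and are not part of the book's repository.

# A minimal ETL sketch in PySpark: load a CSV file, apply a simple
# transformation, and store the result. Paths and column names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro_etl_sketch").getOrCreate()

# Load: read a CSV file into a DataFrame, inferring the schema
df = spark.read.csv("input_data.csv", header=True, inferSchema=True)

# Transform: keep rows with at least one citation and add a derived column
processed = (
    df.filter(F.col("citations") > 0)
      .withColumn("citations_log", F.log1p(F.col("citations")))
)

# Store: write the processed DataFrame out in Parquet format
processed.write.mode("overwrite").parquet("output_data.parquet")

spark.stop()

We will walk through each of these steps (loading, transforming, and storing data) in detail in the following sections.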