PySpark API
We have been using the PySpark API in every section when describing the features of Azure Databricks, without saying much about its functionality or how we can leverage it to build reliable ETL operations on big data. PySpark is the Python API for Apache Spark, the cluster-computing framework at the heart of Azure Databricks.
Main functionalities of PySpark
PySpark lets you harness the power of distributed computing with the ease of use of Python, and it is the default way we express our computations throughout this book unless stated otherwise.
The fundamentals of PySpark lie in the functionality of its sub-packages, of which the most central are the following:
- PySpark DataFrames: Data stored as rows organized into a set of named columns. These DataFrames are immutable and evaluated lazily, so transformations only build an execution plan until an action is triggered, as the first sketch after this list shows.
- The PySpark SQL module: A higher-abstraction module for processing structured and semi-structured datasets, letting us query DataFrames with standard SQL (see the second sketch below).
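To make the first point concrete, here is a minimal sketch of creating a DataFrame and applying a lazy transformation. The application name and the sample data are illustrative only; on Azure Databricks a `SparkSession` is already provided as `spark`, so the builder call is there just to keep the sketch self-contained:

```python
from pyspark.sql import SparkSession

# On Azure Databricks, `spark` already exists; we build one explicitly
# here so the example runs on its own (app name is arbitrary).
spark = SparkSession.builder.appName("pyspark-basics").getOrCreate()

# Create a DataFrame from in-memory rows with named columns
# (hypothetical sample data for illustration).
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# Transformations are lazy: this line only builds an execution plan,
# and `df` itself is never modified (DataFrames are immutable).
adults = df.filter(df.age > 30).select("name")

# Nothing is computed until an action such as show() or count() runs.
adults.show()
```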
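Continuing from the sketch above, the SQL module lets us express the same computation declaratively. Registering the DataFrame as a temporary view (the view name `people` is our choice) makes it queryable with ordinary SQL, and the result is again a lazy DataFrame:

```python
# Register the DataFrame from the previous sketch as a temporary view.
df.createOrReplaceTempView("people")

# spark.sql returns a DataFrame, so SQL and DataFrame operations compose;
# this query is equivalent to the filter/select chain shown earlier.
older = spark.sql("SELECT name FROM people WHERE age > 30")
older.show()
```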