To get the most out of this book
Basic to intermediate knowledge of data engineering, data science, and SQL analytics is expected. General proficiency in a programming language, especially Python, and a working knowledge of data analytics using frameworks such as pandas and SQL will help you get the most out of this book.
This book uses Databricks Community Edition to run all code: https://community.cloud.databricks.com. Sign-up instructions can be found at https://databricks.com/try-databricks.
The entire code base used in this book can be downloaded from https://github.com/PacktPublishing/Essential-PySpark-for-Scalable-Data-Analytics/blob/main/all_chapters/ess_pyspark.dbc.
The datasets used in this book can be found at https://github.com/PacktPublishing/Essential-PySpark-for-Data-Analytics/tree/main/data.
If you are using the digital version of this book, we advise you to type the code yourself or access it from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid potential errors related to copying and pasting code.