Working with Spark data
When working with large datasets, we sometimes need to rely on distributed resources to clean and manipulate our data. With Apache Spark, analysts can take advantage of the combined processing power of many machines. We will use PySpark, a Python API for working with Spark, in this recipe. We will also go over how to use PySpark tools to take a first look at our data, select parts of our data, and generate some simple summary statistics.
Getting ready
To run the code in this section, you need to get Spark running on your computer. If you have installed Anaconda, you can follow these steps to work with Spark:
- Install Java with conda install openjdk.
- Install PySpark with conda install pyspark or conda install -c conda-forge pyspark.
- Install findspark with conda install -c conda-forge findspark.
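Once these installs finish, a quick way to confirm everything works is to start a local Spark session and look at a small file. The following is a minimal sketch, not part of the recipe proper: the CSV path and the column names in the select call are placeholders, so adjust them to a file you actually have.

```python
import findspark
findspark.init()  # locate the Spark installation and set the environment variables

from pyspark.sql import SparkSession

# start (or reuse) a local Spark session
spark = SparkSession.builder \
    .appName("datacleaning") \
    .getOrCreate()

# "mydata.csv" is a placeholder; point this at any CSV file on your machine
df = spark.read.csv("mydata.csv", header=True, inferSchema=True)

df.printSchema()          # take a first look at the columns and their types
df.select(df.columns[:2]).show(5)   # select a couple of columns and peek at a few rows
df.describe().show()      # simple summary statistics for the numeric columns

spark.stop()
```

If findspark.init() raises an error, Spark is not where findspark expects it; that is usually an environment-variable problem, which the note below touches on.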
Note
Installation of PySpark can be tricky, particularly setting the necessary environment variables. While
findspark...