3. Data Preparation
Activity 3.01: Using PySpark for a Simple ETL Job to Find Netflix Shows for All Ages
Solution
- Create a directory called Activity03.01 in the Chapter03 directory to store the files for this activity.
- Open your Terminal (macOS or Linux) or Command Prompt (Windows), navigate to the Chapter03 directory, and type jupyter notebook.
- Select the Activity03.01 directory, then click New -> Python3 to create a new Python 3 notebook.
- If you have done Exercise 3.02, Building an ETL Job Using Spark, PySpark is already installed on your local machine. If not, install PySpark with the following lines:
    import sys
    !conda install --yes --prefix {sys.prefix} \
        -c conda-forge pyspark
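If you are unsure whether PySpark is already present, a quick check along the following lines may help before you run the install step. This is a minimal sketch using only the standard library. The helper name is_installed is hypothetical and is not part of the activity's solution.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    # find_spec returns None when the module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if is_installed("pyspark"):
    print("PySpark is already installed; the install step can be skipped.")
else:
    print("PySpark was not found; run the conda install lines.")
```

The check only confirms that the package can be imported from the notebook's environment. It does not verify that a Java runtime is available, which Spark also needs.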
- Connect to a Spark cluster or a local instance using the following code:
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, split, size
    spark = SparkSession.builder.appName("Packt").getOrCreate()
...