Setting up the Jupyter Notebook
The following steps are required before getting started with the exercises:
Import all the required modules and packages in the Jupyter notebook:
import findspark
findspark.init()
import pyspark
import random
Now, use the following command to set up SparkContext:
from pyspark import SparkContext
sc = SparkContext()
Similarly, use the following command to set up SQLContext in the Jupyter notebook:
from pyspark.sql import SQLContext
sqlc = SQLContext(sc)
Note
Make sure the spark-csv reader package from Databricks (https://databricks.com/) is installed and ready before executing the next command. If it is not, launch PySpark with the package specified so that it is downloaded automatically:
pyspark --packages com.databricks:spark-csv_2.10:1.4.0
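When PySpark is started from inside a Jupyter notebook rather than from the pyspark shell, the same package can be requested through the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. The following is a minimal sketch of that approach, assuming the same package coordinates as above:

```python
import os

# Ask PySpark to fetch the Databricks spark-csv package when the
# SparkContext is created. The trailing 'pyspark-shell' token is required
# when these arguments are supplied from Python instead of the command line.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell'
)

print(os.environ['PYSPARK_SUBMIT_ARGS'])
```

Set this in the first cell of the notebook, before any SparkContext is constructed; otherwise the setting has no effect on the running context.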
Read the Iris dataset from the CSV file into a Spark DataFrame:
df = sqlc.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('/Users/iris.csv')
Use the following command to display the first five rows of the DataFrame and verify that the data loaded correctly:
df.show(5)
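If the load fails or the schema looks wrong, a quick local check of the CSV header can confirm what header='true' and inferschema='true' will see. The sketch below is self-contained, so it reads a small inline stand-in for the first lines of the file; the column names follow the standard Iris layout and are assumptions, as is the path, so replace the StringIO with open('/Users/iris.csv') to check the real file:

```python
import csv
import io

# Inline stand-in for the first few lines of iris.csv, so this sketch
# runs on its own; swap in open('/Users/iris.csv') for the real check.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
)

reader = csv.reader(sample)
header = next(reader)   # first row: the column names Spark will use
rows = list(reader)     # remaining rows: the actual data records

print(header)
print(len(rows), "data rows in the sample")
```

If the first row printed here is data rather than column names, drop header='true' from the options so Spark does not discard a record.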