Spark Streaming code walk-through
You can download the complete PySpark script from our GitHub repository. The following is a walk-through of the primary functions of the script.
The following getSparkSessionInstance() function is a user-defined helper that returns the existing SparkSession, so that other user-defined functions reuse it instead of creating a duplicate instance:
# Get existing SparkSession
def getSparkSessionInstance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    return globals()["sparkSessionSingletonInstance"]
The following processRecords() function is a user-defined function that is invoked for each RDD of the Kinesis stream. It parses the records of the RDD and writes them to Amazon S3 in Parquet format, partitioned by the year, month, date, and hour columns:
# Process...