Reading JSON data with Apache Spark
In this recipe, we will learn how to ingest and load JSON data with Apache Spark. We will also cover some common data engineering tasks that involve JSON data.
How to do it...
- Import libraries: Import the required libraries and create a `SparkSession` object:

  ```python
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import *

  spark = (SparkSession.builder
           .appName("read-json-data")
           .master("spark://spark-master:7077")
           .config("spark.executor.memory", "512m")
           .getOrCreate())
  spark.sparkContext.setLogLevel("ERROR")
  ```
- Load the JSON data into a Spark DataFrame: The `read` method of the `SparkSession` object can be used to load JSON data from a file or a directory. The `multiLine` option is set to `true` to parse records that span multiple lines. We need to pass the path to the JSON file as a parameter:

  ```python
  df = ...
  ```