Importing unstructured data without a schema
As seen before, unstructured (NoSQL) data is information that does not follow a fixed format such as a relational or tabular layout. It can take the form of images, videos, metadata, transcripts, and so on. The ingestion process usually starts from a JSON file or a document collection, as we saw previously when ingesting data from MongoDB.
In this recipe, we will read a JSON file and transform it into a DataFrame without a schema. Although unstructured data is supposed to have a more flexible design, we will see some implications of not having any schema or structure in our DataFrame.
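To make the idea of "no schema" concrete, here is a minimal plain-Python sketch (it does not use Spark) of what it means for types to come from the data rather than from a declared structure. The records and field names below are hypothetical and are not the contents of holiday_brazil.json.

```python
import json

# Hypothetical sample records, invented for illustration only.
raw = """
[
  {"name": "New Year", "date": "2023-01-01", "year": 2023},
  {"name": "Carnival", "date": "2023-02-20", "year": "2023"}
]
"""


def infer_types(records):
    """Collect the set of Python type names observed for each field."""
    observed = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    return observed


records = json.loads(raw)
inferred = infer_types(records)

# Print each field with the types that actually appeared in the data.
for field, types in sorted(inferred.items()):
    print(field, sorted(types))
```

Two things stand out in this sketch: the date field arrives as a plain string because JSON has no date type, and the year field shows up with two different types. Any tool that infers a schema from such data has to settle on a single type for each column, which may not be the one we would have declared ourselves.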
Getting ready…
Here, we will use the holiday_brazil.json file to create the DataFrame. You can find it in the GitHub repository here: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.
We will use SparkSession to read the JSON file and create a DataFrame, so make sure the session is up and running.
All code can be...