Ingesting unstructured data with a well-defined schema and format
In the previous recipe, Importing unstructured data without a schema, we read a JSON file without applying any schema or formatting. The resulting output was hard to read, which can cause confusion and require additional work later in the data pipeline. Although that example uses a JSON file, the same applies to any other NoSQL or unstructured data that needs to be converted into analytical data.
The objective is to continue from the last recipe and apply a schema and a standard format to our data, making it more legible and easier to process in the subsequent phases of ETL.
Getting ready
This recipe has the exact same requirements as the Importing unstructured data without a schema recipe.
How to do it…
Follow these steps to complete this recipe:
- Importing data types: As usual, let’s start by importing our data types from the PySpark library:
from pyspark.sql.types...