Applying schemas to analytical data
In the previous chapter, we saw how to apply schemas to structured and unstructured data, but the application of a schema is not limited to raw files.
Even when working with already processed data, there will be cases where we need to cast the values of a column or rename columns so that another department can use them. In this recipe, we will learn how to apply a schema to Parquet files and how it works.
Getting ready
We will need a SparkSession for this recipe. Ensure you have a session up and running. We will use the same dataset as in the Ingesting Parquet files recipe.
Feel free to execute the code using a Jupyter notebook or your PySpark shell session.
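If you do not already have a session running, the following minimal sketch starts a local one. The application name and master URL are illustrative choices, not part of the recipe; adjust them for your environment:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local session; change master() if you run on a cluster.
spark = (
    SparkSession.builder
    .appName("schema-analytical-data")  # hypothetical app name
    .master("local[*]")
    .getOrCreate()
)
```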
How to do it…
Here are the steps to perform this recipe:
- Looking at our columns: As seen in the Ingesting Parquet files recipe, we can list the columns and their inferred data types. You can see the list as follows:
VendorID: long
tpep_pickup_datetime...
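One way to produce a listing like the one above is sketched below. The Parquet path is a placeholder for wherever you saved the dataset from the Ingesting Parquet files recipe:

```python
# Placeholder path; point it at your copy of the dataset.
df = spark.read.parquet("path/to/yellow_tripdata.parquet")

# df.dtypes returns (column name, data type) pairs, e.g. ('VendorID', 'bigint').
for name, dtype in df.dtypes:
    print(f"{name}: {dtype}")

# Alternatively, printSchema() shows the same information as a tree,
# including the nullability of each column.
df.printSchema()
```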