Chapter 5: Data Processing and Transformations
Now that we have our initial raw dataset, we can start transforming the data into its final state. Processing and transformation form the core of any data pipeline, and the output often needs to be separated into multiple subsets for different applications.
The core data processing is the simplest part of this work, and we started on it in Chapter 4, Sourcing the Data, where we began building the pipeline by taking the raw data, cleansing the titles and information headers, and setting the data types. That gives us only an initial dataset to work with, not a final dataset ready for use. When we look at the column headers, we see that the columns are made up of three different datasets. Additionally, the records span multiple time periods: annual, quarterly, and monthly.
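As a rough illustration of the ideas in this paragraph, the following minimal sketch cleans the column headers, sets a numeric data type, and separates records by time period. It is not the pipeline from Chapter 4: the sample rows, the column names, the period label formats (`YYYY`, `YYYY-Qn`, `YYYY-MM`), and the helper functions `clean_header`, `classify_period`, and `split_by_frequency` are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical raw records: column names and period labels are invented
# for illustration and do not come from the chapter's dataset.
RAW_ROWS = [
    {" Period ": "2020", "Total Sales ($)": "1200.5"},
    {" Period ": "2020-Q1", "Total Sales ($)": "300.0"},
    {" Period ": "2020-01", "Total Sales ($)": "100.25"},
]


def clean_header(name: str) -> str:
    """Lower-case a header and collapse non-alphanumeric runs into underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def classify_period(label: str) -> str:
    """Guess the frequency from an assumed label format."""
    if re.fullmatch(r"\d{4}", label):
        return "annual"
    if re.fullmatch(r"\d{4}-Q[1-4]", label):
        return "quarterly"
    if re.fullmatch(r"\d{4}-(0[1-9]|1[0-2])", label):
        return "monthly"
    return "unknown"


def split_by_frequency(rows):
    """Clean each row, cast the value column to float, and group rows by frequency."""
    subsets = defaultdict(list)
    for row in rows:
        cleaned = {clean_header(key): value for key, value in row.items()}
        # Assumed value column; cast from text to a numeric type.
        cleaned["total_sales"] = float(cleaned["total_sales"])
        subsets[classify_period(cleaned["period"])].append(cleaned)
    return dict(subsets)


if __name__ == "__main__":
    for frequency, records in split_by_frequency(RAW_ROWS).items():
        print(frequency, records)
```

Keeping annual, quarterly, and monthly records in separate subsets avoids mixing values measured over different time spans in the same aggregation.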
Our next step will be to improve the dataset to provide a more relevant...