Introduction to data preparation
Data preparation can be defined as the process of sanitizing and normalizing the dataset using a combination of transformations to prepare the data for downstream consumers. In a typical data integration workflow, prepared data is consumed by analytics applications, visualization tools, and machine learning pipelines. It is not uncommon for the prepared data to be ingested by other data processing pipelines, depending on the requirements of the consuming entity.
When we consider a typical data integration workflow, quite often, data preparation is one of the more challenging and time-consuming tasks. It is important to ensure the data is prepared correctly according to the requirements as this impacts the subsequent steps in the data integration workflow significantly.
The complexity of the data preparation process depends on several factors, such as the schema of the source data, schema drift, the volume of data, the transformations to be applied...