The data cleansing process
The data cleansing process is built around identifying the records that are useful for the intended purpose and enriching the dataset with any fields that might be valuable. We can achieve this in two ways:
- By modifying the existing dataset
- Or by adding additional data to the dataset
Of those two options, adding additional data is effectively an extension of modifying the dataset: it combines multiple data pipelines into a single, cohesive pipeline.
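As a rough illustration of adding data to a dataset, the sketch below joins fields from a second dataset onto the first using a shared key. It uses plain Python rather than any particular tool, and the sample records, the field names, and the enrich helper are all hypothetical, invented only for this example.

```python
# Hypothetical sample data; none of these names come from the text.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 120.0},
    {"order_id": 2, "customer_id": "C2", "amount": 75.5},
    {"order_id": 3, "customer_id": "C9", "amount": 20.0},
]
customers = [
    {"customer_id": "C1", "region": "North"},
    {"customer_id": "C2", "region": "South"},
]


def enrich(records, lookup_rows, key):
    """Left-join the fields of lookup_rows onto each record by a shared key."""
    lookup = {row[key]: row for row in lookup_rows}
    enriched = []
    for record in records:
        extra = lookup.get(record[key], {})
        # The record's own fields win on a name clash;
        # records with no match pass through unchanged.
        enriched.append({**extra, **record})
    return enriched


enriched_orders = enrich(orders, customers, "customer_id")
```

Every order is kept, and an order whose customer has no match (C9 here) simply gains no new fields. Whether unmatched records should be kept or dropped depends on the use case.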
When modifying the existing dataset, the transformations fall under four primary processes:
- Selecting the columns of interest
- Filtering the relevant rows
- Creating and modifying columns with formulas
- Summarizing the dataset to a more relevant level of granularity
Each of these steps focuses on transforming the dataset according to your use case and solving your data question.
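The four processes above can be sketched in a few lines of plain Python. This is only a minimal illustration under assumed data: the sales records, the column names, and the revenue formula are hypothetical and are not taken from the text.

```python
from collections import defaultdict

# Hypothetical sample dataset, invented for illustration.
sales = [
    {"store": "A", "product": "tea", "units": 3, "unit_price": 2.0, "notes": "promo"},
    {"store": "A", "product": "coffee", "units": 0, "unit_price": 3.0, "notes": ""},
    {"store": "B", "product": "tea", "units": 4, "unit_price": 2.0, "notes": ""},
    {"store": "A", "product": "tea", "units": 2, "unit_price": 2.0, "notes": ""},
]

# 1. Select the columns of interest.
columns = ("store", "units", "unit_price")
selected = [{c: row[c] for c in columns} for row in sales]

# 2. Filter the relevant rows (here: rows that recorded at least one unit).
filtered = [row for row in selected if row["units"] > 0]

# 3. Create a new column with a formula.
with_revenue = [
    {**row, "revenue": row["units"] * row["unit_price"]} for row in filtered
]

# 4. Summarize to a coarser level of granularity (one total per store).
revenue_by_store = defaultdict(float)
for row in with_revenue:
    revenue_by_store[row["store"]] += row["revenue"]
summary = dict(revenue_by_store)
```

Each step takes the output of the previous one, so the order matters: filtering before summarizing, for example, keeps the zero-unit row out of the store totals.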
Selecting columns
Selecting the relevant columns in a dataset is achieved...