Cleansing data
Cleansing data is an important activity in any data pipeline. As data flows in from various sources, there are chances that the data won't adhere to the schemas and there might be missing values, non-standard entries, duplicates, and so on. During the cleansing phase, we try to correct such anomalies. Let's look at a few common data cleansing techniques.
Handling missing/null values
We can handle missing or Null
values in multiple ways. We can choose to filter out such rows, substitute missing values with default values, or substitute them with some meaningful values such as mean, median, average, and so on. Let's look at an example of how we can replace missing values with some default string such as 'NA'
.
Substituting with default values
Substituting with default values can be achieved using the Derived columns transformation, as shown in the following screenshot: