Initial cleansing of datasets
We now have an initial dataset. We can keep it as a raw dataset, but in its current form it is not a source that our end users can extract value from. For example, when we investigate the public consumer price inflation dataset (which we downloaded in the Integrating public data sources with Download tool use section), all the fields are text fields because the source is a CSV text file. In contrast, the Google Places API data is a complete JSON file, but it is not arranged into usable tables. In both situations, applying statistical process controls (SPCs) is difficult because the data types don't support the appropriate statistical measures.
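To make the typing problem concrete outside of a workflow tool, here is a minimal Python sketch using only the standard library. The sample rows, the column names (period, index_value), and the simplified JSON payload are hypothetical stand-ins for illustration; they are not taken from the actual inflation file or from a real Google Places response.

```python
import csv
import io
import json
import statistics

# Hypothetical sample rows standing in for the inflation CSV.
RAW_CSV = "period,index_value\n2021-01,109.3\n2021-02,109.4\n2021-03,109.7\n"

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# A CSV reader returns every field as text, including the numbers.
raw_values = [row["index_value"] for row in rows]
all_text = all(isinstance(value, str) for value in raw_values)

# A statistical measure only becomes meaningful after conversion to a numeric type.
numeric_values = [float(value) for value in raw_values]
mean_value = statistics.mean(numeric_values)

# Hypothetical, heavily simplified JSON payload with nested records.
RAW_JSON = """
{"results": [
  {"name": "Cafe A", "geometry": {"location": {"lat": 51.5, "lng": -0.12}}},
  {"name": "Cafe B", "geometry": {"location": {"lat": 51.6, "lng": -0.10}}}
]}
"""

payload = json.loads(RAW_JSON)

# Flatten each nested record into one table-like row.
places = [
    {
        "name": place["name"],
        "lat": place["geometry"]["location"]["lat"],
        "lng": place["geometry"]["location"]["lng"],
    }
    for place in payload["results"]
]

print(all_text, mean_value)
print(places)
```

The same two ideas, converting text fields to proper data types and flattening nested records into rows, are what the cleansing process has to achieve, whichever tool performs it.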
To cleanse our dataset, we will use a generic cleansing process: we will take the concepts needed to cleanse any dataset and apply them to our example raw file.
A simple cleansing process
The preview of our dataset in Figure 4.15 shows four initial problems that we want to address:
- Titles...