11.2 Overall approach
For reference, see Chapter 9, Project 3.1: Data Cleaning Base Application, specifically the Approach section. That design suggests the clean module should need only minimal changes from the earlier version.
A cleaning application will have several separate views of the data. There are at least four viewpoints:
The source data. This is the original data as managed by the upstream applications. In an enterprise context, this may be a transactional database with business records that are precious and part of day-to-day operations. The data model reflects considerations of those day-to-day operations.
Data acquisition interim data, usually in a text-centric format. We’ve suggested using ND JSON for this because it allows a tidy dictionary-like collection of name-value pairs, and supports quite complex Python data structures. In some cases, we may perform some summarization of this raw data to standardize scores. This data may be used to diagnose and debug problems with upstream...