Running a Statistics node on anti-join to evaluate the potential missing data
There is typically some data loss when various data tables are integrated. Although we won't discuss data integration until a later chapter, it is important to gauge what (and how much) is lost at this stage. Financial variables are usually aggregated in very different ways for the financial planner and the data miner. It is critical that the data miner periodically translate the data of the data miner back into the form that middle and senior management will recognize so that they can better communicate.
The data miner deals with transactions and individual customer data, the language of individual rows of data. The manager speaks, generally, the language of spreadsheets: regions, product lines, months rolled up into aggregated cells in Excel.
On a project, we once discovered that a small percentage of missing rows represented a larger fraction of revenue than average—much larger actually. We suddenly...