Using an empty aggregate to evaluate sample size
Having all the data made available is usually not the data miner's challenge; the challenge is having enough of the right data. The data needs to be relevant to the business question, and to come from an appropriate time period. Many users of Modeler might not realize that an Aggregate node can be useful even when all you have done is drag it into place, giving Modeler no further instruction.
At times data preparation requires the number of records in a dataset as a data item to be used in further calculations. This recipe shows how to use the Aggregate node with no aggregation key and no aggregation operations to produce this count, and how to merge the count into every record using a Cartesian product so that it is available for further calculations.
Getting ready
This recipe uses the cup98lrn reduced vars2.txt data set. Since this recipe produces a fairly simple stream, we will build the stream from scratch.
How to do it...
To use an empty Aggregate node to evaluate sample size:
- Place a new Var. File source node on the canvas of a new stream. The file name is cup98lrn reduced vars2.txt. Confirm that the data is being accessed properly.
- Add both an Aggregate node and a Table node downstream of the source. You do not need to edit either of the nodes.
- Run the stream and confirm the result. Total sample size is 95412.
- Now, add a Type node and a Distinct node in between the source and the Aggregate node. Move the variable CUST_ID into the Key fields for grouping box.
- Select Discard only the first record in each group.
- Run it and confirm that the result is 0. You have learned that there are no duplicates at the customer level.
- Place a Merge node so that it combines the original source with the output of the empty Aggregate node.
- Within the Merge node choose Full Outer Join.
- You have just successfully added the total sample size to the data set, where it can be used for further calculation, as needed. Because the empty Aggregate produces a single row, the merge attaches the count to every record; a pandas equivalent is sketched below.
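For readers who want to replicate the logic outside Modeler, the following is a minimal pandas sketch of the same steps. It is an equivalent in spirit, not the Modeler stream itself; the tab delimiter and the CUST_ID column name are assumptions carried over from the recipe.

```python
import pandas as pd

# Read the source data (the delimiter is an assumption; the recipe's
# Var. File node reads a delimited text file)
df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t", low_memory=False)

# An "empty aggregate" -- no key fields, no summary statistics -- collapses
# the data to a single row holding only the record count
counts = pd.DataFrame({"Record_Count": [len(df)]})
print(counts)  # the recipe expects 95,412 records

# Duplicate check at the customer level: keep="first" flags every row after
# the first occurrence of a CUST_ID, mimicking Distinct with
# "Discard only the first record in each group"
n_dupes = df.duplicated(subset="CUST_ID", keep="first").sum()
print(n_dupes)  # the recipe expects 0

# A cross join (Cartesian product) with the one-row count attaches the
# total sample size to every record, like the Merge node step
# (requires pandas 1.2+ for how="cross")
df_with_n = df.merge(counts, how="cross")
```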
How it works...
An Aggregate node typically uses a categorical variable to define the rows of its output, which is always a reduction in the number of rows. Scale variables can be placed in the Aggregate fields area, and summary statistics are calculated for them; average sales in columns, arranged with regions in rows, would be a typical example. Having been given none of these instructions, the Aggregate node boils our data down to a single row. Having been given no summary statistics to report at all, it falls back on its default instruction, Include record count in field, which is checked by default at the bottom of the Aggregate node's dialog. While this recipe is quite easy, this default behavior sometimes surprises new users.
There's more...
Now let's look at some other options and some general points that are relevant to this task.
If you are merging many sources of data, as will often be the case, you should check the sample size for each source, and for the combined sources as well. If you obtained the data from a colleague, you should confirm that the sample size and the absence (or presence) of duplicate IDs are consistent with expectations.
When duplicates are present, and you therefore get a non-zero count, you can remove the Aggregate node and export the duplicates. You will get the second row (or third, or more) of each duplicate. You can then look up those IDs and verify whether or not they should be in the data set.
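The same export is easy to sketch in pandas; keep="first" yields exactly the second (and later) rows of each duplicated ID. The file name and CUST_ID column are assumptions as before, and the output file name is hypothetical.

```python
import pandas as pd

df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t", low_memory=False)

# Rows flagged here are the second (or third, ...) occurrence of each
# CUST_ID; export them so the IDs can be looked up and verified
dupes = df[df.duplicated(subset="CUST_ID", keep="first")]
dupes.to_csv("duplicate_customers.csv", index=False)
```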
A modified version
A modified version of this technique can be helpful when you have a nominal variable with lots of categories, such as STATE. Simply make that variable your key field.
Additionally, it is wise to sort on Record_Count with a Sort node (not shown). The results show us that California has enough donors that we might be able to compare California to other states, but the data in New England is thin. Perhaps we need to group those states into a broader region variable.
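Keying the Aggregate on STATE and sorting on Record_Count corresponds to a grouped count sorted in descending order. A minimal sketch under the same file and column-name assumptions:

```python
import pandas as pd

df = pd.read_csv("cup98lrn reduced vars2.txt", sep="\t", low_memory=False)

# Aggregate keyed on STATE, then sort descending on the record count --
# the same combination as the Aggregate node plus a Sort node
state_counts = (
    df.groupby("STATE")
      .size()
      .reset_index(name="Record_Count")
      .sort_values("Record_Count", ascending=False)
)
print(state_counts.head(10))  # well-represented states first
```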
The same issue can arise in other data sets with any variable of this kind, such as product category or sales district. In some cases, you may conclude that certain categories are out of the scope of the analysis. That is not likely in this instance, but there are times when you conclude that certain categories are so poorly represented that they warrant a separate analysis. Only the business problem can guide you; this is merely a method for determining what raw material you have to work with.
See also
- Chapter 4, Data Preparation – Construct