Performing numerosity data reduction
When we need to reduce the number of data objects (rows) as opposed to the number of attributes (columns), we have a case of numerosity reduction. In this section, we will cover three methods: random sampling, stratified sampling, and random over/undersampling. Let's start with random sampling.
Random sampling
Randomly selecting a subset of the rows to include in the analysis is known as random sampling. We typically resort to random sampling when we run into computational limitations, that is, when the data is larger than our computing resources can comfortably handle. In those situations, we can randomly select a subset of the data objects and perform the analysis on that subset, as shown in the sketch below. Let's look at an example.
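For instance, with pandas a random subset of the rows can be drawn with DataFrame.sample(). This is only a minimal sketch; the file name, the 20% sampling fraction, and the random seed are illustrative assumptions:

```python
import pandas as pd

# Load a (potentially large) dataset
df = pd.read_csv('Customer Churn.csv')

# Randomly keep 20% of the rows; random_state makes the draw reproducible
df_sample = df.sample(frac=0.2, random_state=1)

print(len(df), len(df_sample))
```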
Example – random sampling to speed up tuning
In this example, we use Customer Churn.csv to train a decision tree that predicts (classifies) which customers will churn in the future...
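To illustrate the idea, the following sketch tunes a decision tree's hyperparameters on a 10% random sample of the data rather than the full dataset, which can substantially cut the tuning time. The target column name ('Churn'), the parameter grid, and the sampling fraction are assumptions for illustration, not the book's exact code:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

customer_df = pd.read_csv('Customer Churn.csv')

# Tune on a 10% random sample instead of the full data
sample_df = customer_df.sample(frac=0.1, random_state=1)

X = sample_df.drop(columns=['Churn'])   # assumed target column name
y = sample_df['Churn']

# Illustrative hyperparameter grid for the decision tree
param_grid = {'max_depth': [3, 5, 10, None],
              'min_samples_split': [2, 10, 50]}

search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
```

Once the best hyperparameters are found on the sample, the final model can be retrained on the full dataset with those settings.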