Sampling and balancing classification data
Sampling can be used to shrink a dataset for faster code development or to balance the classes within it. We can also use synthetic sampling techniques, such as SMOTE and ADASYN. We'll start with the simplest methods, which are downsampling and naive upsampling.
Downsampling
To simply shrink the size of our dataset while preserving the class balance, we can use train_test_split:
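from sklearn.model_selection import train_test_split

# keep the 10% "test" split as our downsampled data and discard the rest
_, x_sample, _, y_sample = train_test_split(features,
                                            targets,
                                            test_size=0.1,
                                            stratify=targets,
                                            random_state=42)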
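To try the call above in isolation, here is a minimal sketch that runs it on synthetic data. The features and targets arrays are hypothetical stand-ins (1,000 rows with a 90/10 class split), not the dataset used in this chapter, and full_balance and sample_balance are names introduced only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 rows, 90% class 0 and 10% class 1.
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 3))
targets = np.array([0] * 900 + [1] * 100)

# Keep the 10% "test" split as the downsampled data and discard the rest.
_, x_sample, _, y_sample = train_test_split(
    features,
    targets,
    test_size=0.1,
    stratify=targets,
    random_state=42,
)

# Class proportions in the full data and in the downsampled data.
full_balance = np.bincount(targets) / targets.shape[0]
sample_balance = np.bincount(y_sample) / y_sample.shape[0]
print(full_balance, sample_balance)
```

The two printed arrays should match closely. Dropping stratify=targets from the call would leave the sample's class proportions to chance.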
The stratify argument is key here so that our targets retain the same balance. We can confirm that the class balance has been retained with np.bincount(y_sample) / y_sample.shape[0] and train_targets.value_counts(normalize=True), which calculate the...