Applying sampling techniques to address class imbalance
Consider a cell culture dataset that is analyzed with machine learning (ML) algorithms to predict the onset of cancer. Most cells are normal; only a small percentage are abnormal. The two classes here are "normal" and "abnormal," and because one heavily outnumbers the other, the dataset is imbalanced. The same problem arises in multi-class datasets: an imbalance occurs whenever one or more classes make up a much smaller proportion of the training data than the others. Because the ML process "learns" from the examples it is given, the model sees a great deal about normal cells and very little about cancerous ones. Most classification algorithms are designed and demonstrated on problems that assume a roughly equal distribution of classes, and they aim to maximize overall accuracy and reduce error. On an imbalanced dataset, the consequence is a model that is biased toward the majority class. Sometimes, it goes undetected and...
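To make the bias concrete, here is a minimal sketch in plain Python. It assumes a made-up dataset of 950 "normal" and 50 "abnormal" cells with a single placeholder feature; the helper names `majority_baseline_metrics` and `random_oversample` are hypothetical and not taken from any library. The first helper shows how a model that always predicts the majority class can still report high accuracy, and the second illustrates one simple sampling technique, random oversampling of the minority class.

```python
import random
from collections import Counter


def majority_baseline_metrics(labels, minority="abnormal"):
    """Score a 'model' that always predicts the most common class."""
    majority = Counter(labels).most_common(1)[0][0]
    predictions = [majority] * len(labels)
    # Accuracy: fraction of all predictions that match the true label.
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    # Recall on the minority class: fraction of minority cases that were found.
    minority_total = sum(y == minority for y in labels)
    minority_hits = sum(p == y == minority for p, y in zip(predictions, labels))
    recall = minority_hits / minority_total if minority_total else 0.0
    return accuracy, recall


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_samples, out_labels = [], []
    for y, rows in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: 950 normal cells, 50 abnormal cells,
# each with one made-up numeric feature.
labels = ["normal"] * 950 + ["abnormal"] * 50
samples = [[float(i)] for i in range(len(labels))]

accuracy, recall = majority_baseline_metrics(labels)
print(f"accuracy={accuracy:.2f}, abnormal recall={recall:.2f}")
# prints: accuracy=0.95, abnormal recall=0.00

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

The baseline scores 95% accuracy while missing every abnormal cell, which is why accuracy alone can hide the bias. In practice, resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class proportions.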