Sources of bias
Bias can enter a machine learning life cycle at several points. It can exist in the collected data, or be introduced during data subsampling, cleaning, and filtering, or during model training and selection. Here, we will review examples of these sources to help you better understand how to avoid or detect such biases throughout the life cycle of a machine learning project.
Biases introduced in data generation and collection
The data that we feed into our models could already be biased before the modeling starts. The first source of such bias we want to review here is dataset size. Consider a dataset as a sample of a bigger population – for example, a survey of 100 students or the loan application information of 200 customers of a bank. With so few data points, random sampling variation alone can make the sample differ noticeably from the population, which increases the chance of bias. Let’s simulate this with simple random data generation. We will write a function that generates two vectors of random binary...
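The dataset-size effect described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not necessarily the function this chapter builds: the names `random_binary_vectors` and `mean_rate_gap` are hypothetical, both vectors are assumed to be independent fair-coin draws, and the "gap" is taken to be the absolute difference between their fractions of ones. In the underlying population that difference is zero, so any gap we observe comes purely from sampling.

```python
import numpy as np


def random_binary_vectors(size, rng):
    """Draw two independent vectors of 0/1 values, each with P(1) = 0.5 (assumed)."""
    a = rng.integers(0, 2, size=size)
    b = rng.integers(0, 2, size=size)
    return a, b


def mean_rate_gap(size, n_trials=1000, seed=0):
    """Average absolute difference in the fraction of ones between the two vectors."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_trials):
        a, b = random_binary_vectors(size, rng)
        gaps.append(abs(a.mean() - b.mean()))
    return float(np.mean(gaps))


# Compare small samples (like the 100 students or 200 customers above)
# with a much larger one.
for size in (100, 200, 10_000):
    print(size, round(mean_rate_gap(size), 4))
```

With this setup, the average gap shrinks as the sample size grows, roughly in proportion to one over the square root of the size. A small dataset can therefore show an apparent difference between two groups even when none exists in the population.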