Detecting bias in datasets and explaining predictions with SageMaker Clarify
A machine learning (ML) model is only as good as the dataset it was built from. If a dataset is inaccurate or unfair in representing the reality it's supposed to capture, a corresponding model is very likely to learn this biased representation and perpetuate it in its predictions. As ML practitioners, we need to be aware of these problems, understand how they impact predictions, and limit that impact whenever possible.
In this example, we'll work with the Adult Data Set, available at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml, Dua, D. and Graff, C., 2019). This dataset describes a binary classification task, where we try to predict if an individual earns less or more than $50,000 per year. Here, we'd like to check whether this dataset introduces gender bias or not. In other words, does it help us build models that predict equally well for men and women?
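Before bringing Clarify into the picture, it's useful to eyeball the label distribution per gender group directly. The following is a minimal sketch, assuming pandas is available, that the column names follow the adult.names description in the UCI repository, and that the training split lives at its usual adult.data location; it simply compares how often each group carries the positive (>$50K) label.

import pandas as pd

# Column names as documented in the adult.names file (assumed here)
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Load the training split of the Adult Data Set (assumed file location)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
data = pd.read_csv(url, names=columns, skipinitialspace=True)

# Fraction of each gender group labeled as earning more than $50K per year
positive_rate = (
    data.assign(high_income=data["income"] == ">50K")
        .groupby("sex")["high_income"]
        .mean()
)
print(positive_rate)

A large gap between the two rates is a first hint of label imbalance across groups; Clarify formalizes this kind of check with proper bias metrics, as we'll see next.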
Note
...