Chapter 4: Data Preparation at Scale Using Amazon SageMaker Data Wrangler and Processing
So far, we've identified our dataset and explored both manual and automated labeling. Now it's time to turn our attention to preparing the data for training. Data scientists are familiar with the steps of feature engineering, such as scaling numeric features, encoding categorical features, and dimensionality reduction.
As motivation, let's consider our weather dataset. What if our input dataset is imbalanced or not really representative of the data we'll encounter in production? A model trained on such data will not be as accurate as we'd like, and the consequences can be profound. For example, some facial recognition systems have been trained on datasets weighted toward white faces, with distressing consequences (https://sitn.hms.harvard.edu/flash/2020/racial-discrimination-in-face-recognition-technology/).
We need to understand what input...