In this section, we'll dive into SageMaker's data preparation and feature engineering capabilities. By the end, you should understand when to use SageMaker Ground Truth, Data Wrangler, Processing, Feature Store, and Clarify.
SageMaker Ground Truth
Obtaining labeled data for classification, regression, and other tasks is often the biggest barrier to ML projects. Many companies have a lot of data but have not explicitly labeled it with the business properties they care about, such as whether a record is anomalous or whether a customer has a high lifetime value. SageMaker Ground Truth helps you systematically label data by defining a labeling workflow and assigning labeling tasks to a human workforce.
Over time, Ground Truth can learn to label data automatically while still sending low-confidence results to humans for review. For advanced datasets such as 3D point clouds, which are sets of points in three-dimensional space that capture the shape of objects, Ground Truth offers assistive labeling features. For example, once you label the start and end frames of a sequence, it can add bounding boxes to the frames in between. The following diagram shows an example of labels applied to a dataset:
Figure 1.4 – SageMaker Ground Truth showing the labels applied to sentiment reviews
The data is sourced from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). To counteract individual worker bias or error, a data object can be sent to multiple workers. In this example, we only have one worker, so the confidence score is not used.
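To see why sending one data object to several workers helps, here is a minimal sketch of label consolidation by simple majority vote. This is an illustration only: the `consolidate` function is a hypothetical helper, and Ground Truth's own consolidation logic may differ from a plain vote count. The confidence value here is just the share of workers who agreed with the winning label.

```python
from collections import Counter


def consolidate(annotations):
    """Return (label, confidence) for one data object.

    `annotations` holds one label per worker. Confidence is the fraction
    of workers who chose the winning label (an assumption made for this
    sketch, not Ground Truth's actual scoring).
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    # On a tie, most_common keeps the label that was seen first.
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)


# Three workers disagree, so the confidence drops below 1.0.
three_workers = consolidate(["positive", "positive", "negative"])

# With a single worker there is nothing to compare against, so the
# score is trivially 1.0 and carries no real information.
one_worker = consolidate(["negative"])

print(three_workers)
print(one_worker)
```

With a single worker, as in our example, the score is always 1.0, which is why it is not useful there.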
Note that you can also use Ground Truth in other phases of the ML life cycle; for example, you may use it to check the labels generated by a production model.
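The idea of letting a model label data while routing low-confidence results to humans, whether during initial labeling or when checking a production model, can be sketched as a simple threshold rule. The function name, the tuple layout, and the 0.8 threshold are all assumptions chosen for illustration; they are not Ground Truth settings.

```python
def route_predictions(predictions, threshold=0.8):
    """Split model-labeled items into auto-accepted and human-review lists.

    `predictions` is a list of (item_id, label, confidence) tuples.
    The default threshold of 0.8 is an arbitrary illustrative value.
    """
    auto_labeled, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_labeled.append((item_id, label))
        else:
            needs_review.append(item_id)
    return auto_labeled, needs_review


# Hypothetical model output for three sentiment reviews.
predictions = [
    ("review-1", "positive", 0.97),
    ("review-2", "negative", 0.55),
    ("review-3", "negative", 0.91),
]

auto_labeled, needs_review = route_predictions(predictions)
print(auto_labeled)
print(needs_review)
```

Raising the threshold sends more items to human reviewers, trading labeling cost for label quality.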
SageMaker Data Wrangler
Data Wrangler helps you understand your data and perform feature engineering. It works with data stored in S3 (optionally accessed via Athena) and in Redshift, and it provides typical visualizations and transformations, such as correlation plots and categorical encoding. You can combine a series of transformations into a data flow and export that flow into an MLOps pipeline. The following screenshot shows an example of Data Wrangler information for a dataset:
Figure 1.5 – Data Wrangler displaying summary table information regarding a dataset
You may also use Data Wrangler in the operations phase of the ML life cycle if you want to analyze the data coming into an ML model for production inference.
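To make the two example operations concrete, the following sketch computes a one-hot categorical encoding and a Pearson correlation coefficient in plain Python. It does not use Data Wrangler itself, which performs these steps through its visual interface; the helper names and the toy dataset are hypothetical.

```python
import math


def one_hot(values):
    """Encode a list of category strings as one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists.

    Assumes neither column is constant; a constant column would make
    the denominator zero.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy dataset with one categorical and two numeric columns.
category = ["gift_card", "book", "gift_card", "toy"]
price = [10.0, 20.0, 30.0, 40.0]
rating = [1.0, 2.0, 3.0, 4.0]

encoded = one_hot(category)
corr = pearson(price, rating)

print(encoded)
print(round(corr, 4))
```

A correlation plot is essentially this coefficient computed for every pair of numeric columns and drawn as a grid.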
SageMaker Processing
SageMaker Processing jobs help you run data processing and feature engineering tasks on your datasets. By providing your own Docker image containing your code, or by using a pre-built Spark or scikit-learn container, you can normalize and transform data to prepare your features. The following diagram shows the logical flow of a SageMaker Processing job:
Figure 1.6 – Conceptual overview of a Spark processing job. Spark jobs are particularly handy for processing larger datasets
You may also use processing jobs to evaluate the performance of ML models during the Model Training phase and to check data and model quality in the Model Operations phase.
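As an illustration of the kind of code a processing container might run, here is a minimal sketch of a script that standardizes one numeric column of a CSV file. The function names, file names, and column name are hypothetical. In a real job, the input and output locations would typically be paths under /opt/ml/processing/ inside the container; here we substitute a temporary directory so that the sketch runs locally.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no signal; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def process(input_csv, output_csv, column):
    """Read a CSV, standardize one numeric column, and write the result."""
    with open(input_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    scaled = standardize([float(r[column]) for r in rows])
    for row, value in zip(rows, scaled):
        row[column] = f"{value:.4f}"
    with open(output_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Local dry run: a temporary directory stands in for the container paths.
with tempfile.TemporaryDirectory() as workdir:
    src = Path(workdir) / "input.csv"
    dst = Path(workdir) / "output.csv"
    src.write_text("id,amount\n1,10\n2,20\n3,30\n")
    process(src, dst, column="amount")
    result = dst.read_text()

print(result)
```

Keeping the transformation in a plain function makes it easy to test locally before packaging it into a container.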
SageMaker Feature Store
SageMaker Feature Store helps you organize and share your prepared features. Using a feature store improves quality and saves time by letting you reuse features rather than duplicating complex feature engineering code and repeating computations that have already been done. Feature Store supports both batch and stream storage and retrieval. The following screenshot shows an example of feature group information:
Figure 1.7 – Feature Store showing a feature group with a set of related features
Feature Store also helps during the Model Operations phase, as you can quickly look up complex feature vectors to help obtain real-time predictions.
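The following toy class sketches the idea behind a feature group: each record carries an identifier and an event time, a history of all records is kept for batch use, and the most recent record per identifier is available for fast lookup. This is an in-memory stand-in written for illustration, not the Feature Store API; the class, its methods, and the feature names are all hypothetical.

```python
class ToyFeatureGroup:
    """In-memory stand-in for a feature group (illustrative only)."""

    def __init__(self, record_id_name, event_time_name):
        self.record_id_name = record_id_name
        self.event_time_name = event_time_name
        self.online = {}   # record id -> latest record, for fast lookups
        self.offline = []  # append-only history, for batch retrieval

    def ingest(self, record):
        """Store a record; keep only the newest one per id in `online`."""
        self.offline.append(dict(record))
        key = record[self.record_id_name]
        current = self.online.get(key)
        if (
            current is None
            or record[self.event_time_name] >= current[self.event_time_name]
        ):
            self.online[key] = dict(record)

    def lookup(self, record_id):
        """Return the latest feature vector for an id, or None."""
        return self.online.get(record_id)


customers = ToyFeatureGroup("customer_id", "event_time")
customers.ingest({"customer_id": "c-1", "event_time": 1, "orders_30d": 2})
customers.ingest({"customer_id": "c-1", "event_time": 5, "orders_30d": 7})
# A late-arriving record is kept in the history but does not replace
# the newer record used for lookups.
customers.ingest({"customer_id": "c-1", "event_time": 3, "orders_30d": 4})

latest = customers.lookup("c-1")
print(latest)
print(len(customers.offline))
```

At inference time, a single lookup by identifier returns the precomputed feature vector, so the serving code does not need to repeat the feature engineering.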
SageMaker Clarify
SageMaker Clarify helps you understand model behavior and calculate bias metrics for your data and model. It checks for imbalance in the dataset, for models that give different results based on certain attributes, and for bias that appears due to data drift. It can also use leading explainability algorithms such as SHAP to explain individual predictions, giving you a sense of which features drive model behavior. The following figure shows an example of class imbalance scores for a dataset, where we have many more samples from the Gift Card category than the other categories:
Figure 1.8 – Clarify showing class imbalance scores in a dataset. Class imbalance can lead to biased results in an ML model
Clarify can be used throughout the entire ML life cycle, but consider using it early in the life cycle to detect imbalanced data (datasets that have many examples of one class but few of another).
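To build intuition for a class imbalance score, the following sketch computes a normalized difference in counts between one category and all the others, which yields a value between -1 and 1. This is similar in spirit to the class imbalance metric that Clarify reports, but the function, the sign convention (positive means the chosen category is over-represented), and the toy data are assumptions made for illustration.

```python
def class_imbalance(values, facet_value):
    """Normalized count difference between one category and the rest.

    Returns a number in [-1, 1]. In this sketch, a positive score means
    `facet_value` has more samples than all other categories combined;
    that sign convention is an assumption, not necessarily Clarify's.
    """
    if not values:
        raise ValueError("values must not be empty")
    n_facet = sum(1 for v in values if v == facet_value)
    n_rest = len(values) - n_facet
    return (n_facet - n_rest) / (n_facet + n_rest)


# Hypothetical dataset dominated by one product category.
categories = ["gift_card"] * 8 + ["book"] + ["toy"]

scores = {c: class_imbalance(categories, c) for c in sorted(set(categories))}
print(scores)
```

A score near zero indicates that the category is roughly balanced against the rest of the dataset, while scores near either extreme signal that resampling or collecting more data may be worthwhile.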
Now that we've introduced several SageMaker capabilities for data preparation, let's move on to model-building capabilities.