Chapter 3: Data Labeling with Amazon SageMaker Ground Truth
One of the biggest barriers to ML projects in most companies is access to labeled training data. At one company we worked with, we were trying to identify consumer-impacting outages. The customer had a lot of data from each layer of their application stack, but they couldn't agree on how to define an outage. Is it an outage when a load balancer is down? Probably not, since we have redundancy in the infrastructure layer. Is it an outage when a customer can't access the service for over 10 minutes? That's probably too granular; a single customer might have problems due to local network connectivity issues. So, what exactly do we mean by an outage? How can we automatically label our training data as outage or not an outage?
In this chapter, we'll review labeling data using SageMaker Ground Truth. We'll cover common challenges associated with large datasets and potentially biased data.
The following...