Exploring the capabilities of Amazon SageMaker
Amazon SageMaker was launched at AWS re:Invent 2017. Since then, a lot of new features have been added: you can see the full (and ever-growing) list at https://aws.amazon.com/about-aws/whats-new/machine-learning.
In this section, you'll learn about the main capabilities of Amazon SageMaker and its purpose. Don't worry, we'll dive deep into each of them in later chapters. We will also talk about the SageMaker Application Programming Interfaces (APIs), and the Software Development Kits (SDKs) that implement them.
The main capabilities of Amazon SageMaker
At the core of Amazon SageMaker is the ability to prepare, build, train, optimize, and deploy models on fully managed infrastructure at any scale. This lets you focus on studying and solving the machine learning problem at hand, instead of spending time and resources on building and managing infrastructure. Simply put, you can go from building to training to deploying more quickly. Let's zoom in on each step and highlight relevant SageMaker capabilities.
Preparing
Amazon SageMaker includes powerful tools to label and prepare datasets:
- Amazon SageMaker Ground Truth: Annotate datasets at any scale. Workflows for popular use cases are built in (image detection, entity extraction, and more), and you can implement your own. Annotation jobs can be distributed to workers that belong to private, third-party, or public workforces.
- Amazon SageMaker Processing: Run batch jobs for data processing (and other tasks such as model evaluation) using your own code written with scikit-learn or Spark.
- Amazon SageMaker Data Wrangler: Using a graphical interface, apply hundreds of built-in transforms (or your own) to tabular datasets, and export them in one click to a Jupyter notebook.
- Amazon SageMaker Feature Store: Store your engineered features offline in Amazon S3 to build datasets, or online to use them at prediction time.
- Amazon SageMaker Clarify: Using a variety of statistical metrics, analyze potential bias present in your datasets and models, and explain how your models predict.
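To make the idea of a statistical bias metric more concrete, here's a minimal sketch in plain Python of two pre-training metrics that Clarify documents: class imbalance (CI) and the difference in proportions of labels (DPL). This is not the Clarify API; the function names and the toy dataset are hypothetical and only illustrate the formulas.

```python
def class_imbalance(facet_values, advantaged):
    """CI = (n_a - n_d) / (n_a + n_d), ranging from -1 to 1."""
    n_a = sum(1 for value in facet_values if value == advantaged)
    n_d = len(facet_values) - n_a
    return (n_a - n_d) / (n_a + n_d)


def difference_in_label_proportions(facet_values, labels, advantaged, positive=1):
    """DPL = q_a - q_d, where q is the share of positive labels in each group."""
    a_labels = [l for f, l in zip(facet_values, labels) if f == advantaged]
    d_labels = [l for f, l in zip(facet_values, labels) if f != advantaged]
    q_a = sum(1 for l in a_labels if l == positive) / len(a_labels)
    q_d = sum(1 for l in d_labels if l == positive) / len(d_labels)
    return q_a - q_d


# Hypothetical toy dataset: one sensitive attribute ("a" or "d") and a binary label
facet = ["a", "a", "a", "a", "a", "a", "d", "d"]
label = [1, 1, 1, 1, 0, 0, 1, 0]

ci = class_imbalance(facet, advantaged="a")
dpl = difference_in_label_proportions(facet, label, advantaged="a")
print(f"CI = {ci:.2f}, DPL = {dpl:.2f}")
```

Values close to zero suggest a balanced dataset for that attribute, while values far from zero are a hint that the dataset deserves a closer look.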
Building
Amazon SageMaker provides you with two development environments:
- Notebook instances: Fully managed Amazon EC2 instances that come preinstalled with the most popular tools and libraries: Jupyter, Anaconda, and so on.
- Amazon SageMaker Studio: An end-to-end integrated development environment for machine learning projects, providing an intuitive graphical interface for many SageMaker capabilities. Studio is now the preferred way to run notebooks, and we recommend that you use it instead of notebook instances.
When it comes to experimenting with algorithms, you can choose from the following:
- A collection of 17 built-in algorithms for machine learning and deep learning, already implemented and optimized to run efficiently on AWS. No machine learning code to write!
- A collection of built-in, open source frameworks (TensorFlow, PyTorch, Apache MXNet, scikit-learn, and more), where you simply bring your own code.
- Your own code running in your own container: custom Python, R, C++, Java, and so on.
- Algorithms and pre-trained models from AWS Marketplace for machine learning (https://aws.amazon.com/marketplace/solutions/machine-learning).
- Machine learning solutions and state-of-the-art models available in one click in Amazon SageMaker JumpStart.
In addition, Amazon SageMaker Autopilot uses AutoML to automatically build, train, and optimize models without the need to write a single line of machine learning code.
Training
As mentioned earlier, Amazon SageMaker takes care of provisioning and managing your training infrastructure. You'll never spend any time managing servers, and you'll be able to focus on machine learning instead. On top of this, SageMaker brings advanced capabilities such as the following:
- Managed storage using either Amazon S3, Amazon EFS, or Amazon FSx for Lustre depending on your performance requirements.
- Managed spot training, using Amazon EC2 Spot instances for training in order to reduce costs by up to 80%.
- Distributed training automatically distributes large-scale training jobs on a cluster of managed instances, using advanced techniques such as data parallelism and model parallelism.
- Pipe mode streams infinitely large datasets from Amazon S3 to the training instances, removing the need to copy data around.
- Automatic model tuning runs hyperparameter optimization to deliver high-accuracy models more quickly.
- Amazon SageMaker Experiments easily tracks, organizes, and compares all your SageMaker jobs.
- Amazon SageMaker Debugger captures the internal model state during training, inspects it to observe how the model learns, detects unwanted conditions that hurt accuracy, and profiles the performance of your training job.
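As an illustration of managed spot training, here's a small sketch of the settings involved at the service API level, and of how savings can be worked out once a job completes. The field names are those of the CreateTrainingJob and DescribeTrainingJob APIs; the helper functions and the numbers are hypothetical.

```python
def spot_training_settings(max_run_seconds, max_wait_seconds):
    """Build the spot-related fields of a CreateTrainingJob request."""
    # The wait time covers both waiting for spot capacity and training itself,
    # so it must be at least as long as the maximum training time.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings derived from TrainingTimeInSeconds and BillableTimeInSeconds."""
    return 100 * (training_seconds - billable_seconds) / training_seconds


settings = spot_training_settings(max_run_seconds=3600, max_wait_seconds=7200)

# Hypothetical job: 1,000 seconds of training, billed for 200 seconds
savings = spot_savings_percent(training_seconds=1000, billable_seconds=200)
print(f"Savings: {savings:.0f}%")
```

Actual savings depend on spot capacity and pricing at the time the job runs, so the figure above is only an example.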
Deploying
Just as with training, Amazon SageMaker takes care of all your deployment infrastructure, and brings a slew of additional features:
- Real-time endpoints create an HTTPS API that serves predictions from your model. As you would expect, autoscaling is available.
- Batch transform uses a model to predict data in batch mode.
- Amazon Elastic Inference adds fractional GPU acceleration to CPU-based endpoints to find the best cost/performance ratio for your prediction infrastructure.
- Amazon SageMaker Model Monitor captures data sent to an endpoint and compares it with a baseline to identify and alert on data quality issues (missing features, data drift, and more).
- Amazon SageMaker Neo compiles models for a specific hardware architecture, including embedded platforms, and deploys an optimized version using a lightweight runtime.
- Amazon SageMaker Edge Manager helps you deploy and manage your models on edge devices.
- Last but not least, Amazon SageMaker Pipelines lets you build end-to-end automated pipelines to run and manage your data preparation, training, and deployment workloads.
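To give you an idea of what calling a real-time endpoint looks like, here's a minimal sketch based on the invoke_endpoint() API of the boto3 sagemaker-runtime client. The endpoint name is a placeholder, and we assume that the deployed model accepts CSV input, which depends on the algorithm or framework you use.

```python
def to_csv_payload(rows):
    """Serialize a list of feature rows to CSV, one sample per line."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def predict(endpoint_name, rows):
    """Send samples to an endpoint (needs AWS credentials and a live endpoint)."""
    import boto3  # imported here so that to_csv_payload() works without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",  # assumption: the model accepts CSV
        Body=to_csv_payload(rows),
    )
    return response["Body"].read().decode("utf-8")


# Hypothetical samples with four features each
payload = to_csv_payload([[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]])
print(payload)

# With a deployed endpoint, you could then call:
# predict("my-endpoint", [[5.1, 3.5, 1.4, 0.2]])
```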
The Amazon SageMaker API
Just like all other AWS services, Amazon SageMaker is driven by APIs that are implemented in the language SDKs supported by AWS (https://aws.amazon.com/tools/). In addition, a dedicated Python SDK, the SageMaker SDK, is also available. Let's look at both, and discuss their respective benefits.
The AWS language SDKs
Language SDKs implement service-specific APIs for all AWS services: S3, EC2, and so on. Of course, they also include SageMaker APIs, which are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/api-and-sdk-reference.html.
When it comes to data science and machine learning, Python is the most popular language, so let's take a look at the SageMaker APIs available in boto3, the AWS SDK for the Python language (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html). These APIs are quite low-level and verbose: for example, create_training_job() has a lot of JSON parameters that don't look very obvious. You can see some of them in the next screenshot. You may think that this doesn't look very appealing for everyday machine learning experimentation… and I would totally agree!
Indeed, these service-level APIs are not meant to be used for experimentation in notebooks. Their purpose is automation, through either bespoke scripts or Infrastructure as Code tools such as AWS CloudFormation (https://aws.amazon.com/cloudformation) and Terraform (https://terraform.io). Your DevOps team will use them to manage production, where they do need full control over each possible parameter.
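To get a feel for the verbosity of the service-level API, here's a minimal sketch of a create_training_job() request built as a Python dictionary. The field names come from the boto3 documentation, but the helper functions, job name, image URI, role ARN, and S3 locations are placeholders, and a real job would likely need more parameters.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri):
    """Assemble a small subset of the create_training_job() parameters."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def submit(request):
    """Start the job (needs AWS credentials and valid resources)."""
    import boto3  # imported here so that the request can be built without boto3

    return boto3.client("sagemaker").create_training_job(**request)


# Placeholder values, for illustration only
request = build_training_job_request(
    job_name="my-training-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>",
    role_arn="arn:aws:iam::<account>:role/<role>",
    train_s3_uri="s3://my_bucket/my_training_data/",
    output_s3_uri="s3://my_bucket/my_output/",
)
print(sorted(request))
```

Every single setting has to be spelled out, which is exactly what you want for automation, and exactly what you don't want when iterating quickly in a notebook.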
So, what should you use for experimentation? You should use the Amazon SageMaker SDK.
The Amazon SageMaker SDK
The Amazon SageMaker SDK (https://github.com/aws/sagemaker-python-sdk) is a Python SDK specific to Amazon SageMaker. You can find its documentation at https://sagemaker.readthedocs.io/en/stable/.
Note
Every effort has been made to check the code examples in this book with the latest SageMaker SDK (v2.58.0 at the time of writing).
Here, the abstraction level is much higher: the SDK contains objects for estimators, models, predictors, and so on. We're definitely back in machine learning territory.
For instance, this SDK makes it extremely easy and comfortable to fire up a training job (one line of code) and to deploy a model (one line of code). Infrastructure concerns are abstracted away, and we can focus on machine learning instead. Here's an example. Don't worry about the details for now:
from sagemaker.tensorflow import TensorFlow

# Configure the training job
# (my_sagemaker_role is assumed to hold the ARN of your SageMaker IAM role)
my_estimator = TensorFlow(
    entry_point='my_script.py',
    role=my_sagemaker_role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='2.1.0',
    py_version='py3')  # SDK v2 expects a Python version as well

# Train the model
my_estimator.fit('s3://my_bucket/my_training_data/')

# Deploy the model to an HTTPS endpoint
my_predictor = my_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.c5.2xlarge')
Now that we know a little more about Amazon SageMaker, let's see how we can set it up.