Developing ML pipelines with Glue
Combining SageMaker’s model-hosting features and libraries with Glue’s data preparation and orchestration features allows you to create complex and highly configurable ML pipelines. In this architecture, each service has a distinct role:
- Glue is responsible for data handling and orchestration. Data handling includes extraction, processing, preparation, and storage. Orchestration refers to the overall execution of the pipeline itself.
- SageMaker handles all ML-related tasks such as model creation, training, and hosting.
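To make this division of labor concrete, here is a minimal sketch of the handoff between the two services: a plain Python script that asks SageMaker to train on data that Glue has already prepared and stored in S3. It assumes the boto3 SageMaker client and its `create_training_job` call; the function names, the job name prefix, the instance settings, and every S3 URI, image URI, and role ARN passed in are hypothetical placeholders rather than values from this chapter.

```python
from datetime import datetime, timezone


def build_training_request(prepared_data_s3_uri, output_s3_uri, image_uri, role_arn, now=None):
    """Build a request dict for SageMaker's CreateTrainingJob API.

    Every argument is a caller-supplied placeholder (hypothetical values).
    """
    now = now or datetime.now(timezone.utc)
    return {
        # Job names must be unique within an account and region, so append a timestamp.
        "TrainingJobName": "glue-prepared-" + now.strftime("%Y%m%d-%H%M%S"),
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # The training channel points at the S3 prefix written by the Glue ETL job.
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": prepared_data_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Illustrative instance settings; size these for the real workload.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def start_training(sagemaker_client, request):
    """Submit the request and return the ARN of the new training job."""
    response = sagemaker_client.create_training_job(**request)
    return response["TrainingJobArn"]
```

In a real script, `sagemaker_client` would typically be `boto3.client("sagemaker")`. Passing the client in as an argument keeps the request-building logic testable without AWS credentials.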
Several Glue components are central to this architecture:
- Glue workflows are the main form of orchestration in Glue. Workflows allow users to define graph-based chains of crawlers, ETL jobs, and triggers, and to see their execution visually in the web console.
- Python Shell jobs are a sub-class of Glue ETL jobs that are designed to run plain Python scripts instead of PySpark ones. They...