There are three major components of the inference pipeline we are building:
- Data preprocessing
- Model training
- Data preprocessing (from Step 1) and inference
The following is the architectural diagram; the steps we are going to walk through are applicable to big data workloads:
In the first step of the pipeline, we execute data processing logic on Apache Spark via AWS Glue. The Glue service is called from a SageMaker Notebook instance.
AWS Glue is a fully managed, serverless Extract, Transform, and Load (ETL) service that's used to wrangle big data. ETL jobs run in an Apache Spark environment, where Glue provisions, configures, and scales the resources required to run the jobs.
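As a minimal sketch of how a Glue job can be triggered from a SageMaker notebook instance, the following uses the AWS SDK for Python (boto3); the job name, job arguments, and S3 paths are hypothetical placeholders and assume a Glue job has already been defined:

```python
import boto3

# Create a Glue client (assumes the notebook's IAM role has Glue permissions)
glue = boto3.client("glue")

# Start an existing Glue ETL job; the name and arguments below are placeholders
response = glue.start_job_run(
    JobName="news-headline-preprocessing",            # hypothetical job name
    Arguments={
        "--S3_INPUT_PATH": "s3://my-bucket/raw/",      # hypothetical input path
        "--S3_OUTPUT_PATH": "s3://my-bucket/processed/",
    },
)

# Check the status of the job run that Glue provisioned and started
status = glue.get_job_run(
    JobName="news-headline-preprocessing",
    RunId=response["JobRunId"],
)
print(status["JobRun"]["JobRunState"])
```

In practice, the notebook would poll this status (or use a waiter) until the run reaches SUCCEEDED before moving on to the training step of the pipeline.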
The data processing logic, in our case, includes creating tokens/words from each of the news headlines, removing...