Chapter 4: Preparing, Processing, and Analyzing the Data
Before we can train a machine learning model, we have to prepare, process, and transform our data into a structure and format that the algorithm can work with. Different techniques and services are available for these data processing and analysis requirements. The recipes in this chapter focus on the key SageMaker capabilities, algorithms, and features for these tasks: SageMaker Processing for managed data processing and transformation, invoking deployed SageMaker machine learning models from Amazon Athena so that we can analyze our data with SQL statements, the built-in Principal Component Analysis (PCA) algorithm for dimensionality reduction, and the built-in K-Means algorithm for cluster analysis.
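To make the last two ideas concrete before we reach the recipes, here is a minimal sketch in plain Python of dimensionality reduction followed by cluster analysis. It does not use the SageMaker built-in PCA or K-Means algorithms. The function names (first_principal_component, project, kmeans_1d) and the toy dataset are hypothetical and exist only for illustration. The sketch finds the first principal component with power iteration, projects each three-dimensional point onto it, and then groups the one-dimensional projections with a simple k-means loop.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (mean, unit direction) of the top principal component.

    Uses power iteration on the sample covariance matrix. Assumes the data
    has non-zero variance and the start vector is not orthogonal to the
    top component.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return mean, vec


def project(rows, mean, direction):
    """Reduce each row to one number: its coordinate along `direction`."""
    return [sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for r in rows]


def kmeans_1d(values, k=2, iterations=20):
    """Cluster 1-D values with a basic k-means loop (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


# Toy dataset (made up for this sketch): two well-separated groups in 3-D.
points = [
    [0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [0.1, 0.2, 0.0],
    [10.0, 10.1, 0.0], [10.2, 9.9, 0.1], [9.9, 10.0, 0.0],
]

mean, direction = first_principal_component(points)
reduced = project(points, mean, direction)      # 3 features -> 1 feature
labels, centroids = kmeans_1d(reduced, k=2)
print(labels)
```

The first three points should end up with one label and the last three with the other, even though the clustering step only sees a single feature per point. The built-in SageMaker algorithms covered later in this chapter serve the same two purposes as managed training jobs.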
We will start with a gentle introduction to Amazon Athena and use it to help us process and analyze our large datasets and files...