In this section, we will consider the aspects involved in tuning the clusters we use for ML. When you launch an EMR cluster, you can specify the different applications you want to run.
The following screenshot shows the applications available in EMR version 5.23.0:
Upon launching an EMR cluster, these are the most relevant items that need to be configured:
- Applications: The software to install on the cluster, such as Spark.
- Hardware: We covered this in Chapter 10, Creating Clusters on AWS.
- Use of the Glue Data Catalog: We'll cover this in the last section of this chapter, Managing data pipelines with Glue.
- Software configuration: These are application-specific properties that we can set when the cluster is launched. In the next section, Configuring application properties, we'll show you how to customize the behavior of Spark through specific properties...
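To make the four items above concrete, here is a minimal sketch of how they might map onto a cluster launch request. It only builds and prints the request as a dictionary; nothing is sent to AWS. The keys follow the parameter names of the boto3 EMR `run_job_flow` call. The helper name `build_cluster_request`, the cluster name, the instance types and count, and the `spark.executor.memory` value are all assumptions chosen for illustration, not recommendations.

```python
import json


def build_cluster_request(name, release_label="emr-5.23.0",
                          applications=("Spark",),
                          use_glue_catalog=True,
                          spark_properties=None):
    """Hypothetical helper: assemble an EMR launch request as a plain dict."""
    configurations = []

    # Software configuration: application-specific properties, grouped by
    # classification (spark-defaults holds Spark's default settings).
    if spark_properties:
        configurations.append({
            "Classification": "spark-defaults",
            "Properties": dict(spark_properties),
        })

    # Use of the Glue Data Catalog: point Spark's Hive metastore client
    # at Glue instead of a cluster-local metastore.
    if use_glue_catalog:
        configurations.append({
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore."
                    "AWSGlueDataCatalogHiveClientFactory",
            },
        })

    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Applications: the software to install on the cluster.
        "Applications": [{"Name": app} for app in applications],
        # Hardware: instance types and count are illustrative assumptions.
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "Configurations": configurations,
        # Default EMR role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    "ml-cluster",  # hypothetical cluster name
    spark_properties={"spark.executor.memory": "4g"},  # illustrative value
)
print(json.dumps(request, indent=2))
```

With boto3 installed and AWS credentials configured, a dictionary of this shape could be passed to the EMR client as `boto3.client("emr").run_job_flow(**request)`. Keeping the request as plain data makes it easy to review or version the cluster settings before anything is launched.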