This chapter introduces the main clustering algorithms by applying them to customer segmentation, that is, grouping customers according to their behavioral patterns. In particular, we will demonstrate how Apache Spark and Amazon SageMaker can work together to perform clustering. Throughout this chapter, we will use the Kaggle E-Commerce dataset from Fabien Daniel, which can be downloaded from https://www.kaggle.com/fabiendaniel/customer-segmentation/data.
Let's take a look at the topics we will cover:
- Understanding how clustering algorithms work
- Clustering with Apache Spark on Elastic MapReduce (EMR)
- Clustering using SageMaker through Spark integration