In this section, we step through the creation of a clustering model capable of grouping consumer purchase patterns into three distinct clusters. The first step will be to launch an EMR notebook along with a small cluster (a single m5.xlarge node is sufficient, as the dataset we selected is not very large). Simply follow these steps:
- Load the dataframe and inspect the dataset:
df = spark.read.csv(SRC_PATH + 'data.csv',
                    header=True,
                    inferSchema=True)
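The two options passed to spark.read.csv are worth unpacking: header=True treats the first row of the file as column names, and inferSchema=True asks Spark to sample the data and guess column types instead of reading everything as strings. The same two ideas can be sketched with the standard library alone, using toy data (the column names below are illustrative assumptions, not the actual schema of data.csv):

```python
import csv
import io

# Toy stand-in for the transactions file; the columns here are
# hypothetical, chosen only to illustrate the two read options.
raw = io.StringIO(
    "customer_id,product,quantity\n"
    "C001,apples,2\n"
    "C002,bread,1\n"
)

# header=True in spark.read.csv corresponds to treating the first
# row as column names, which csv.DictReader does by default.
rows = list(csv.DictReader(raw))
print(rows[0]["product"])  # fields are addressable by column name

# inferSchema=True makes Spark cast columns to numeric types where
# possible; with the csv module every field stays a string, so the
# equivalent cast is manual.
quantity = int(rows[0]["quantity"])
print(quantity)
```

Without inferSchema=True, Spark behaves like the csv module here: every column comes back as a string, which breaks numeric operations downstream, so it is worth enabling for datasets of this size.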
The following screenshot shows the first few lines of our df dataframe:
As you can see, the dataset involves transactions of products bought by different customers at different times and in different locations. We attempt to cluster these customer transactions using k-means by looking at three factors:
- The product (represented by the...