Clustering for customer segmentation
We will now build a program that uses the k-means clustering algorithm to create five clusters from our transactional dataset.
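To make the idea behind k-means concrete, here is a minimal sketch of the algorithm in plain Java. It is only an illustration under our own assumptions, not the Spark code used in this program: the class name KMeansSketch, the method names, and the choice of seeding the centroids with the first k points are all hypothetical. The sketch alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, until the assignments stop changing.

```java
import java.util.Arrays;

public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Returns the cluster index assigned to each point.
    // Assumption: the first k points seed the centroids, which keeps the run deterministic.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || points.length < k) {
            throw new IllegalArgumentException("need at least k points and k >= 1");
        }
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centroids will not move again
            }
            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int i = 0; i < dim; i++) {
                    sums[assignment[p]][i] += points[p][i];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int i = 0; i < dim; i++) {
                    centroids[c][i] = sums[c][i] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Made-up sample points forming two obvious groups.
        double[][] points = { {1.0, 1.0}, {9.0, 9.0}, {1.2, 0.8}, {9.1, 8.9} };
        System.out.println(Arrays.toString(cluster(points, 2, 10)));
    }
}
```

In our program the number of clusters would be five rather than two, and the points would be derived from the transactional data instead of being typed in by hand.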
Before we crunch the data to find the clusters, we make a few assumptions that determine how we preprocess it:
- We will cluster only the data belonging to the United Kingdom, because most of the data in this dataset belongs to the United Kingdom.
- We will simply discard any row that has a missing or null value. This keeps things simple, and because we have a good amount of data available for analysis, dropping a few rows should not have much impact.
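The two preprocessing rules above can be sketched in plain Java as follows. This is only an illustration of the filtering logic, not the Spark code of our program: the class name PreprocessSketch, the representation of a row as a String array, and the column position COUNTRY_COL are assumptions made for the example and are not taken from the dataset.

```java
import java.util.ArrayList;
import java.util.List;

public class PreprocessSketch {

    // Hypothetical position of the country column in a row.
    static final int COUNTRY_COL = 2;

    // True when any field of the row is null or blank.
    static boolean hasMissingValue(String[] row) {
        for (String field : row) {
            if (field == null || field.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    // Keeps only complete rows whose country is the United Kingdom.
    static List<String[]> preprocess(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (row.length <= COUNTRY_COL || hasMissingValue(row)) {
                continue; // rule 2: discard rows with missing or null values
            }
            if (!"United Kingdom".equals(row[COUNTRY_COL])) {
                continue; // rule 1: keep United Kingdom data only
            }
            kept.add(row);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample rows: invoice, customer, country.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"A1", "C1", "United Kingdom"});
        rows.add(new String[] {"A2", null, "United Kingdom"});
        rows.add(new String[] {"A3", "C3", "France"});
        System.out.println(preprocess(rows).size() + " row(s) kept");
    }
}
```

Of the three sample rows, only the first survives: the second has a null field and the third belongs to another country.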
Let's now start our program. We will first write the boilerplate code that creates the Spark configuration and the SparkSession:
SparkConf conf = ...
SparkSession session = ...
Next, let's load the data from the file into a dataset:
Dataset<Row> rawData = session.read().csv("data/retail...