The following steps are required to complete this recipe:
- Import the required libraries:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
- Upload the data file to a DataFrame:
df = spark.read.format("csv") \
  .option("inferSchema", True) \
  .option("header", True) \
  .option("sep", "\t") \
  .load("/FileStore/tables/HT_Sensor_metadata.dat")
- Plot the dataset to see whether the natural grouping of the data corresponds to the chosen number of clusters:
pdf = df.toPandas()
y_pred = KMeans(n_clusters=3,
                random_state=2).fit_predict(pdf[['dt', 't0']])
plt.scatter(pdf['t0'], pdf['dt'], c=y_pred)
display(plt.gcf())
The output is as follows:
The preceding chart shows three different groups of data. Tight clusters represent data with well-defined boundaries. If we adjust the number of clusters to 10, we may be able to get better...
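Re-running the clustering with a larger k can be sketched as follows. Since the HT_Sensor_metadata.dat file is not available here, this minimal example substitutes synthetic two-column data for the 'dt'/'t0' columns; only the KMeans call mirrors the recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the sensor metadata: in the recipe these two
# columns would be pdf[['dt', 't0']] from HT_Sensor_metadata.dat.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2)) + rng.integers(0, 5, size=(300, 1)) * 3.0

# Re-run the clustering with n_clusters=10, as the text suggests,
# to see whether a finer partition fits the data better.
y_pred = KMeans(n_clusters=10, random_state=2, n_init=10).fit_predict(X)
print(np.unique(y_pred).size)  # number of distinct cluster labels found
```

With 300 points and k=10, KMeans assigns every one of the 10 labels, so the scatter plot colored by `y_pred` would show ten groups instead of three.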