In this recipe, we use numpy for data manipulation, sklearn for the machine learning algorithm, and matplotlib to view the results. We first pull the tab-separated file into a Spark dataframe and then convert the data into a pandas DataFrame. Finally, we run the k-means algorithm with three clusters and plot the result as a chart.
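The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual code: the synthetic data, the column names `x` and `y`, and the helper `plot_clusters` are assumptions made for the example, and the Spark loading step is replaced by generated points so the sketch runs without a cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three well-separated blobs of 2-D points.
# In the recipe, the pandas DataFrame would instead come from the Spark
# dataframe (for example via its toPandas() method).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
df = pd.DataFrame(points, columns=["x", "y"])

# Fit k-means with three clusters and store each row's cluster label.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(df[["x", "y"]])


def plot_clusters(frame, fitted_model):
    """Scatter-plot the points colored by cluster, with centroids marked."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for the chart

    plt.scatter(frame["x"], frame["y"], c=frame["cluster"])
    centroids = fitted_model.cluster_centers_
    plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", color="red")
    plt.show()
```

Calling `plot_clusters(df, model)` would then draw the chart. The fitted centroids are available as `model.cluster_centers_`.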
K-means is a popular clustering algorithm that groups unlabeled data into clusters. It first randomly initializes the cluster centroids; our example has three. It then assigns each data point to its nearest centroid. Next, it moves each centroid to the middle (the mean) of the points assigned to it. It repeats the assignment and update steps until the division of data points stabilizes, typically when the assignments stop changing.
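To make the loop concrete, here is a small sketch of these steps in plain numpy. It is an illustration under stated assumptions rather than what sklearn does internally: the function name `kmeans`, its parameters, and the choice to initialize centroids by sampling existing data points are all made up for this example.

```python
import numpy as np


def kmeans(points, k, max_iter=100, seed=0):
    """Toy k-means: returns (centroids, labels) for an (n, d) array of points."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)

    # Step 1 (assumed initialization): use k distinct data points as centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None

    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)

        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)

    return centroids, labels
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)` groups the first two points together and the last two together. Real implementations add refinements such as smarter initialization and multiple restarts, which this sketch leaves out.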