Standardizing Data
You've already learned a lot about the k-means algorithm, and we are close to the end of this chapter. In this final section, we will not talk about another hyperparameter (you've already been through the main ones) but about a very important topic: data processing.
Fitting a k-means model is extremely easy. The trickiest part is making sure the resulting clusters are meaningful for your project, and we have seen how we can tune some hyperparameters to ensure this. But handling input data is as important as all the steps you have learned about so far. If your dataset is not well prepared, you will still get poor results, even with the best hyperparameters.
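To make this concrete, here is a minimal sketch of how feature scale can affect the Euclidean distances that k-means relies on. The numbers are made up for illustration (they are not taken from the ATO dataset), and the names `net_tax`, `expenses`, and `standardize` are hypothetical. The sketch assumes standardization means converting each feature to z-scores (subtract the mean, divide by the standard deviation).

```python
import math
import statistics

# Hypothetical toy data (NOT the ATO dataset): two features on very different scales.
net_tax = [1000.0, 1200.0, 30000.0, 32000.0]   # values in the thousands
expenses = [10.0, 90.0, 15.0, 95.0]            # values below one hundred


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Build (net_tax, expenses) points before and after standardization.
raw = list(zip(net_tax, expenses))
scaled = list(zip(standardize(net_tax), standardize(expenses)))

# Points 0 and 1 differ mostly in expenses; points 0 and 2 differ mostly in net tax.
# The ratio compares "distance along expenses" with "distance along net tax".
raw_ratio = math.dist(raw[0], raw[1]) / math.dist(raw[0], raw[2])
scaled_ratio = math.dist(scaled[0], scaled[1]) / math.dist(scaled[0], scaled[2])

print(f"raw ratio:    {raw_ratio:.4f}")
print(f"scaled ratio: {scaled_ratio:.4f}")
```

On the raw values, the difference in the small-scale feature contributes almost nothing to the distance, so a distance-based algorithm would effectively ignore it. After standardization, both features contribute on a comparable footing.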
Let's have another look at our ATO dataset. In the previous section, Calculating the Distance to the Centroid, we found three different clusters, and they were mainly defined by the 'Average net tax' variable. It was as if k-means didn't take into account the second...