15.5 Clustering
Suppose I have a CSV dataset containing 75 (x, y) geometric coordinates. I load these into the xy_df pandas DataFrame and look at its descriptive statistical summary:
xy_df = pd.read_csv("src/examples/clustering-xy.csv")
xy_df.describe()
             x        y
count  75.0000  75.0000
mean    7.5733   4.5401
std     4.0102   2.1265
min     1.9796   1.1947
25%     3.4182   2.8896
50%     7.0173   3.6819
75%    12.2170   6.9615
max    13.5643   8.2785
Here is the usual sample of the first five points:
xy_df.head()
         x       y
0  13.4832  3.2657
1   7.6388  7.0170
2   2.9279  2.9603
3   7.4514  6.4439
4   3.3011  2.4642
How are these points spread out geometrically? Are they uniformly distributed within their minimum and maximum ranges?
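One rough way to probe the uniformity question from the summary alone is to compare the observed statistics with what a continuous uniform distribution on the same range would give: a mean of (min + max) / 2, a standard deviation of (max - min) / sqrt(12), and quartiles at 25% and 75% of the way across the range. The sketch below does this with the values copied from the describe() output. The names summary, uniform_reference, and comparison are illustrative and not part of the original example, and treating the sample minimum and maximum as the true range is an assumption.

```python
import math

# Values copied from the describe() output above.
summary = {
    "x": {"min": 1.9796, "max": 13.5643, "mean": 7.5733, "std": 4.0102,
          "25%": 3.4182, "75%": 12.2170},
    "y": {"min": 1.1947, "max": 8.2785, "mean": 4.5401, "std": 2.1265,
          "25%": 2.8896, "75%": 6.9615},
}


def uniform_reference(lo, hi):
    """Reference statistics for a continuous uniform distribution on [lo, hi]."""
    width = hi - lo
    return {
        "mean": (lo + hi) / 2,
        "std": width / math.sqrt(12),
        "25%": lo + 0.25 * width,
        "75%": lo + 0.75 * width,
    }


# Assumption: the sample min and max stand in for the true range.
comparison = {
    name: uniform_reference(stats["min"], stats["max"])
    for name, stats in summary.items()
}

for name, reference in comparison.items():
    for key, expected in reference.items():
        observed = summary[name][key]
        print(f"{name} {key:>4}: observed {observed:8.4f}   uniform {expected:8.4f}")
```

For x, the observed standard deviation of 4.0102 is larger than the uniform reference of roughly 3.34, and both quartiles sit closer to the extremes than a uniform spread would place them. That hints the points may not be evenly spread, though summary statistics alone cannot show where they gather.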
A scatter plot would help us see the distribution because we are in two dimensions, but let’s try to collect or cluster the points into k groups first. Here, k is...