ML on networks
Now that we have explored the friendship data a bit, let’s see how a clustering algorithm’s performance varies depending on whether or not we include structural information about the network. We’ll start by considering just student factors.
Clustering based on student factors
For our first attempt at clustering, we’ll focus on the dataset itself, which contains metadata regarding student demographics and social activities. One of the simplest clustering algorithms is k-means clustering, which iteratively assigns each data point to its nearest cluster center and then recomputes the centers, minimizing within-cluster variance (and, equivalently, maximizing between-cluster variance). This means that students clustered together tend to have more in common with students in that same cluster than with students in other clusters. K-means works well when clusters are roughly compact and similar in size, but it is sensitive to feature scaling and to the initial choice of centers. One also needs to specify the number of clusters up front, which is typically not known ahead of time. We’ll set the number of clusters to 3
and assess...
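To make the iterative partitioning concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The student records, the two features (grade level and number of activities), and the function name kmeans are made up for illustration and are not taken from the dataset discussed here; in practice one would more likely use a library implementation such as scikit-learn's KMeans.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical student features: (grade level, number of activities).
students = [
    (9, 1), (9, 2), (10, 1),
    (10, 5), (11, 6), (11, 5),
    (12, 9), (12, 10), (11, 9),
]

labels, centroids = kmeans(students, k=3)
for student, label in zip(students, labels):
    print(student, "-> cluster", label)
```

Because the result depends on the random initial centers, the cluster numbering (and occasionally the grouping itself) can change with the seed; library implementations typically rerun the algorithm from several initializations and keep the best result.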