DBSCAN
Another clustering method that can work well for irregular cluster shapes is DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise. The algorithm works quite differently from k-means or hierarchical clustering. With DBSCAN, our clusters are composed of core points and non-core points. A core point has at least n points within a distance epsilon of it, where epsilon is the eps parameter and n is the min_samples parameter in sklearn's DBSCAN (the count includes the point itself). Any other points within the distance epsilon of a core point are also in the cluster. Points that are not within the epsilon distance of any core point are outliers. This algorithm assumes we have some low-density dead space between clusters, so our clusters must have at least some separation. We can also tune the eps and min_samples hyperparameters to optimize clustering metrics.
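The core-point and outlier rules described above can be sketched with scikit-learn as follows. This is a minimal illustration rather than a recommended configuration: the two-moons dataset, the scaling step, and the values eps=0.3 and min_samples=5 are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Synthetic, non-spherical data (an assumption for this sketch).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# Illustrative hyperparameter values, not tuned defaults.
eps = 0.3
min_samples = 5

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
labels = db.labels_  # cluster index per sample; -1 marks outliers (noise)

# Boolean mask of the core points found by scikit-learn.
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Cross-check the definition by hand: a core point has at least
# min_samples points within eps of it, counting the point itself.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
manual_core = (dists <= eps).sum(axis=1) >= min_samples

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"clusters: {n_clusters}, core points: {core_mask.sum()}, outliers: {n_noise}")
print("manual core check matches sklearn:", np.array_equal(manual_core, core_mask))
```

Non-core points that still received a cluster label are the points within eps of a core point. Rerunning the sketch with different eps and min_samples values shows how the number of clusters and outliers changes.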
The min_samples hyperparameter should generally be between the number of features and two times the...