Bisecting KMeans, the new kid on the block in Spark 2.0
In this recipe, we will download the glass dataset and try to identify and label each glass sample using the bisecting KMeans algorithm. Bisecting KMeans is a hierarchical version of the K-Means algorithm, implemented in Spark through the BisectingKMeans() API. While the algorithm is conceptually similar to KMeans, it can offer a considerable speed advantage in use cases where the data has a hierarchical structure.
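To make the hierarchical idea concrete, here is a minimal plain-Python sketch of bisecting k-means. It is not Spark's implementation. It assumes a common textbook variant: start with one cluster holding every point, then repeatedly split the cluster with the largest sum of squared errors (SSE) using 2-means until k clusters exist. The helper names (`sq_dist`, `centroid`, `sse`, `two_means`, `bisecting_kmeans`), the deterministic seeding, and the toy points are all illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, iterations=20):
    """Split one cluster in two with plain k-means (k = 2).

    Seeding is deterministic (an assumption of this sketch): the point
    farthest from the centroid, then the point farthest from that one.
    """
    center = centroid(points)
    first = max(points, key=lambda p: sq_dist(p, center))
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    left, right = list(points), []
    for _ in range(iterations):
        # Ties go to the first center.
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new_centers = [centroid(left), centroid(right)]
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters by repeatedly bisecting the worst cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        # Only clusters with at least two points can be split.
        divisible = [i for i, c in enumerate(clusters) if len(c) > 1]
        if not divisible:
            break
        worst = max(divisible, key=lambda i: sse(clusters[i]))
        left, right = two_means(clusters[worst])
        if not left or not right:
            break
        # Replace the parent cluster with its two children.
        clusters[worst:worst + 1] = [left, right]
    return clusters


# Made-up 2-D points forming three well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (30, 0), (30, 1), (31, 0), (31, 1)]
for cluster in bisecting_kmeans(toy_points, k=3):
    print(cluster)
```

Each split only touches the points of one cluster, which is where the potential speed advantage over running KMeans on the full data comes from. Spark's own splitting order differs from this sketch: it bisects the divisible clusters level by level and gives larger clusters priority when a full level would exceed k.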
The dataset used in this recipe is the Glass Identification Database. The study of classifying types of glass was motivated by criminological research: glass can serve as evidence if it is correctly identified. The data is available from NTU (Taiwan), already in LIBSVM format.
How to do it...
- We downloaded the prepared data file in LIBSVM format from: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/glass.scale
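The file contains 214 rows, each with a class label and 9 features. (The original dataset has 11 columns once the ID and the class attribute are counted.)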
- The original dataset and data dictionary are also available at...
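For readers who have not seen LIBSVM format before, each line holds a label followed by sparse `index:value` pairs, with 1-based indices and omitted indices meaning zero. The plain-Python sketch below shows how one such line can be read. The function names `parse_libsvm_line` and `to_dense` are hypothetical helpers, and the sample line is made up for illustration rather than copied from glass.scale. In Spark itself the file would normally be loaded with the built-in libsvm data source instead of hand-written parsing.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def to_dense(features, num_features):
    """Expand a sparse {idx: val} map into a dense list.

    LIBSVM indices start at 1, and any index that is missing is zero.
    """
    return [features.get(i, 0.0) for i in range(1, num_features + 1)]


# Made-up sample line (not taken from glass.scale).
label, features = parse_libsvm_line("1 1:0.5 3:-0.25")
print(label, to_dense(features, 4))
```

Keeping the sparse map separate from the dense expansion mirrors the format itself: the file never states the feature count, so the caller has to supply the width.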