Let's now consider a more detailed problem on a larger dataset (the instructions to download it are provided in the Technical requirements section at the beginning of the chapter). It contains 527 samples with 38 chemical and physical variables describing the status of water treatment plants. As the dataset's authors (Bejar, Cortes, and Poch) stated, the domain is poorly structured and requires careful analysis. At the same time, our goal is to find the optimal clustering with an agnostic approach. In other words, we won't consider the semantic labeling process (which needs a domain expert), but only the geometrical structure of the dataset and the relations discovered by the agglomerative algorithm.
Once downloaded, the CSV file (called water-treatment.data) can be loaded using pandas (of course, the term <DATA_PATH...