Similarity and distance metrics
The first step in clustering any new dataset is to decide how to measure the similarity (or dissimilarity) between items. Sometimes the choice is dictated by the kind of similarity we are most interested in measuring; in other cases it is restricted by the properties of the dataset. In the following sections we illustrate several kinds of distance for numerical, categorical, time series, and set-based data. While this list is not exhaustive, it should cover many of the common use cases you will encounter in business analysis. We will also cover normalizations that may be needed for different data types prior to running clustering algorithms.
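To make the idea of matching a distance to a data type concrete, the following is a minimal sketch rather than a definitive implementation. It assumes Euclidean distance for numerical data, Hamming distance for categorical records, and Jaccard distance for sets; time series distances are left out here. The function names and all values are hypothetical and invented for illustration; none of them come from the wine.data file.

```python
import numpy as np

# Toy numerical records, invented for illustration only.
# The second feature is on a much larger scale than the first.
a = np.array([1.0, 200.0])
b = np.array([2.0, 180.0])
c = np.array([5.0, 200.0])


def euclidean(x, y):
    """Euclidean (straight-line) distance between two numerical vectors."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


def hamming(x, y):
    """Fraction of positions at which two equal-length categorical records differ."""
    return sum(u != v for u, v in zip(x, y)) / len(x)


def jaccard_distance(s, t):
    """One minus the ratio of shared elements to all elements in either set."""
    union = s | t
    if not union:
        return 0.0  # convention assumed here: two empty sets are identical
    return 1.0 - len(s & t) / len(union)


# Raw distances are dominated by the large-scale second feature.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Z-score normalization: subtract each column's mean, divide by its standard deviation.
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
norm_ab = euclidean(Z[0], Z[1])
norm_ac = euclidean(Z[0], Z[2])

print("raw:        d(a,b)=%.2f  d(a,c)=%.2f" % (raw_ab, raw_ac))
print("normalized: d(a,b)=%.2f  d(a,c)=%.2f" % (norm_ab, norm_ac))

# Categorical and set-based examples.
print(hamming(("red", "dry", "oak"), ("red", "sweet", "oak")))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
```

In this toy example, a is closer to c than to b on the raw values, but closer to b than to c after normalization. This is why the scale of each feature needs to be considered before running a clustering algorithm on numerical data.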
Numerical distance metrics
Let us begin by exploring the data in the wine.data file. It contains a set of chemical measurements describing the properties of different kinds of wines, and the quality level (I-III) to which the wines are assigned (Forina, M., et al. PARVUS An Extendible Package for Data Exploration...