Self-organizing maps
A SOM is a technique for generating topological representations of data in reduced dimensions. It is one of a number of techniques with such applications, a better-known alternative being PCA. However, SOMs present unique opportunities, both as a dimensionality reduction technique and as a visualization format.
SOM – a primer
The SOM algorithm involves iteration over many simple operations. When applied at a smaller scale, it behaves similarly to k-means clustering (as we'll see shortly). At a larger scale, SOMs reveal the topology of complex datasets in a powerful way.
An SOM is made up of a grid (commonly rectangular or hexagonal) of nodes, where each node contains a weight vector that is of the same dimensionality as the input dataset. The nodes may be initialized randomly, but an initialization that roughly approximates the distribution of the dataset will tend to train faster.
The algorithm iterates as observations are presented as input. Iteration takes the following form:
- Identify the winning node in the current configuration, known as the Best Matching Unit (BMU). The BMU is found by measuring the Euclidean distance between the input vector and each node's weight vector and selecting the closest node.
- Adjust (move) the BMU's weight vector towards the input vector.
- Adjust neighboring nodes as well, usually by lesser amounts, with the magnitude of each neighbor's movement dictated by a neighborhood function. (Neighborhood functions vary. In this chapter, we'll use a Gaussian neighborhood function.)
This process repeats over potentially many iterations, using sampling if appropriate, until the network converges (reaching a position where presenting a new input does not provide an opportunity to minimize loss).
A node in an SOM is not unlike a node in a neural network; it possesses a weight vector of length equal to the dimensionality of the input dataset. Because the nodes themselves sit on a low-dimensional grid, the topology of the input dataset can be preserved and visualized through a lower-dimensional mapping.
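To make the iteration concrete, here is a minimal sketch of a single training step with a Gaussian neighborhood function. It operates on plain weights and grid_positions arrays, which are illustrative names for this sketch rather than attributes of the som.py class we'll use shortly.

import numpy as np

def som_training_step(weights, grid_positions, x, learning_rate, sigma):
    # weights: (n_nodes, n_features) weight vectors.
    # grid_positions: (n_nodes, 2) coordinates of each node on the 2D map.
    # x: a single input observation of length n_features.
    # 1. Find the Best Matching Unit: the node whose weight vector is closest
    #    to the input in Euclidean distance.
    distances = np.linalg.norm(weights - x, axis=1)
    bmu = np.argmin(distances)
    # 2. Compute a Gaussian neighborhood weight for every node from its
    #    distance to the BMU on the grid (not in the data space).
    grid_distances = np.linalg.norm(grid_positions - grid_positions[bmu], axis=1)
    influence = np.exp(-(grid_distances ** 2) / (2 * sigma ** 2))
    # 3. Move every node towards the input, scaled by the learning rate and its
    #    neighborhood influence; the BMU moves most, distant nodes barely move.
    weights += learning_rate * influence[:, np.newaxis] * (x - weights)
    return weights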
The code for this SOM class implementation is available in the book repository in the som.py script. For now, let's start working with the SOM algorithm in a familiar context.
Employing SOM
As discussed previously, the SOM algorithm is iterative and is based on Euclidean distance comparisons between vectors.
This mapping tends to form a fairly readable 2D grid. In the case of the commonly used Iris tutorial dataset, an SOM will map it out pretty cleanly:
In this diagram, the classes have been separated and also ordered spatially. The background coloring in this case is a clustering density measure. There is some minimal overlap between the blue and green classes, where the SOM performed an imperfect separation. On the Iris dataset, an SOM will tend to reach a converged solution within around 100 iterations, with little visible improvement after 1,000. For more complex datasets containing less clearly divisible cases, this process can take tens of thousands of iterations.
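As an aside, a map like the one above can be produced with the same Som class we'll use for the digits dataset later in this section. The grid size, sigma, and iteration count below are illustrative choices rather than the exact settings behind the preceding diagram:

from sklearn.datasets import load_iris
from som import Som

iris = load_iris()
iris_data = iris.data    # 150 observations, 4 features each

# A small dataset only needs a small grid: 8 x 8 nodes with 4-dimensional weights.
iris_som = Som(8, 8, 4, sigma=1.0, learning_rate=0.5)
iris_som.random_weights_init(iris_data)
iris_som.train_random(iris_data, 1000)    # roughly 1,000 iterations is ample for Iris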
Awkwardly, there is no implementation of the SOM algorithm within pre-existing Python packages such as scikit-learn. This makes it necessary for us to use our own implementation.
The SOM code we'll be working with for this purpose is located in the associated GitHub repository. For now, let's take a look at the relevant script and get an understanding of how the code works:
import numpy as np
from sklearn.datasets import load_digits
from som import Som
from pylab import plot, axis, show, pcolor, colorbar, bone

digits = load_digits()
data = digits.data
labels = digits.target
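Before building the map, it's worth confirming the dimensionality of the data, since the SOM's weight vectors must match it. This quick check is an illustrative addition rather than part of the repository script:

print(data.shape)      # (1797, 64): each digit is an 8 x 8 pixel image flattened into 64 features
print(labels.shape)    # (1797,): one integer class label (0-9) per observation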
At this point, we've loaded the digits dataset and identified labels as a separate set of data. Doing this will enable us to observe how the SOM algorithm separates classes when assigning them to the map:
som = Som(16, 16, 64, sigma=1.3, learning_rate=0.5)
som.random_weights_init(data)
print("Initiating SOM.")
som.train_random(data, 10000)
print("\n. SOM Processing Complete")

bone()
pcolor(som.distance_map().T)
colorbar()
At this point, we have utilized a Som class that is provided in a separate file, som.py, in the repository. This class contains the methods required to deliver the SOM algorithm we discussed earlier in the chapter. As arguments to the class, we provide the dimensions of the map (after trialing a range of options, we'll start with 16 x 16 in this case; this grid size gives the feature map enough space to spread out while retaining some overlap between groups) and the dimensionality of the input data (this argument determines the length of the weight vector within the SOM's nodes). We also provide values for sigma and the learning rate.
Sigma, in this case, defines the spread of the neighborhood function. As noted previously, we're using a Gaussian neighborhood function. The appropriate value for sigma varies by grid size: for an 8 x 8 grid, we would typically use a value of 1.0, while for the 16 x 16 grid used here we're using 1.3. It is fairly obvious when the value of sigma is off; if it is too small, values tend to cluster near the center of the grid, and if it is too large, the grid typically ends up with several large, empty spaces towards the center.
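To make sigma's effect concrete, the following short calculation (an illustrative aside, not part of the repository code) shows how the Gaussian neighborhood weight falls off with grid distance for the two values mentioned above:

import numpy as np

def neighborhood_weight(d, sigma):
    # Gaussian neighborhood weight for a node at grid distance d from the BMU.
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

for sigma in (1.0, 1.3):
    # With the larger sigma, nodes two or three cells from the BMU still receive
    # a meaningful share of each update, which suits the larger 16 x 16 grid.
    print(sigma, [round(neighborhood_weight(d, sigma), 3) for d in range(4)])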
The learning rate self-explanatorily defines the initial learning rate for the SOM. As the map continues to iterate, the learning rate decays as a function of t, the iteration index, so that adjustments become progressively smaller as training proceeds.
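The exact schedule is defined in som.py; as a mental model, one common choice is an inverse decay of the following general form (a sketch of a typical schedule, not a quotation of the repository code):

def decayed_learning_rate(initial_rate, t, max_iterations):
    # Inverse decay: start at initial_rate and fall smoothly as t grows, so
    # early iterations make large adjustments and later ones only fine-tune.
    return initial_rate / (1 + t / (max_iterations / 2.0))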
We follow up by initializing our SOM with random weights.
Note
As with k-means clustering, this initialization method is slower than initializing based on an approximation of the data distribution. A preprocessing step similar to that employed by the k-means++ algorithm would accelerate the SOM's convergence. Our SOM runs sufficiently quickly over the digits dataset to make this optimization unnecessary for now.
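If that optimization were needed, one simple data-driven approach is to seed each node with a randomly sampled observation, so that the initial map already reflects the distribution of the data. The helper below is a hypothetical illustration and is not a method of the Som class in the repository:

import numpy as np

def sample_based_init(data, grid_x, grid_y, seed=0):
    # Draw one observation per node (with replacement) and use it as that node's
    # starting weight vector, instead of starting from pure noise.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=grid_x * grid_y)
    return data[idx].reshape(grid_x, grid_y, data.shape[1]).astype(float)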
Next, we set up marker and color assignments for each class, so that we can distinguish classes on the plotted SOM. Following this, we iterate through each data point.
On each iteration, we plot a class-specific marker for the BMU as calculated by our SOM algorithm.
When the SOM finishes iterating, we add a U-Matrix (a colorized matrix of the mean distance between each node's weight vector and those of its neighboring nodes) as a monochrome-scaled plot layer:
# digits.target is already an integer array (0-9), so no relabeling is needed.
markers = ['o', 'v', '1', '3', '8', 's', 'p', 'x', 'D', '*']
colors = ["r", "g", "b", "y", "c", (0, 0.1, 0.8), (1, 0.5, 0), (1, 1, 0.3), "m", (0.4, 0.6, 0)]

for cnt, xx in enumerate(data):
    w = som.winner(xx)    # the BMU for this observation
    plot(w[0] + .5, w[1] + .5, markers[labels[cnt]], markerfacecolor='None',
         markeredgecolor=colors[labels[cnt]], markersize=12, markeredgewidth=2)

axis([0, som.weights.shape[0], 0, som.weights.shape[1]])
show()
This code generates a plot similar to the following:
This code delivers a 16 x 16 node SOM plot. As we can see, the map has done a reasonably good job of separating each cluster into topologically distinct areas of the map. Certain classes (particularly the digits five, in cyan circles, and nine, in green stars) have been located over multiple parts of the SOM space. For the most part, though, each class occupies a distinct region, and it's fair to say that the SOM has been reasonably effective. The U-Matrix shows that regions with a high density of points are cohabited by data from multiple classes. This isn't really a surprise, as we saw similar results when plotting with k-means and PCA.