Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)
Activity 12: Wine t-SNE
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset using the wine.data file included in the accompanying source code and display the first five rows of data:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column contains the labels; extract this column and remove it from the dataset:
labels = df[0]
del df[0]
Execute PCA to reduce the dataset to the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
Determine the amount of variance within the data described by these six components:
np.sum(model_pca.explained_variance_ratio_)
The output is as follows:
0.99999314824536
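Almost all of the variance is captured by the first six components, largely because the Wine features are unscaled, so a few high-magnitude columns dominate the variance. As a quick illustrative sketch (not part of the original activity), you could print the per-component ratios; this assumes the model_pca object fitted in the previous step:

# Illustrative sketch: print how much variance each of the six
# components explains individually.
for i, ratio in enumerate(model_pca.explained_variance_ratio_, start=1):
    print(f'Component {i}: {ratio:.6f}')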
Create a t-SNE model using a specified random state and a verbose value of 1:
tsne_model = TSNE(random_state=0, verbose=1)
tsne_model
The output displays the t-SNE model with its parameter values.
Fit the PCA data to the t-SNE model:
wine_tsne = tsne_model.fit_transform(wine_pca.reshape((len(wine_pca), -1)))
The output shows the verbose log produced while t-SNE fits the data.
Confirm that the shape of the t-SNE fitted data is two dimensional:
wine_tsne.shape
The output is as follows:
(178, 2)
Create a scatter plot of the two-dimensional data:
plt.figure(figsize=(10, 7))
plt.scatter(wine_tsne[:,0], wine_tsne[:,1]);
plt.title('Low Dimensional Representation of Wine');
plt.show()
The output is a scatter plot of the two-dimensional embedding.
Create a secondary scatter plot of the two-dimensional data with the class labels applied to visualize any clustering that may be present:
MARKER = ['o', 'v', '^']
plt.figure(figsize=(10, 7))
plt.title('Low Dimensional Representation of Wine');
for i in range(1, 4):
    selections = wine_tsne[labels == i]
    plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                label=f'Wine {i}', s=30);
plt.legend();
plt.show()
The output is a labeled scatter plot of the three wine classes.
Note that while there is overlap between the wine classes, some clustering is also visible within the data. The first wine class predominantly occupies the top left-hand corner of the plot, the second the bottom right, and the third sits between the two. This representation certainly couldn't be used to classify individual wine samples with great confidence, but it reveals an overall trend, and a series of clusters, within the high-dimensional data that we were unable to see earlier.
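One way to put a rough number on this visual impression, as a sketch that goes beyond the original activity, is scikit-learn's silhouette score computed on the two-dimensional embedding against the known labels:

# Sketch: quantify how well separated the classes are in the embedding.
# Scores near 1 indicate tight, well-separated clusters; scores near 0
# reflect the overlap visible in the scatter plot. Assumes the wine_tsne
# and labels variables from the steps above.
from sklearn.metrics import silhouette_score

print(f'Silhouette score: {silhouette_score(wine_tsne, labels):.3f}')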
Activity 13: t-SNE Wine and Perplexity
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset and inspect the first five rows:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column provides the labels; extract them from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:
labels = df[0]
del df[0]
Execute PCA on the dataset and extract the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
wine_pca = wine_pca.reshape((len(wine_pca), -1))
Construct a loop that iterates through the perplexity values (1, 5, 20, 30, 80, 160, 320). For each loop, generate a t-SNE model with the corresponding perplexity and print a scatter plot of the labeled wine classes. Note the effect of different perplexity values:
MARKER = ['o', 'v', '^']
for perp in [1, 5, 20, 30, 80, 160, 320]:
    tsne_model = TSNE(random_state=0, verbose=1, perplexity=perp)
    wine_tsne = tsne_model.fit_transform(wine_pca)
    plt.figure(figsize=(10, 7))
    plt.title(f'Low Dimensional Representation of Wine. Perplexity {perp}');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                    label=f'Wine {i}', s=30);
    plt.legend();
A perplexity value of 1 fails to separate the data into any particular structure:
Increasing the perplexity to 5 leads to a very non-linear structure that is difficult to separate, and it's hard to identify any clusters or patterns:
A perplexity of 20 finally starts to show some sort of horseshoe structure. While visually obvious, the structure is still tricky to separate into distinct clusters:
A perplexity of 30 demonstrates quite good results: the projected structure is more linear, with some separation between the types of wine:
Finally, the last two images in the activity show the extent to which the plots can become increasingly complex and non-linear with increasing perplexity:
Here's the plot for a perplexity of 160:
Looking at the individual plots for each of the perplexity values, the effect perplexity has on the visualization of the data is immediately obvious. Very small or very large perplexity values produce a range of unusual shapes that don't indicate the presence of any persistent pattern. The most plausible value seems to be 30, which produced the most linear plot, the same representation we saw in the previous activity.
In this activity, we demonstrated the need to be careful when selecting the perplexity and that some iteration may be required to determine the correct value.
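Beyond visual inspection, one illustrative way to compare perplexity settings (an addition of mine, not part of the original activity) is the final Kullback-Leibler divergence that scikit-learn stores on a fitted model as the kl_divergence_ attribute:

# Sketch: report the final KL divergence for each perplexity setting.
# Lower is better in principle, but changing perplexity also changes the
# underlying probability distributions, so treat the values only as a
# rough guide, not a strictly comparable metric.
for perp in [1, 5, 20, 30, 80, 160, 320]:
    tsne_model = TSNE(random_state=0, perplexity=perp)
    embedding = tsne_model.fit_transform(wine_pca)
    print(f'Perplexity {perp:3d}: KL divergence = {tsne_model.kl_divergence_:.3f}')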
Activity 14: t-SNE Wine and Iterations
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset and inspect the first five rows:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column provides the labels; extract these from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:
labels = df[0]
del df[0]
Execute PCA on the dataset and extract the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
wine_pca = wine_pca.reshape((len(wine_pca), -1))
Construct a loop that iterates through the iteration values (250, 500, 1000). For each loop, generate a t-SNE model with the corresponding number of iterations and an identical value for the number of iterations without progress:
MARKER = ['o', 'v', '^']
for iterations in [250, 500, 1000]:
    # Both the iteration budget and the early-stopping patience are set
    # to the same value, per the activity instructions
    model_tsne = TSNE(random_state=0, verbose=1, n_iter=iterations,
                      n_iter_without_progress=iterations)
    wine_tsne = model_tsne.fit_transform(wine_pca)
Inside the same loop, construct a scatter plot of the labeled wine classes. Note the effect of different iteration values:
    plt.figure(figsize=(10, 7))
    plt.title(f'Low Dimensional Representation of Wine (iterations = {iterations})');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                    label=f'Wine {i}', s=30);
    plt.legend();
Here's the plot for 250 iterations:
Here's the plot for 500 iterations:
Here's the plot for 1,000 iterations:
Again, we can see the improvement in the structure of the data as the number of iterations increases. Even in a relatively simple dataset such as this, 250 iterations are not sufficient to project any structure of the data into the lower-dimensional space.
As we observed in the corresponding exercise, there is a balance to strike in setting the iteration parameter. In this example, 250 iterations were insufficient, and at least 1,000 iterations were required for the embedding to stabilize.
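As an illustrative follow-up (my own sketch, not part of the original activity), the same kl_divergence_ attribute used earlier can help confirm this: once the final divergence stops improving as the iteration budget grows, the embedding has effectively stabilized:

# Sketch: watch the final KL divergence as the iteration budget increases.
# A value that plateaus suggests additional iterations will not change
# the embedding. Assumes wine_pca from the steps above.
for iterations in [250, 500, 1000]:
    model_tsne = TSNE(random_state=0, n_iter=iterations,
                      n_iter_without_progress=iterations)
    embedding = model_tsne.fit_transform(wine_pca)
    print(f'{iterations:4d} iterations: KL divergence = {model_tsne.kl_divergence_:.3f}')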