Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)
Activity 12: Wine t-SNE
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset using the wine.data file included in the accompanying source code and display the first five rows of data:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column contains the labels; extract this column and remove it from the dataset:
labels = df[0]
del df[0]
Execute PCA to reduce the dataset to the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
Determine the amount of variance within the data described by these six components:
np.sum(model_pca.explained_variance_ratio_)
The output is as follows:
0.99999314824536
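Almost all of the variance is captured by the first six components, largely because the Wine features are unscaled, so a few high-magnitude columns dominate the variance. As a quick illustrative sketch (not part of the original activity), you could print the per-component ratios; this assumes the model_pca object fitted in the previous step:

# Illustrative sketch: print how much variance each of the six
# components explains individually.
for i, ratio in enumerate(model_pca.explained_variance_ratio_, start=1):
    print(f'Component {i}: {ratio:.6f}')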
Create a t-SNE model using a specified random state and a verbose value of 1:
tsne_model = TSNE(random_state=0, verbose=1)
tsne_model
The output displays the t-SNE model with its parameter values.
Fit the PCA data to the t-SNE model:
wine_tsne = tsne_model.fit_transform(wine_pca.reshape((len(wine_pca), -1)))
The output shows the verbose log produced while t-SNE fits the data.
Confirm that the shape of the t-SNE fitted data is two dimensional:
wine_tsne.shape
The output is as follows:
(178, 2)
Create a scatter plot of the two-dimensional data:
plt.figure(figsize=(10, 7))
plt.scatter(wine_tsne[:,0], wine_tsne[:,1]);
plt.title('Low Dimensional Representation of Wine');
plt.show()
The output is a scatter plot of the two-dimensional embedding.
Create a secondary scatter plot of the two-dimensional data with the class labels applied to visualize any clustering that may be present:
MARKER = ['o', 'v', '^']
plt.figure(figsize=(10, 7))
plt.title('Low Dimensional Representation of Wine');
for i in range(1, 4):
    selections = wine_tsne[labels == i]
    plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                label=f'Wine {i}', s=30);
plt.legend();
plt.show()
The output is a labeled scatter plot of the three wine classes.
Note that while there is overlap between the wine classes, some clustering is also visible within the data. The first wine class predominantly occupies the top left-hand corner of the plot, the second the bottom right, and the third sits between the two. This representation certainly couldn't be used to classify individual wine samples with great confidence, but it reveals an overall trend, and a series of clusters, within the high-dimensional data that we were unable to see earlier.
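One way to put a rough number on this visual impression, as a sketch that goes beyond the original activity, is scikit-learn's silhouette score computed on the two-dimensional embedding against the known labels:

# Sketch: quantify how well separated the classes are in the embedding.
# Scores near 1 indicate tight, well-separated clusters; scores near 0
# reflect the overlap visible in the scatter plot. Assumes the wine_tsne
# and labels variables from the steps above.
from sklearn.metrics import silhouette_score

print(f'Silhouette score: {silhouette_score(wine_tsne, labels):.3f}')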
Activity 13: t-SNE Wine and Perplexity
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset and inspect the first five rows:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column provides the labels; extract them from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:
labels = df[0]
del df[0]
Execute PCA on the dataset and extract the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
wine_pca = wine_pca.reshape((len(wine_pca), -1))
Construct a loop that iterates through the perplexity values (1, 5, 20, 30, 80, 160, 320). For each loop, generate a t-SNE model with the corresponding perplexity and print a scatter plot of the labeled wine classes. Note the effect of different perplexity values:
MARKER = ['o', 'v', '^']
for perp in [1, 5, 20, 30, 80, 160, 320]:
    tsne_model = TSNE(random_state=0, verbose=1, perplexity=perp)
    wine_tsne = tsne_model.fit_transform(wine_pca)
    plt.figure(figsize=(10, 7))
    plt.title(f'Low Dimensional Representation of Wine. Perplexity {perp}');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                    label=f'Wine {i}', s=30);
    plt.legend();
A perplexity value of 1 fails to separate the data into any particular structure:
Increasing the perplexity to 5 leads to a very non-linear structure that is difficult to separate, and it's hard to identify any clusters or patterns:
A perplexity of 20 finally starts to show some sort of horseshoe structure. While visually obvious, the structure is still tricky to separate into distinct clusters:
A perplexity of 30 demonstrates quite good results: the projected structure is more linear, with some separation between the types of wine:
Finally, the last two images in the activity show the extent to which the plots can become increasingly complex and non-linear with increasing perplexity:
Here's the plot for a perplexity of 160:
Looking at the individual plots for each of the perplexity values, the effect perplexity has on the visualization of the data is immediately obvious. Very small or very large perplexity values produce a range of unusual shapes that don't indicate the presence of any persistent pattern. The most plausible value seems to be 30, which produced the most linear plot, the same representation we saw in the previous activity.
In this activity, we demonstrated the need to be careful when selecting the perplexity and that some iteration may be required to determine the correct value.
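Beyond visual inspection, one illustrative way to compare perplexity settings (an addition of mine, not part of the original activity) is the final Kullback-Leibler divergence that scikit-learn stores on a fitted model as the kl_divergence_ attribute:

# Sketch: report the final KL divergence for each perplexity setting.
# Lower is better in principle, but changing perplexity also changes the
# underlying probability distributions, so treat the values only as a
# rough guide, not a strictly comparable metric.
for perp in [1, 5, 20, 30, 80, 160, 320]:
    tsne_model = TSNE(random_state=0, perplexity=perp)
    embedding = tsne_model.fit_transform(wine_pca)
    print(f'Perplexity {perp:3d}: KL divergence = {tsne_model.kl_divergence_:.3f}')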
Activity 14: t-SNE Wine and Iterations
Solution:
Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
Load the Wine dataset and inspect the first five rows:
df = pd.read_csv('wine.data', header=None)
df.head()
The output shows the first five rows of the Wine dataset.
The first column provides the labels; extract these from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:
labels = df[0]
del df[0]
Execute PCA on the dataset and extract the first six components:
model_pca = PCA(n_components=6)
wine_pca = model_pca.fit_transform(df)
wine_pca = wine_pca.reshape((len(wine_pca), -1))
Construct a loop that iterates through the iteration values (250, 500, 1000). For each loop, generate a t-SNE model with the corresponding number of iterations and an identical value for the number of iterations without progress:
MARKER = ['o', 'v', '^']
for iterations in [250, 500, 1000]:
    # Both the iteration budget and the early-stopping patience are set
    # to the same value, per the activity instructions
    model_tsne = TSNE(random_state=0, verbose=1, n_iter=iterations,
                      n_iter_without_progress=iterations)
    wine_tsne = model_tsne.fit_transform(wine_pca)
Inside the same loop, construct a scatter plot of the labeled wine classes. Note the effect of different iteration values:
    plt.figure(figsize=(10, 7))
    plt.title(f'Low Dimensional Representation of Wine (iterations = {iterations})');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1],
                    label=f'Wine {i}', s=30);
    plt.legend();
Here's the plot for 250 iterations:
Here's the plot for 500 iterations:
Here's the plot for 1,000 iterations:
Again, we can see the improvement in the structure of the data as the number of iterations increases. Even in a relatively simple dataset such as this, 250 iterations are not sufficient to project any structure of the data into the lower-dimensional space.
As we observed in the corresponding exercise, there is a balance to strike in setting the iteration parameter. In this example, 250 iterations were insufficient, and at least 1,000 iterations were required for the embedding to stabilize.
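As an illustrative follow-up (my own sketch, not part of the original activity), the same kl_divergence_ attribute used earlier can help confirm this: once the final divergence stops improving as the iteration budget grows, the embedding has effectively stabilized:

# Sketch: watch the final KL divergence as the iteration budget increases.
# A value that plateaus suggests additional iterations will not change
# the embedding. Assumes wine_pca from the steps above.
for iterations in [250, 500, 1000]:
    model_tsne = TSNE(random_state=0, n_iter=iterations,
                      n_iter_without_progress=iterations)
    embedding = model_tsne.fit_transform(wine_pca)
    print(f'{iterations:4d} iterations: KL divergence = {model_tsne.kl_divergence_:.3f}')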