Applied Unsupervised Learning with Python

Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)


Activity 12: Wine t-SNE

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset using the wine.data file included in the accompanying source code and display the first five rows of data:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.24: The first five rows of the wine dataset

  3. The first column contains the labels; extract this column and remove it from the dataset:

    labels = df[0]
    del df[0]
  4. Execute PCA to reduce the dataset to the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
  5. Determine the amount of variance within the data described by these six components (a per-component breakdown is sketched after these steps):

    np.sum(model_pca.explained_variance_ratio_)

    The output is as follows:

    0.99999314824536
  6. Create a t-SNE model using a specified random state and a verbose value of 1:

    tsne_model = TSNE(random_state=0, verbose=1)
    tsne_model

    The output is as follows:

    Figure 6.25: Creating the t-SNE model

  7. Fit the PCA data to the t-SNE model:

    # The reshape is a no-op here (wine_pca is already two-dimensional),
    # but it guards against accidentally passing a one-dimensional array
    wine_tsne = tsne_model.fit_transform(wine_pca.reshape((len(wine_pca), -1)))

    The output is as follows:

    Figure 6.26: Fitting the PCA data to the t-SNE model

  8. Confirm that the shape of the t-SNE fitted data is two dimensional:

    wine_tsne.shape

    The output is as follows:

    (178, 2)
  9. Create a scatter plot of the two-dimensional data:

    plt.figure(figsize=(10, 7))
    plt.scatter(wine_tsne[:,0], wine_tsne[:,1]);
    plt.title('Low Dimensional Representation of Wine');
    plt.show()

    The output is as follows:

    Figure 6.27: Scatterplot of two-dimensional data

  10. Create a secondary scatter plot of the two-dimensional data with the class labels applied to visualize any clustering that may be present:

    MARKER = ['o', 'v', '^',]
    plt.figure(figsize=(10, 7))
    plt.title('Low Dimensional Representation of Wine');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
        plt.legend();
    plt.show()

    The output is as follows:

    Figure 6.28: Secondary plot of two-dimensional data
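
As a supplement to step 5 (not part of the original solution), the sketch below prints the per-component breakdown of the variance, showing how it accumulates across the six retained components. It assumes the model_pca object fitted in step 4:

    # Per-component and cumulative explained variance for the six components
    print(model_pca.explained_variance_ratio_)
    print(np.cumsum(model_pca.explained_variance_ratio_))

The final entry of the cumulative sum should match the total of approximately 0.99999 reported in step 5.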

Note that while there is some overlap between the wine classes, some clustering is also visible in the data. The first wine class sits predominantly in the top-left corner of the plot, the second in the bottom-right, and the third between the first two. This representation certainly couldn't be used to classify individual wine samples with great confidence, but it reveals an overall trend and a series of clusters within the high-dimensional data that we were unable to see earlier.
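
To put rough numbers on these cluster positions, a quick supplementary check (again, a sketch rather than part of the original activity) is to compute the centroid of each wine class in the embedding, reusing the wine_tsne and labels variables from the preceding steps:

    # Centroid of each wine class in the two-dimensional t-SNE space
    for i in range(1, 4):
        centroid = wine_tsne[labels == i].mean(axis=0)
        print(f'Wine {i} centroid: ({centroid[0]:.1f}, {centroid[1]:.1f})')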

Activity 13: t-SNE Wine and Perplexity

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset and inspect the first five rows:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.29: The first five rows of the wine dataset

  3. The first column provides the labels; extract them from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:

    labels = df[0]
    del df[0]
  4. Execute PCA on the dataset and extract the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
    wine_pca = wine_pca.reshape((len(wine_pca), -1))
  5. Construct a loop that iterates through the perplexity values (1, 5, 20, 30, 80, 160, 320). For each perplexity value, generate a t-SNE model and plot a scatter plot of the labeled wine classes. Note the effect of different perplexity values:

    MARKER = ['o', 'v', '^',]
    for perp in [1, 5, 20, 30, 80, 160, 320]:
        tsne_model = TSNE(random_state=0, verbose=1, perplexity=perp)
        wine_tsne = tsne_model.fit_transform(wine_pca)
        plt.figure(figsize=(10, 7))
        plt.title(f'Low Dimensional Representation of Wine. Perplexity {perp}');
        for i in range(1, 4):
            selections = wine_tsne[labels == i]
            plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
            plt.legend();

    A perplexity value of 1 fails to separate the data into any particular structure:

    Figure 6.30: Plot for perplexity value 1

    Increasing the perplexity to 5 leads to a very non-linear structure that is difficult to separate, and it's hard to identify any clusters or patterns:

    Figure 6.31: Plot for perplexity of 5

    A perplexity of 20 finally starts to show some sort of horseshoe structure. While visually apparent, this structure can still be tricky to use for separating the classes:

    Figure 6.32: Plot for perplexity of 20

    A perplexity of 30 demonstrates quite good results. The projected structure is more linear, with some separation between the types of wine:

    Figure 6.33: Plot for perplexity of 30

    Finally, the last two images in the activity show the extent to which the plots can become increasingly complex and non-linear with increasing perplexity:

    Figure 6.34: Plot for perplexity of 80

    Here's the plot for a perplexity of 160:

    Figure 6.35: Plot for perplexity of 160

Looking at the individual plots for each of the perplexity values, the effect perplexity has on the visualization of the data is immediately obvious. Very small or very large perplexity values produce a range of unusual shapes that don't indicate the presence of any persistent pattern. The most plausible value seems to be 30, which produced the most linear plot we saw in the previous activity.

In this activity, we demonstrated the need to be careful when selecting the perplexity and that some iteration may be required to determine the correct value.
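
A compact way to run this comparison, offered here as a sketch rather than as part of the original solution, is to draw all seven embeddings in a single figure of subplots. It assumes the wine_pca, labels, and MARKER variables defined above; note that recent scikit-learn versions require perplexity to be smaller than the number of samples, so the value 320 may raise an error on this 178-sample dataset:

    # One figure with a panel per perplexity value for side-by-side comparison
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    for ax, perp in zip(axes.ravel(), [1, 5, 20, 30, 80, 160, 320]):
        embedding = TSNE(random_state=0, perplexity=perp).fit_transform(wine_pca)
        for i in range(1, 4):
            sel = embedding[labels == i]
            ax.scatter(sel[:, 0], sel[:, 1], marker=MARKER[i-1], label=f'Wine {i}', s=30)
        ax.set_title(f'Perplexity {perp}')
    axes.ravel()[0].legend()
    axes.ravel()[-1].axis('off')  # seven perplexity values, eight panels
    plt.show()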

Activity 14: t-SNE Wine and Iterations

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset and inspect the first five rows:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.36: The first five rows of the wine dataset

  3. The first column provides the labels; extract these from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:

    labels = df[0]
    del df[0]
  4. Execute PCA on the dataset and extract the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
    wine_pca = wine_pca.reshape((len(wine_pca), -1))
  5. Construct a loop that iterates through the iteration values (250, 500, 1000). For each value, generate a t-SNE model with the corresponding number of iterations and the same number of iterations without progress:

    MARKER = ['o', 'v', '^',]
    for iterations in [250, 500, 1000]:
        model_tsne = TSNE(random_state=0, verbose=1, n_iter=iterations, n_iter_without_progress=iterations)
        wine_tsne = model_tsne.fit_transform(wine_pca)
  6. Construct a scatter plot of the labeled wine classes. Note the effect of different iteration values:

        plt.figure(figsize=(10, 7))
        plt.title(f'Low Dimensional Representation of Wine (iterations = {iterations})');
        for i in range(1, 4):
            selections = wine_tsne[labels == i]
            plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
            plt.legend();

    The output is as follows:

    Figure 6.37: Scatterplot of wine classes with 250 iterations

    Here's the plot for 500 iterations:

    Figure 6.38: Scatterplot of wine classes with 500 iterations

    Here's the plot for 1,000 iterations:

    Figure 6.39: Scatterplot of wine classes with 1,000 iterations

Again, we can see the improvement in the structure of the data as the number of iterations increases. Even in a relatively simple dataset such as this, 250 iterations are not sufficient to project any structure of the data into the lower-dimensional space.

As we observed in the corresponding activity, there is a balance to be struck when setting the iteration parameter. In this example, 250 iterations were insufficient, and at least 1,000 iterations were required for the final stabilization of the data.
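
As a rough numerical complement to the plots (a sketch, not part of the original solution), scikit-learn's TSNE exposes the final KL divergence of the optimization through its kl_divergence_ attribute. With the data and perplexity held fixed, this value is comparable across runs, so a plateau suggests the embedding has stabilized. The parameter names below follow the scikit-learn version used in this chapter; newer releases rename n_iter to max_iter. The sketch assumes wine_pca from step 4:

    # Final KL divergence for each iteration budget
    for iterations in [250, 500, 1000]:
        model = TSNE(random_state=0, n_iter=iterations,
                     n_iter_without_progress=iterations)
        model.fit_transform(wine_pca)
        print(f'{iterations} iterations: KL divergence = {model.kl_divergence_:.3f}')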
