Applied Unsupervised Learning with Python

Chapter 4: Dimension Reduction and PCA


Activity 6: Manual PCA versus scikit-learn

Solution

  1. Import the pandas, numpy, and matplotlib plotting libraries and the scikit-learn PCA model:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
  2. Load the dataset and select only the sepal features as per the previous exercises. Display the first five rows of the data:

    df = pd.read_csv('iris-data.csv')
    df = df[['Sepal Length', 'Sepal Width']]
    df.head()

    The output is as follows:

    Figure 4.43: The first five rows of the data

  3. Compute the covariance matrix for the data:

    cov = np.cov(df.values.T)
    cov

    The output is as follows:

    Figure 4.44: The covariance matrix for the data

  4. Transform the data using the scikit-learn API and only the first principal component. Store the transformed data in the sklearn_pca variable:

    model = PCA(n_components=1)
    sklearn_pca = model.fit_transform(df.values)
  5. Transform the data using the manual PCA and only the first principal component. Store the transformed data in the manual_pca variable.

    eigenvectors, eigenvalues, _ = np.linalg.svd(cov, full_matrices=False)
    P = eigenvectors[0]
    manual_pca = P.dot(df.values.T)
  6. Plot the sklearn_pca and manual_pca values on the same plot to visualize the difference:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.45: A plot of the data

  7. Notice that the two plots look almost identical, except that one is a mirror image of the other and there is an offset between the two. Display the components used by the scikit-learn and manual PCA models:

    model.components_

    The output is as follows:

    array([[ 0.99693955, -0.07817635]])

    Now print P:

    P

    The output is as follows:

    array([-0.99693955,  0.07817635])

    Notice the difference in signs: the values are identical, but the signs are flipped, producing the mirror-image result. This is purely a difference in convention; an eigenvector is only defined up to its sign, so both answers are equally valid, as the quick check below confirms.
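
    A minimal way to confirm that the two component vectors agree once the sign is ignored is to compare their absolute values:

    # The scikit-learn component and the manual eigenvector match up to sign
    np.allclose(np.abs(model.components_[0]), np.abs(P))
    # Expected output: True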

  8. Multiply the manual_pca values by -1 and re-plot:

    manual_pca *= -1
    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.46: Re-plotted data

  9. Now, all we need to do is deal with the offset between the two. The scikit-learn API subtracts the mean of the data prior to the transform. Subtract the mean of each column from the dataset before completing the transform with manual PCA:

    mean_vals = np.mean(df.values, axis=0)
    offset_vals = df.values - mean_vals
    manual_pca = P.dot(offset_vals.T)
  10. Multiply the result by -1:

    manual_pca *= -1
  11. Re-plot the individual sklearn_pca and manual_pca values:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.47: Re-plotting the data

The final plot demonstrates that the dimensionality reduction completed by the two methods is, in fact, the same. The differences lay in the signs of the principal components (an eigenvector is only defined up to its sign, so the two methods simply landed on opposite conventions) and in an offset between the two transformed datasets, which arises because scikit-learn subtracts the mean of each feature before executing the transform.
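
As a final check (a minimal sketch, assuming the variables from the preceding steps are still in scope), the two transformed series should now agree element-wise, and the mean stored by the fitted scikit-learn model should match the manually computed column means:

    # sklearn_pca has shape (n_samples, 1); flatten it before comparing
    print(np.allclose(sklearn_pca.ravel(), manual_pca))
    # The fitted model exposes the per-feature mean it subtracted
    print(np.allclose(model.mean_, mean_vals))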

Activity 7: PCA Using the Expanded Iris Dataset

Solution

  1. Import pandas, numpy, matplotlib, and the scikit-learn PCA model. To enable 3D plotting, you will also need to import Axes3D:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from mpl_toolkits.mplot3d import Axes3D # Required for 3D plotting
  2. Read in the dataset and select the columns Sepal Length, Sepal Width, and Petal Width:

    df = pd.read_csv('iris-data.csv')[['Sepal Length', 'Sepal Width', 'Petal Width']]
    df.head()

    The output is as follows:

    Figure 4.48: Sepal Length, Sepal Width, and Petal Width

  3. Plot the data in three dimensions:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width']);
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');

    The output is as follows:

    Figure 4.49: Expanded Iris dataset plot

  4. Create a PCA model without specifying the number of components:

    model = PCA()
  5. Fit the model to the dataset:

    model.fit(df.values)

    The output is as follows:

    Figure 4.50: The model fitted to the dataset

  6. Display the explained variance ratios (the eigenvalues normalized by their sum) using explained_variance_ratio_:

    model.explained_variance_ratio_

    The output is as follows:

    array([0.8004668 , 0.14652357, 0.05300962])
  7. We want to reduce the dimensionality of the dataset, but still keep at least 90% of the variance. What is the minimum number of components required to keep 90% of the variance?

    The first two components are required; together they provide approximately 94.7% of the variance within the dataset, as the cumulative sum below confirms.
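
    A minimal check of this uses the cumulative sum of the explained variance ratios from the model fitted in step 5:

    # Cumulative proportion of the variance explained by the first k components
    np.cumsum(model.explained_variance_ratio_)
    # Expect roughly array([0.80, 0.947, 1.0]); the second entry already exceeds 0.9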

  8. Create a new PCA model, this time specifying the number of components required to keep at least 90% of the variance:

    model = PCA(n_components=2)
  9. Transform the data using the new model:

    data_transformed = model.fit_transform(df.values)
  10. Plot the transformed data:

    plt.figure(figsize=(10, 7))
    plt.scatter(data_transformed[:,0], data_transformed[:,1]);

    The output is as follows:

    Figure 4.51: Plot of the transformed data

  11. Restore the transformed data to the original data space (see the sketch after this step for what inverse_transform does internally):

    data_restored = model.inverse_transform(data_transformed)
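
    With the default settings (no whitening), inverse_transform simply projects the reduced data back through the components and re-adds the per-feature mean, so the restoration can also be reproduced manually as a minimal sketch:

    # Equivalent manual reconstruction: project back and re-add the per-feature mean
    manual_restore = data_transformed.dot(model.components_) + model.mean_
    np.allclose(data_restored, manual_restore)
    # Expected output: True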
  12. Plot the restored data in three dimensions in one subplot and the original data in a second subplot to visualize the effect of removing some of the variance:

    fig = plt.figure(figsize=(10, 14))
    
    # Original Data
    ax = fig.add_subplot(211, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'], label='Original Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');
    
    # Transformed Data
    ax = fig.add_subplot(212, projection='3d')
    ax.scatter(data_restored[:,0], data_restored[:,1], data_restored[:,2], label='Restored Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Restored Iris Dataset');

    The output is as follows:

    Figure 4.52: Plot of the expanded and the restored Iris datasets

Looking at Figure 4.52, we can see that, as with the 2D plots, we have removed much of the noise within the data while retaining the most important information about its underlying trends. In general, sepal length increases with petal width, and there seem to be two clusters of data points within the plots, one sitting above the other.
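
If you want to quantify how much information was discarded (a minimal sketch; the exact value will depend on the data file used), compare the restored data with the original and note that the residual is small relative to the feature scales:

    # Mean squared reconstruction error after dropping the third component
    reconstruction_error = np.mean((df.values - data_restored) ** 2)
    print(reconstruction_error)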

Note

When applying PCA, it is important to keep in mind the size of the data being modelled, as well as the available system memory. The singular value decomposition process involves decomposing the data into its eigenvalues and eigenvectors and can be quite memory intensive. If the dataset is too large, you may be unable to complete the process, may suffer a significant performance loss, or may lock up your system.
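
If the dataset does not fit comfortably in memory, scikit-learn offers alternatives that trade a little precision for a much smaller footprint, such as a randomized SVD solver or IncrementalPCA, which fits the model on batches of samples. A minimal sketch (the batch size here is an illustrative assumption, not part of the activity):

    from sklearn.decomposition import IncrementalPCA

    # Process the data in manageable batches instead of all at once
    ipca = IncrementalPCA(n_components=2, batch_size=50)
    data_transformed = ipca.fit_transform(df.values)

    # Alternatively, a randomized solver reduces the cost of the decomposition itself
    model = PCA(n_components=2, svd_solver='randomized')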
