Chapter 4: Dimension Reduction and PCA
Activity 6: Manual PCA versus scikit-learn
Solution
Import the pandas, numpy, and matplotlib plotting libraries and the scikit-learn PCA model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
Load the dataset and select only the sepal features as per the previous exercises. Display the first five rows of the data:
df = pd.read_csv('iris-data.csv')
df = df[['Sepal Length', 'Sepal Width']]
df.head()
The output is as follows:
Compute the covariance matrix for the data:
cov = np.cov(df.values.T)
cov
The output is as follows:
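If you want to verify what np.cov is computing here, the following optional sketch (not part of the original activity) reproduces the sample covariance directly from the mean-centred data; it assumes the df and cov variables defined above:

X = df.values
X_centred = X - X.mean(axis=0)                               # centre each feature
manual_cov = X_centred.T.dot(X_centred) / (X.shape[0] - 1)   # sample covariance (divides by n - 1)
print(np.allclose(manual_cov, cov))                          # expected: True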
Transform the data using the scikit-learn API and only the first principal component. Store the transformed data in the sklearn_pca variable:
model = PCA(n_components=1)
sklearn_pca = model.fit_transform(df.values)
Transform the data using the manual PCA and only the first principal component. Store the transformed data in the manual_pca variable.
eigenvectors, eigenvalues, _ = np.linalg.svd(cov, full_matrices=False)
P = eigenvectors[0]
manual_pca = P.dot(df.values.T)
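As an optional cross-check (an addition here, not a step from the activity), the same principal direction can be obtained from an eigendecomposition of the symmetric covariance matrix using np.linalg.eigh, which returns the eigenvalues in ascending order and the eigenvectors as columns:

eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues, eigenvectors as columns
first_pc = eigvecs[:, -1]                # eigenvector for the largest eigenvalue
first_pc                                 # same axis as P, possibly with the opposite sign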
Plot the sklearn_pca and manual_pca values on the same plot to visualize the difference:
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
Notice that the two plots look almost identical, except that one is a mirror image of the other and there is an offset between the two. Display the components used by the scikit-learn model and by the manual PCA:
model.components_
The output is as follows:
array([[ 0.99693955, -0.07817635]])
Now print P:
P
The output is as follows:
array([-0.99693955, 0.07817635])
Notice that the absolute values are identical but the signs are flipped, which produces the mirror-image result. This is simply a difference in convention: both vectors describe the same axis, just pointing in opposite directions.
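As an optional, read-only check (not required for the steps that follow), you can confirm that the two component vectors point in exactly opposite directions by taking their dot product; since both are unit vectors, a value of approximately -1 indicates a pure sign flip:

np.dot(P, model.components_[0])   # expected: approximately -1.0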
Multiply the manual_pca values by -1 and re-plot:
manual_pca *= -1
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
Now, all we need to do is deal with the offset between the two. The scikit-learn API subtracts the mean of the data prior to the transform. Subtract the mean of each column from the dataset before completing the transform with manual PCA:
mean_vals = np.mean(df.values, axis=0)
offset_vals = df.values - mean_vals
manual_pca = P.dot(offset_vals.T)
Multiply the result by -1:
manual_pca *= -1
Re-plot the individual sklearn_pca and manual_pca values:
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
The final plot demonstrates that the dimensionality reduction completed by the two methods is, in fact, the same. The differences lie in the signs of the component vectors, as the two methods simply point the same axis in opposite directions, and in an offset that arises because the scikit-learn PCA subtracts the mean of each feature before executing the transform.
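If you would like to confirm this numerically rather than visually, a short optional check (assuming the sklearn_pca and final manual_pca variables from the steps above) is to compare the two transforms directly:

np.allclose(sklearn_pca.ravel(), manual_pca)   # expected: True, to within floating-point precision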
Activity 7: PCA Using the Expanded Iris Dataset
Solution:
Import pandas, numpy, and matplotlib, along with the scikit-learn PCA model. To enable 3D plotting, you will also need to import Axes3D:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D  # Required for 3D plotting
Read in the dataset and select the columns Sepal Length, Sepal Width, and Petal Width:
df = pd.read_csv('iris-data.csv')[['Sepal Length', 'Sepal Width', 'Petal Width']]
df.head()
The output is as follows:
Plot the data in three dimensions:
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'])
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Expanded Iris Dataset');
The output is as follows:
Create a PCA model without specifying the number of components:
model = PCA()
Fit the model to the dataset:
model.fit(df.values)
The output is as follows:
Display the explained_variance_ratio_ values (the eigenvalues normalized by the total variance):
model.explained_variance_ratio_
The output is as follows:
array([0.8004668 , 0.14652357, 0.05300962])
We want to reduce the dimensionality of the dataset while still keeping at least 90% of the variance. What is the minimum number of components required to keep 90% of the variance?
The first two components are required to retain at least 90% of the variance; together they account for approximately 94.7% of the variance within the dataset (0.8005 + 0.1465).
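If you prefer to determine this programmatically, the following sketch (assuming the fitted full PCA model from the previous step) accumulates the explained variance ratios and finds the first point at which the running total reaches 90%:

cumulative = np.cumsum(model.explained_variance_ratio_)
n_required = np.argmax(cumulative >= 0.9) + 1   # first index reaching 90%, converted to a count
print(cumulative)                               # approximately [0.80, 0.95, 1.00]
print(n_required)                               # expected: 2

Note that scikit-learn also accepts a float between 0 and 1 for n_components (for example, PCA(n_components=0.9)), in which case it selects the smallest number of components that explains at least that fraction of the variance.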
Create a new PCA model, this time specifying the number of components required to keep at least 90% of the variance:
model = PCA(n_components=2)
Transform the data using the new model:
data_transformed = model.fit_transform(df.values)
Plot the transformed data:
plt.figure(figsize=(10, 7))
plt.scatter(data_transformed[:,0], data_transformed[:,1]);
The output is as follows:
Restore the transformed data to the original dataspace:
data_restored = model.inverse_transform(data_transformed)
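Optionally, and as an aside from the activity, you can quantify how much information was discarded by the reduction by measuring the reconstruction error between the original and restored data (assuming the df and data_restored variables above):

reconstruction_mse = np.mean((df.values - data_restored) ** 2)   # mean squared error per value
print(reconstruction_mse)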
Plot the restored data in three dimensions in one subplot and the original data in a second subplot to visualize the effect of removing some of the variance:
fig = plt.figure(figsize=(10, 14))

# Original Data
ax = fig.add_subplot(211, projection='3d')
ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'], label='Original Data')
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Expanded Iris Dataset')

# Transformed Data
ax = fig.add_subplot(212, projection='3d')
ax.scatter(data_restored[:,0], data_restored[:,1], data_restored[:,2], label='Restored Data')
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Restored Iris Dataset');
The output is as follows:
Looking at Figure 4.52, we can see that, as with the 2D plots, we have removed much of the noise within the data but retained the most important information about its underlying trends. It can be seen that, in general, sepal length increases with petal width, and that there appear to be two clusters of data within the plots, one sitting above the other.
Note
When applying PCA, it is important to keep in mind the size of the data being modelled, as well as the available system memory. The singular value decomposition process involves decomposing the data into its eigenvalues and eigenvectors and can be quite memory intensive. If the dataset is too large, you may be unable to complete the process, suffer a significant performance penalty, or lock up your system.
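Where a dataset is too large to decompose in one pass, one option (a brief sketch, not part of this activity) is scikit-learn's IncrementalPCA, which fits the model in mini-batches so that only one batch needs to be held in memory at a time:

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2, batch_size=50)   # batch_size chosen for illustration
ipca_transformed = ipca.fit_transform(df.values)       # same shape as the PCA output above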