Chapter 4: Dimension Reduction and PCA
Activity 6: Manual PCA versus scikit-learn
Solution
Import the pandas, numpy, and matplotlib plotting libraries and the scikit-learn PCA model:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
Load the dataset and select only the sepal features as per the previous exercises. Display the first five rows of the data:
df = pd.read_csv('iris-data.csv')
df = df[['Sepal Length', 'Sepal Width']]
df.head()
The output is as follows:
Compute the covariance matrix for the data:
cov = np.cov(df.values.T)
cov
The output is as follows:
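If you want to verify what np.cov is computing here, the following optional sketch (not part of the original activity) reproduces the sample covariance directly from the mean-centred data; it assumes the df and cov variables defined above:

X = df.values
X_centred = X - X.mean(axis=0)                               # centre each feature
manual_cov = X_centred.T.dot(X_centred) / (X.shape[0] - 1)   # sample covariance (divides by n - 1)
print(np.allclose(manual_cov, cov))                          # expected: True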
Transform the data using the scikit-learn API and only the first principal component. Store the transformed data in the sklearn_pca variable:
model = PCA(n_components=1)
sklearn_pca = model.fit_transform(df.values)
Transform the data using the manual PCA and only the first principal component. Store the transformed data in the manual_pca variable.
eigenvectors, eigenvalues, _ = np.linalg.svd(cov, full_matrices=False)
P = eigenvectors[0]
manual_pca = P.dot(df.values.T)
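As an optional cross-check (an addition here, not a step from the activity), the same principal direction can be obtained from an eigendecomposition of the symmetric covariance matrix using np.linalg.eigh, which returns the eigenvalues in ascending order and the eigenvectors as columns:

eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues, eigenvectors as columns
first_pc = eigvecs[:, -1]                # eigenvector for the largest eigenvalue
first_pc                                 # same axis as P, possibly with the opposite sign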
Plot the sklearn_pca and manual_pca values on the same plot to visualize the difference:
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
Notice that the two plots look almost identical, except that one is a mirror image of the other and there is an offset between the two. Display the components used by the scikit-learn model and by the manual PCA:
model.components_
The output is as follows:
array([[ 0.99693955, -0.07817635]])
Now print P:
P
The output is as follows:
array([-0.99693955, 0.07817635])
Notice that the absolute values are identical but the signs are flipped, which produces the mirror-image result. This is simply a difference in convention: both vectors describe the same axis, just pointing in opposite directions.
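As an optional, read-only check (not required for the steps that follow), you can confirm that the two component vectors point in exactly opposite directions by taking their dot product; since both are unit vectors, a value of approximately -1 indicates a pure sign flip:

np.dot(P, model.components_[0])   # expected: approximately -1.0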
Multiply the manual_pca values by -1 and re-plot:
manual_pca *= -1
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
Now, all we need to do is deal with the offset between the two. The scikit-learn API subtracts the mean of the data prior to the transform. Subtract the mean of each column from the dataset before completing the transform with manual PCA:
mean_vals = np.mean(df.values, axis=0)
offset_vals = df.values - mean_vals
manual_pca = P.dot(offset_vals.T)
Multiply the result by -1:
manual_pca *= -1
Re-plot the individual sklearn_pca and manual_pca values:
plt.figure(figsize=(10, 7))
plt.plot(sklearn_pca, label='Scikit-learn PCA')
plt.plot(manual_pca, label='Manual PCA', linestyle='--')
plt.xlabel('Sample')
plt.ylabel('Transformed Value')
plt.legend();
The output is as follows:
The final plot demonstrates that the dimensionality reduction completed by the two methods is, in fact, the same. The differences lie in the signs of the component vectors, as the two methods simply point the same axis in opposite directions, and in an offset that arises because the scikit-learn PCA subtracts the mean of each feature before executing the transform.
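If you would like to confirm this numerically rather than visually, a short optional check (assuming the sklearn_pca and final manual_pca variables from the steps above) is to compare the two transforms directly:

np.allclose(sklearn_pca.ravel(), manual_pca)   # expected: True, to within floating-point precision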
Activity 7: PCA Using the Expanded Iris Dataset
Solution:
Import pandas, numpy, and matplotlib, along with the scikit-learn PCA model. To enable 3D plotting, you will also need to import Axes3D:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D  # Required for 3D plotting
Read in the dataset and select the columns Sepal Length, Sepal Width, and Petal Width:
df = pd.read_csv('iris-data.csv')[['Sepal Length', 'Sepal Width', 'Petal Width']]
df.head()
The output is as follows:
Plot the data in three dimensions:
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'])
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Expanded Iris Dataset');
The output is as follows:
Create a PCA model without specifying the number of components:
model = PCA()
Fit the model to the dataset:
model.fit(df.values)
The output is as follows:
Display the explained_variance_ratio_ values (the eigenvalues normalized by the total variance):
model.explained_variance_ratio_
The output is as follows:
array([0.8004668 , 0.14652357, 0.05300962])
We want to reduce the dimensionality of the dataset while still keeping at least 90% of the variance. What is the minimum number of components required to keep 90% of the variance?
The first two components are required to retain at least 90% of the variance; together they account for approximately 94.7% of the variance within the dataset (0.8005 + 0.1465).
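If you prefer to determine this programmatically, the following sketch (assuming the fitted full PCA model from the previous step) accumulates the explained variance ratios and finds the first point at which the running total reaches 90%:

cumulative = np.cumsum(model.explained_variance_ratio_)
n_required = np.argmax(cumulative >= 0.9) + 1   # first index reaching 90%, converted to a count
print(cumulative)                               # approximately [0.80, 0.95, 1.00]
print(n_required)                               # expected: 2

Note that scikit-learn also accepts a float between 0 and 1 for n_components (for example, PCA(n_components=0.9)), in which case it selects the smallest number of components that explains at least that fraction of the variance.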
Create a new PCA model, this time specifying the number of components required to keep at least 90% of the variance:
model = PCA(n_components=2)
Transform the data using the new model:
data_transformed = model.fit_transform(df.values)
Plot the transformed data:
plt.figure(figsize=(10, 7))
plt.scatter(data_transformed[:,0], data_transformed[:,1]);
The output is as follows:
Restore the transformed data to the original dataspace:
data_restored = model.inverse_transform(data_transformed)
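Optionally, and as an aside from the activity, you can quantify how much information was discarded by the reduction by measuring the reconstruction error between the original and restored data (assuming the df and data_restored variables above):

reconstruction_mse = np.mean((df.values - data_restored) ** 2)   # mean squared error per value
print(reconstruction_mse)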
Plot the restored data in three dimensions in one subplot and the original data in a second subplot to visualize the effect of removing some of the variance:
fig = plt.figure(figsize=(10, 14))

# Original Data
ax = fig.add_subplot(211, projection='3d')
ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'], label='Original Data')
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Expanded Iris Dataset')

# Transformed Data
ax = fig.add_subplot(212, projection='3d')
ax.scatter(data_restored[:,0], data_restored[:,1], data_restored[:,2], label='Restored Data')
ax.set_xlabel('Sepal Length (mm)')
ax.set_ylabel('Sepal Width (mm)')
ax.set_zlabel('Petal Width (mm)')
ax.set_title('Restored Iris Dataset');
The output is as follows:
Looking at Figure 4.52, we can see that, as with the 2D plots, we have removed much of the noise within the data but retained the most important information about its underlying trends. It can be seen that, in general, sepal length increases with petal width, and that there appear to be two clusters of data within the plots, one sitting above the other.
Note
When applying PCA, it is important to keep in mind the size of the data being modelled, as well as the available system memory. The singular value decomposition process involves decomposing the data into its eigenvalues and eigenvectors and can be quite memory intensive. If the dataset is too large, you may be unable to complete the process, suffer a significant performance penalty, or lock up your system.
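Where a dataset is too large to decompose in one pass, one option (a brief sketch, not part of this activity) is scikit-learn's IncrementalPCA, which fits the model in mini-batches so that only one batch needs to be held in memory at a time:

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=2, batch_size=50)   # batch_size chosen for illustration
ipca_transformed = ipca.fit_transform(df.values)       # same shape as the PCA output above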