How centering and scaling data affects PCA
As with many of the transformations we have worked with previously in this text, the scaling of the features matters a great deal to PCA. Earlier, we mentioned that scikit-learn's PCA automatically centers the data in its transform method, which raises a question: if the module goes to the trouble of subtracting the feature means when projecting data, why doesn't it require pre-centered data when calculating the eigenvectors? The hypothesis here is that centering the data doesn't affect the principal components. Let's test this:
- Let's import our `StandardScaler` module from scikit-learn and center the `iris` dataset:
```python
# import our scaling module
from sklearn.preprocessing import StandardScaler

# center our data (with_std=False subtracts the mean but skips scaling to unit variance)
X_centered = StandardScaler(with_std=False).fit_transform(iris_X)

X_centered[:5,]

array([[-0.74333333,  0.446     , -2.35866667, -0.99866667],
       [-0.94333333, -0...
```
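Why would the hypothesis hold? The covariance matrix on which PCA's eigendecomposition is based already subtracts the feature means internally, so pre-centering the data changes nothing about it. Here is a minimal sketch of that claim, assuming numpy is available and `iris_X` is the array loaded in the earlier sections:

```python
import numpy as np

# np.cov subtracts the column means internally, so pre-centering
# should leave the covariance matrix (and its eigenvectors) unchanged
cov_raw = np.cov(iris_X.T)          # features as rows, hence the transpose
cov_centered = np.cov(X_centered.T)

# the two covariance matrices agree to floating-point precision
print(np.allclose(cov_raw, cov_centered))  # True
```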
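With the centered data in hand, one way to finish the test is to fit PCA on both the raw and the centered matrices and compare the learned components. The following is a minimal sketch rather than a definitive recipe; the two-component setup mirrors the earlier iris examples and is an assumption here:

```python
import numpy as np
from sklearn.decomposition import PCA

# fit one PCA on the raw data and one on the pre-centered data;
# n_components=2 is an assumption carried over from earlier examples
pca_raw = PCA(n_components=2).fit(iris_X)
pca_centered = PCA(n_components=2).fit(X_centered)

# because PCA centers internally during fit, both models end up
# decomposing the same centered matrix
print(np.allclose(pca_raw.components_, pca_centered.components_))  # True
```

If the hypothesis is right, the comparison prints True: the principal components are unaffected by whether we centered the data beforehand.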