PCA with H2O
We can also use the PCA implementation provided by H2O. (We've already seen H2O in the previous chapter and mentioned it throughout the book.)
With H2O, we first need to start the server with the init method. Then, we dump the dataset to a file (precisely, a CSV file) and finally run the PCA analysis. As the last step, we shut down the server.
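The CSV-dumping step can be tried on its own, without a running H2O server. What follows is only a minimal sketch under a few assumptions: the helper name dump_to_csv is hypothetical, the data is random Gaussian noise standing in for the book's datasets, and tempfile.mkstemp is used to get a file path that stays valid until we delete it ourselves. It writes the array with np.savetxt and reads it back to check that nothing was lost on the way.

```python
import os
import tempfile

import numpy as np


def dump_to_csv(X):
    """Write a 2-D array to a temporary CSV file and return its path.

    Hypothetical helper, for illustration only.
    """
    # mkstemp creates the file and returns an open descriptor plus the path;
    # we close the descriptor and let np.savetxt reopen the file by name.
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    np.savetxt(path, X, delimiter=",")
    return path


# Stand-in data (an assumption): 1,000 rows by 10 columns of Gaussian noise.
rng = np.random.RandomState(101)
X = rng.normal(size=(1000, 10))

csv_path = dump_to_csv(X)

# Read the file back to confirm that the round trip preserves the data.
X_back = np.loadtxt(csv_path, delimiter=",")
shape_back = X_back.shape
roundtrip_ok = np.allclose(X, X_back)

# Clean up the temporary file once we are done with it.
os.remove(csv_path)
```

A file written this way is plain comma-separated text with one row per observation, so any importer that reads CSV files should be able to load it.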
We'll try this implementation on two of the biggest datasets seen so far, namely the one with 100K observations and 100 features and the one with 10K observations and 2,500 features:
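In:
import tempfile
import time

import numpy as np
from sklearn.datasets import make_blobs

import h2o
from h2o.transforms.decomposition import H2OPCA

# Start the H2O server, capping its memory at 4 GB
h2o.init(max_mem_size_GB=4)

def testH2O_pca(nrows, ncols, k=20):
    temp_file = tempfile.NamedTemporaryFile().name
    X, _ = make_blobs(nrows, n_features=ncols, random_state=101)
    # Dump the dataset to a CSV file that H2O can import
    np.savetxt(temp_file, np.c_[X], delimiter=",")
    del X
    pca = H2OPCA(k=k, transform="NONE", pca_method="Power")
    tik = time.time()
    # x lists the indexes of the columns used for training
    pca.train(x=range(100),
              training_frame=h2o.import_file(temp_file))
    print "H2OPCA...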