Data Science with Python

Combine Python with machine learning principles to discover hidden patterns in raw data

Product type: Paperback
Published in: Jul 2019
Publisher: Packt
ISBN-13: 9781838552862
Length: 426 pages
Edition: 1st Edition
Authors (3):
Rohan Chopra
Mohamed Noordeen Alaudeen
Aaron England
Table of Contents

About the Book
1. Introduction to Data Science and Data Pre-Processing
2. Data Visualization
3. Introduction to Machine Learning via Scikit-Learn
4. Dimensionality Reduction and Unsupervised Learning
5. Mastering Structured Data
6. Decoding Images
7. Processing Human Language
8. Tips and Tricks of the Trade
Appendix

Chapter 4: Dimensionality Reduction and Unsupervised Learning

Activity 12: Ensemble k-means Clustering and Calculating Predictions

Solution:

After the glass dataset has been imported, shuffled, and standardized (see Exercise 58):

  1. Instantiate an empty data frame to append each model and save it as the new data frame object labels_df with the following code:

    import pandas as pd

    labels_df = pd.DataFrame()

  2. Import the KMeans class outside of the loop using the following:

    from sklearn.cluster import KMeans

  3. Start a loop of 100 iterations as follows:

    for i in range(0, 100):

  4. Inside the loop, save a KMeans model object with two clusters (arbitrarily decided upon, a priori) using:

        model = KMeans(n_clusters=2)

  5. Still inside the loop, fit the model to scaled_features using the following:

        model.fit(scaled_features)

  6. Generate the labels array and save it as the labels object, as follows:

        labels = model.labels_

  7. As the last statement in the loop body, store labels as a column in labels_df named after the iteration using the code:

        labels_df['Model_{}_Labels'.format(i+1)] = labels

  8. After labels have been generated for each of the 100 models (see Activity 21), calculate the mode for each row using the following code:

    # mode() returns a data frame; keep column 0 so that a tie still yields one value per row
    row_mode = labels_df.mode(axis=1)[0]

  9. Assign row_mode to a new column in labels_df, as shown in the following code:

    labels_df['row_mode'] = row_mode

  10. View the first five rows of labels_df using the following code:

    print(labels_df.head(5))

Figure 4.24: First five rows of labels_df

We have increased the confidence in our predictions by fitting numerous models, saving the labels from each one, and taking the mode across models as the final prediction for each observation. However, every one of these models used a predetermined number of clusters. Unless we know the number of clusters a priori, we will want to discover the optimal number of clusters for segmenting our observations.
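The steps of Activity 12 can be assembled into a single runnable loop. The following is a minimal sketch, not the authors' exact script. It makes several assumptions: the glass dataset is not reproduced here, so scaled_features is a synthetic stand-in built with make_blobs and StandardScaler; only 20 models are fitted instead of 100 to keep the run short; n_init and random_state are set explicitly for repeatability; and column 0 of the mode result is used to break ties.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the standardized glass features.
X, _ = make_blobs(n_samples=200, centers=2, n_features=4, random_state=0)
scaled_features = StandardScaler().fit_transform(X)

n_models = 20  # hypothetical setting; the activity uses 100
labels_df = pd.DataFrame()

for i in range(n_models):
    # Two clusters, chosen a priori as in the activity.
    model = KMeans(n_clusters=2, n_init=10, random_state=i)
    model.fit(scaled_features)
    # One column of cluster labels per fitted model.
    labels_df['Model_{}_Labels'.format(i + 1)] = model.labels_

# Row-wise mode across all models; column 0 is the smallest modal value per row.
labels_df['row_mode'] = labels_df.mode(axis=1)[0]

print(labels_df.head(5))
```

One caveat when reading the result: k-means numbers its clusters arbitrarily, so label 0 in one model may correspond to label 1 in another. The row mode is easiest to interpret after checking that the label columns agree across models.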

Activity 13: Evaluating Mean Inertia by Cluster after PCA Transformation

Solution:

  1. Instantiate a PCA model with the n_components argument set to best_n_components (recall that best_n_components = 6) as follows:

    from sklearn.decomposition import PCA

    model = PCA(n_components=best_n_components)

  2. Fit the model to scaled_features and transform them into the six components, as shown here:

    df_pca = model.fit_transform(scaled_features)

  3. Import numpy and the KMeans class outside the loops using the following code:

    from sklearn.cluster import KMeans

    import numpy as np

  4. Inside the outer loop, before the inner loop starts, instantiate an empty list, inertia_list, to which we will append an inertia value after each inner iteration, using the following code:

    inertia_list = []

  5. In the inner for loop, we will iterate through 100 models as follows:

    for i in range(100):

  6. Build our KMeans model with n_clusters=x using:

    model = KMeans(n_clusters=x)

    Note

    The value of x is set by the outer loop, which is covered in steps 10 and 11.

  7. Fit the model to df_pca as follows:

    model.fit(df_pca)

  8. Get the inertia value and save it to the object inertia using the following code:

    inertia = model.inertia_

  9. Append inertia to inertia_list using the following code:

    inertia_list.append(inertia)

  10. Moving to the outer loop, instantiate another empty list before it starts, to store the average inertia values, using the following code:

    mean_inertia_list_PCA = []

  11. Since we want to check the average inertia over 100 models for n_clusters 1 through 10, we will instantiate the outer loop as follows:

    for x in range(1, 11):

  12. After the inner loop has run through its 100 iterations, and the inertia value for each of the 100 models has been appended to inertia_list, compute the mean of this list and save it as mean_inertia using the following code:

    mean_inertia = np.mean(inertia_list)

  13. Append mean_inertia to mean_inertia_list_PCA using the following code:

    mean_inertia_list_PCA.append(mean_inertia)

  14. Print mean_inertia_list_PCA to the console using the following code:

    print(mean_inertia_list_PCA)

  15. Notice the output in the following screenshot:
Figure 4.25: mean_inertia_list_PCA
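The steps of Activity 13 are described from the inner loop outward, so it may help to see them in execution order. The following is a minimal sketch under stated assumptions, not the authors' exact script: scaled_features is a synthetic stand-in with nine columns (the source uses the glass dataset), best_n_components = 6 follows the text, the inner loop fits 5 models per cluster count instead of 100 to keep the run short, and n_init and random_state are set explicitly for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the standardized glass features.
X, _ = make_blobs(n_samples=200, centers=3, n_features=9, random_state=0)
scaled_features = StandardScaler().fit_transform(X)

best_n_components = 6  # value stated in the text
df_pca = PCA(n_components=best_n_components).fit_transform(scaled_features)

n_models = 5  # hypothetical setting; the activity uses 100
mean_inertia_list_PCA = []

for x in range(1, 11):  # outer loop: n_clusters from 1 through 10
    inertia_list = []   # reset for every cluster count
    for i in range(n_models):  # inner loop: repeated fits
        model = KMeans(n_clusters=x, n_init=10, random_state=i)
        model.fit(df_pca)
        inertia_list.append(model.inertia_)
    # Average inertia over the repeated fits for this cluster count.
    mean_inertia_list_PCA.append(np.mean(inertia_list))

print(mean_inertia_list_PCA)
```

Creating inertia_list inside the outer loop is the design choice to watch: if the list were created once before both loops, each mean would also include inertia values from the smaller cluster counts.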