Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Applied Unsupervised Learning with Python

You're reading from   Applied Unsupervised Learning with Python Discover hidden patterns and relationships in unstructured data with Python

Arrow left icon
Product type Paperback
Published in May 2019
Publisher
ISBN-13 9781789952292
Length 482 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (3):
Arrow left icon
Benjamin Johnston Benjamin Johnston
Author Profile Icon Benjamin Johnston
Benjamin Johnston
Christopher Kruger Christopher Kruger
Author Profile Icon Christopher Kruger
Christopher Kruger
Aaron Jones Aaron Jones
Author Profile Icon Aaron Jones
Aaron Jones
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Applied Unsupervised Learning with Python
Preface
1. Introduction to Clustering 2. Hierarchical Clustering FREE CHAPTER 3. Neighborhood Approaches and DBSCAN 4. Dimension Reduction and PCA 5. Autoencoders 6. t-Distributed Stochastic Neighbor Embedding (t-SNE) 7. Topic Modeling 8. Market Basket Analysis 9. Hotspot Analysis Appendix

Chapter 3: Neighborhood Approaches and DBSCAN


Activity 4: Implement DBSCAN from Scratch

Solution:

  1. Generate a random cluster dataset as follows:

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline
    
    X_blob, y_blob = make_blobs(n_samples=500, centers=4, n_features=2, random_state=800)
  2. Visualize the generated data:

    plt.scatter(X_blob[:,0], X_blob[:,1])
    plt.show()

    The output is as follows:

    Figure 3.14: Plot of generated data

  3. Create functions from scratch that allow you to call DBSCAN on a dataset:

    def scratch_DBSCAN(x, eps, min_pts):
        """
        param x (list of vectors): your dataset to be clustered
        param eps (float): neigborhood radius threshold
        param min_pts (int): minimum number of points threshold for a nieghborhood to be a cluster
        """
    
         # Build a label holder that is comprised of all 0s
        labels = [0]* x.shape[0]
    
        # Arbitrary starting "current cluster" ID    
        C = 0
        
        # For each point p in x...
        # ('p' is the index of the datapoint, rather than the datapoint itself.)
        for p in range(0, x.shape[0]):
        
            # Only unvisited points can be evaluated as neighborhood centers
            if not (labels[p] == 0):
                continue
            
            # Find all of p's neighbors.
            neighbors = neighborhood_search(x, p, eps)
            
            # If there are not enough neighbor points, then it is classified as noise (-1).
            # Otherwise we can use this point as a neighborhood cluster
            if len(neighbors) < min_pts:
                labels[p] = -1    
            else: 
                C += 1
                neighbor_cluster(x, labels, p, neighbors, C, eps, min_pts)
        
        return labels
    
    
    def neighbor_cluster(x, labels, p, neighbors, C, eps, min_pts):
    
        # Assign the cluster label to original point
        labels[p] = C
        
        # Look at each neighbor of p (by index, not the points themselves) and evaluate
        i = 0
        while i < len(neighbors):    
            
            # Get the next point from the queue.        
            potential_neighbor_ix = neighbors[i]
           
            # If potential_neighbor_ix is noise from previous runs, we can assign it to current cluster
            if labels[potential_neighbor_ix] == -1:
                labels[potential_neighbor_ix] = C
            
            # Otherwise, if potential_neighbor_ix is unvisited, we can add it to current cluster
            elif labels[potential_neighbor_ix] == 0:
                labels[potential_neighbor_ix] = C
                
                # Further find neighbors of potential neighbor
                potential_neighbors_cluster = neighborhood_search(x, potential_neighbor_ix, eps)
                
                if len(potential_neighbors_cluster) >= min_pts:
                    neighbors = neighbors + potential_neighbors_cluster      
            
            # Evaluate next neighbor
            i += 1        
    
    def neighborhood_search(x, p, eps):
        neighbors = []
        
        # For each point in the dataset...
        for potential_neighbor in range(0, x.shape[0]):
            
            # If a nearby point falls below the neighborhood radius threshold, add to neighbors list
            if np.linalg.norm(x[p] - x[potential_neighbor]) < eps:
                neighbors.append(potential_neighbor)
                
        return neighbors
  4. Use your created DBSCAN implementation to find clusters in the generated dataset. Feel free to use hyperparameters as you see fit, tuning them based on their performance in step five:

    labels = scratch_DBSCAN(X_blob, 0.6, 5)
  5. Visualize the clustering performance of your DBSCAN implementation from scratch:

    plt.scatter(X_blob[:,0], X_blob[:,1], c=labels)
    plt.title("DBSCAN from Scratch Performance")
    plt.show()

    The output is as follows:

    Figure 3.15: Plot of DBSCAN implementation

As you may have noticed, it takes quite some time for a custom implementation to run. This is because we explored the non-vectorized version of this algorithm for the sake of clarity. Moving forward, you should aim to use the DBSCAN implementation provided by scikit-learn, as it is highly optimized.

Activity 5: Comparing DBSCAN with k-means and Hierarchical Clustering

Solution:

  1. Import the necessary packages:

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.metrics import silhouette_score
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Load the wine dataset from Chapter 2, Hierarchical Clustering and familiarize yourself again with what the data looks like:

    # Load Wine data set
    wine_df = pd.read_csv("../CH2/wine_data.csv")
    
    # Show sample of data set
    print(wine_df.head())

    The output is as follows:

    Figure 3.16: First five rows of wine dataset

  3. Visualize the data:

    plt.scatter(wine_df.values[:,0], wine_df.values[:,1])
    plt.title("Wine Dataset")
    plt.xlabel("OD Reading")
    plt.ylabel("Proline")
    plt.show()

    The output is as follows:

    Figure 3.17: Plot of the data

  4. Generate clusters using k-means, agglomerative clustering, and DBSCAN:

    # Generate clusters from K-Means
    km = KMeans(3)
    km_clusters = km.fit_predict(wine_df)
    
    # Generate clusters using Agglomerative Hierarchical Clustering
    ac = AgglomerativeClustering(3, linkage='average')
    ac_clusters = ac.fit_predict(wine_df)
  5. Evaluate a few different options for DSBSCAN hyperparameters and their effect on the silhouette score:

    db_param_options = [[20,5],[25,5],[30,5],[25,7],[35,7],[35,3]]
    
    for ep,min_sample in db_param_options:
        # Generate clusters using DBSCAN
        db = DBSCAN(eps=ep, min_samples = min_sample)
        db_clusters = db.fit_predict(wine_df)
        print("Eps: ", ep, "Min Samples: ", min_sample)
        print("DBSCAN Clustering: ", silhouette_score(wine_df, db_clusters))

    The output is as follows:

    Figure 3.18: Printing the silhouette score for clusters

  6. Generate the final clusters based on the highest silhouette score (eps: 35, min_samples: 3):

    # Generate clusters using DBSCAN
    db = DBSCAN(eps=35, min_samples = 3)
    db_clusters = db.fit_predict(wine_df)
  7. Visualize clusters generated using each of the three methods:

    plt.title("Wine Clusters from K-Means")
    plt.scatter(wine_df['OD_read'], wine_df['Proline'], c=km_clusters,s=50, cmap='tab20b')
    plt.show()
    
    plt.title("Wine Clusters from Agglomerative Clustering")
    plt.scatter(wine_df['OD_read'], wine_df['Proline'], c=ac_clusters,s=50, cmap='tab20b')
    plt.show()
    
    plt.title("Wine Clusters from DBSCAN")
    plt.scatter(wine_df['OD_read'], wine_df['Proline'], c=db_clusters,s=50, cmap='tab20b')
    plt.show()

    The output is as follows:

    Figure 3.19: Plot of clusters using different algorithms

  8. Evaluate the silhouette score of each approach:

    # Calculate Silhouette Scores
    print("Silhouette Scores for Wine Dataset:\n")
    print("K-Means Clustering: ", silhouette_score(wine_df, km_clusters))
    print("Agg Clustering: ", silhouette_score(wine_df, ac_clusters))
    print("DBSCAN Clustering: ", silhouette_score(wine_df, db_clusters))

    The output is as follows:

    Figure 3.20: Silhouette score

As you can see, DBSCAN isn't automatically the best choice for your clustering needs. One key trait that makes it different form other algorithms is the use of noise as a potential clustering. In some cases, this is great, as it removes outliers, however, there may be situations where it is not tuned well enough and classifies too many points as noise. Can you improve the silhouette score by tuning the hyperparameters?

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime