You're reading from Data Cleaning and Exploration with Machine Learning: Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Product type Paperback

Published in Aug 2022

Publisher Packt

ISBN-13 9781803241678

Length 542 pages

Edition 1st Edition

Concepts

Machine Learning

Author (1):

Michael Walker

Implementing DBSCAN clustering

DBSCAN is a very flexible approach to clustering. We only need to specify two hyperparameters: ɛ, also referred to as eps, and the minimum number of samples. As we have discussed, the ɛ value determines the size of the ɛ-neighborhood around an instance. The minimum samples hyperparameter indicates how many instances must fall within that ɛ-neighborhood for the instance to be considered a core instance.
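
To make the two hyperparameters concrete, here is a minimal pure-Python sketch of the core-instance test. It is not scikit-learn's implementation. The function name `core_instances` and the toy points are made up for illustration, and the sketch assumes Euclidean distance and that an instance counts toward its own neighborhood (the convention scikit-learn documents for `min_samples`).

```python
import math

def core_instances(points, eps, min_samples):
    """Return one boolean per point: True if the point is a core instance."""
    flags = []
    for p in points:
        # Count instances within eps of p. The point p is included,
        # because its distance to itself is 0.
        n_neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        flags.append(n_neighbors >= min_samples)
    return flags

# Hypothetical toy data: three tightly packed points and one far away.
toy_points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0)]
print(core_instances(toy_points, eps=0.2, min_samples=3))
```

With eps=0.2 and min_samples=3, the three packed points are flagged as core instances and the isolated point is not. Shrinking eps or raising min_samples makes the test stricter, so fewer instances qualify as core.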

Note

We use DBSCAN to cluster the same income gap data that we worked with in the previous section.

Let’s build a DBSCAN clustering model:

We start by loading familiar libraries, plus the DBSCAN class:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.cluster import DBSCAN
from sklearn.impute import KNNImputer
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")