Data Cleaning and Exploration with Machine Learning

Get to grips with machine learning techniques to achieve sparkling-clean data quickly

Product type: Paperback
Published: August 2022
Publisher: Packt
ISBN-13: 9781803241678
Length: 542 pages
Edition: 1st Edition
Author: Michael Walker
Table of Contents

Preface
Section 1 – Data Cleaning and Machine Learning Algorithms
Chapter 1: Examining the Distribution of Features and Targets
Chapter 2: Examining Bivariate and Multivariate Relationships between Features and Targets
Chapter 3: Identifying and Fixing Missing Values
Section 2 – Preprocessing, Feature Selection, and Sampling
Chapter 4: Encoding, Transforming, and Scaling Features
Chapter 5: Feature Selection
Chapter 6: Preparing for Model Evaluation
Section 3 – Modeling Continuous Targets with Supervised Learning
Chapter 7: Linear Regression Models
Chapter 8: Support Vector Regression
Chapter 9: K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Chapter 10: Logistic Regression
Chapter 11: Decision Trees and Random Forest Classification
Chapter 12: K-Nearest Neighbors for Classification
Chapter 13: Support Vector Machine Classification
Chapter 14: Naïve Bayes Classification
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning
Chapter 15: Principal Component Analysis
Chapter 16: K-Means and DBSCAN Clustering
Other Books You May Enjoy

What this book covers

Chapter 1, Examining the Distribution of Features and Targets, explores using common NumPy and pandas techniques to get a better sense of the attributes of our data. We will generate summary statistics, such as the mean, minimum, maximum, and standard deviation, and count the number of missing values. We will also create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.
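
As a brief illustration of this kind of profiling, the following sketch uses pandas and Matplotlib; the dataset path and the wageincome column are hypothetical placeholders, not the book's own data.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset and column name, for illustration only
nls97 = pd.read_csv("nls97.csv")

# Summary statistics and a count of missing values
print(nls97["wageincome"].agg(["mean", "min", "max", "std"]))
print(nls97["wageincome"].isnull().sum())

# Histogram and boxplot to inspect the distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
nls97["wageincome"].plot.hist(ax=axes[0], title="Histogram")
nls97["wageincome"].plot.box(ax=axes[1], title="Boxplot")
plt.show()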

Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets, focuses on the correlation between possible features and target variables. We will use pandas methods for bivariate analysis, and Matplotlib for visualizations. We will discuss the implications of what we find for feature engineering and modeling. We will also use multivariate techniques in this chapter to understand the relationships between features.
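
A minimal sketch of this kind of bivariate analysis, assuming a hypothetical DataFrame with a continuous target column named target and a feature named feature1:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("features.csv")  # hypothetical file and column names

# Correlation of each numeric feature with the target
print(df.corr(numeric_only=True)["target"].sort_values(ascending=False))

# Scatter plot for one feature/target pair
df.plot.scatter(x="feature1", y="target")
plt.show()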

Chapter 3, Identifying and Fixing Missing Values, goes over techniques for identifying missing values for each feature or target, and for identifying observations where values for a large number of the features are absent. We will explore strategies for imputing values, such as setting values to the overall mean, to the mean for a given category, or forward filling. We will also examine multivariate techniques for imputing missing values and discuss when they are appropriate.
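
The following sketch shows the identification and simple imputation strategies described above; the file and column names (income, region) are hypothetical.

import pandas as pd

df = pd.read_csv("healthdata.csv")  # hypothetical file and column names

# Missing counts per feature, and a count of missing features per observation
print(df.isnull().sum())
print(df.isnull().sum(axis=1).value_counts())

# Imputation with the overall mean, a group mean, and forward filling
df["income_mean"] = df["income"].fillna(df["income"].mean())
df["income_bygroup"] = df["income"].fillna(
    df.groupby("region")["income"].transform("mean"))
df["income_ffill"] = df["income"].ffill()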

Chapter 4, Encoding, Transforming, and Scaling Features, covers a range of feature engineering techniques. We will use tools to drop redundant or highly correlated features. We will explore the most common kinds of encoding – one-hot, ordinal, and hashing encoding. We will also use transformations to improve the distribution of our features. Finally, we will use common binning and scaling approaches to address skew, kurtosis, and outliers, and to adjust for features with widely different ranges.
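
A short sketch of one-hot encoding, ordinal encoding, and scaling; the column names and the choice of scikit-learn transformers here are illustrative assumptions rather than the book's exact recipes.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.read_csv("features.csv")  # hypothetical file and column names

# One-hot encode a nominal feature
dummies = pd.get_dummies(df["maritalstatus"], prefix="marital")

# Ordinal encode an ordered categorical feature
df["educlevel_enc"] = OrdinalEncoder().fit_transform(df[["educlevel"]]).flatten()

# Standardize continuous features with very different ranges
df[["weeksworked", "income"]] = StandardScaler().fit_transform(
    df[["weeksworked", "income"]])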

Chapter 5, Feature Selection, goes over a number of feature selection methods, from filter to wrapper to embedded methods. We will explore how they work with categorical and continuous targets. For wrapper and embedded methods, we will consider how well they work with different algorithms.
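
As a rough illustration, the sketch below pairs one filter method with one wrapper method using scikit-learn and synthetic data; the specific functions and parameter values are assumptions for illustration.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Filter method: keep the 5 features with the strongest univariate association
filt = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Wrapper method: recursive feature elimination with a linear model
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)

print(filt.get_support())
print(rfe.get_support())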

Chapter 6, Preparing for Model Evaluation, sees us build our first full-fledged pipeline, separating our data into testing and training datasets, and learning how to do preprocessing without data leakage. We will implement k-fold cross-validation and look more closely into assessing the performance of our models.
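
A minimal sketch of such a pipeline, assuming synthetic data and scikit-learn; putting the scaler inside the pipeline is what keeps test-fold statistics out of training.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scaling is fit separately within each training fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
scores = cross_val_score(
    pipe, X_train, y_train, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())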

Chapter 7, Linear Regression Models, is the first of several chapters on building regression models with an old favorite of many data scientists, linear regression. We will run a classical linear model while also examining the qualities of a feature space that make it a good candidate for a linear model. We will explore how to improve linear models, when necessary, with regularization and transformations. We will look into stochastic gradient descent as an alternative to ordinary least squares (OLS) optimization. We will also learn how to do hyperparameter tuning with grid searches.
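
The sketch below shows a regularized linear model tuned with a grid search, plus stochastic gradient descent as an alternative; the data is synthetic and the parameter grid is illustrative.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=2.0, random_state=0)

# Ridge regression with the regularization strength chosen by grid search
pipe = make_pipeline(StandardScaler(), Ridge())
grid = GridSearchCV(pipe, {"ridge__alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Stochastic gradient descent as an alternative to OLS
sgd = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
sgd.fit(X, y)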

Chapter 8, Support Vector Regression, discusses key support vector machine concepts and how they can be applied to regression problems. In particular, we will examine how concepts such as epsilon-insensitive tubes and soft margins can give us the flexibility to get the best fit possible, given our data and domain-related challenges. We will also explore, for the first time but definitely not the last, the very handy kernel trick, which allows us to model nonlinear relationships without transformations or increasing the number of features.
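
A brief sketch of support vector regression with a linear and an RBF kernel; the epsilon and C values are illustrative, not recommendations.

from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=4, noise=3.0, random_state=0)

# epsilon sets the width of the insensitive tube; C controls the soft margin
linear_svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=0.5, C=1.0))
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.5, C=1.0))
linear_svr.fit(X, y)
rbf_svr.fit(X, y)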

Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, explores some of the most popular non-parametric regression algorithms. We will discuss the advantages of each algorithm, when you might want to choose one over the other, and possible modeling challenges. These challenges include how to avoid underfitting and overfitting with careful adjustment of hyperparameters.
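
The following sketch compares these regressors on synthetic data; the hyperparameter values, such as the tree depth, are illustrative placeholders.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=2.0, random_state=0)

models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),  # limit depth to curb overfitting
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gboost": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())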

Chapter 10, Logistic Regression, is the first of several chapters on building classification models with logistic regression, an efficient algorithm with low bias. We will carefully examine the assumptions of logistic regression and discuss the attributes of a dataset and a modeling problem that make logistic regression a good choice. We will use regularization to address high variance and to handle a number of highly correlated predictors. We will extend the algorithm to multiclass problems with multinomial logistic regression. We will also discuss how to handle class imbalance for the first, but not the last, time.
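
A minimal sketch of a regularized logistic regression on a synthetic, imbalanced multiclass target; the settings shown are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0)

# L2 regularization (C controls its strength); class_weight helps with imbalance
lr = make_pipeline(StandardScaler(),
                   LogisticRegression(penalty="l2", C=1.0,
                                      class_weight="balanced", max_iter=1000))
lr.fit(X, y)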

Chapter 11, Decision Trees and Random Forest Classification, returns to the decision tree and random forest algorithms that were introduced in Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, this time dealing with classification problems. This gives us another opportunity to learn how to construct and interpret decision trees. We will adjust key hyperparameters, including the depth of trees, to avoid overfitting. We will then explore random forest and gradient boosted decision trees as good, lower variance alternatives to decision trees.
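
The sketch below fits a depth-limited decision tree and a random forest on synthetic data; the hyperparameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Limiting max_depth is one way to avoid overfitting a single tree
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # human-readable view of the splits

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())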

Chapter 12, K-Nearest Neighbors for Classification, returns to k-nearest neighbors (KNN) to handle both binary and multiclass modeling problems. We will discuss and demonstrate the advantages of KNN: how easy it is to build a no-frills model and the limited number of hyperparameters to adjust. By the end of the chapter, we will know both how to build a KNN model and when to consider it for our modeling.
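
As a rough sketch, the example below tunes the main KNN hyperparameter, the number of neighbors, with a grid search on synthetic data.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Scaling matters for KNN because it is distance-based
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)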

Chapter 13, Support Vector Machine Classification, explores different strategies for implementing support vector classification (SVC). We will use linear SVC, which can perform very well when our classes are linearly separable. We will then examine how to use the kernel trick to extend SVC to cases where the classes are not linearly separable. Finally, we will use one-versus-one and one-versus-rest classification to handle targets with more than two values.
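
A brief sketch of linear SVC, kernel SVC, and explicit one-versus-rest handling of a multiclass target, using synthetic data and illustrative settings.

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

linear_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # SVC uses one-vs-one internally
ovr_clf = OneVsRestClassifier(make_pipeline(StandardScaler(), SVC(kernel="rbf")))
for clf in (linear_clf, rbf_clf, ovr_clf):
    clf.fit(X, y)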

Chapter 14, Naïve Bayes Classification, discusses the fundamental assumptions of naïve Bayes and how the algorithm is used to tackle some of the modeling challenges we have already explored, as well as some new ones, such as text classification. We will consider when naïve Bayes is a good option and when it is not. We will also examine the interpretation of naïve Bayes models.
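
The sketch below applies multinomial naïve Bayes to a tiny, made-up text classification task; the documents and labels are invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free prize claim now", "meeting agenda attached",
        "win cash prize now", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Bag-of-words counts feed the multinomial naïve Bayes model
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(docs, labels)
print(nb.predict(["claim your free cash"]))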

Chapter 15, Principal Component Analysis, examines principal component analysis (PCA), including how it works and when we might want to use it. We will learn how to interpret the components created from PCA, including how each feature contributes to each component and how much of the variance is explained. We will learn how to visualize components and how to use components in subsequent analyses. We will also examine how to use kernels for PCA and when that might give us better results.
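
A minimal sketch of PCA and kernel PCA on synthetic, standardized data; the number of components and the kernel choice are illustrative.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=10, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
print(pca.components_)                # how each feature loads on each component

kpca = KernelPCA(n_components=3, kernel="rbf").fit(X_scaled)
X_kpca = kpca.transform(X_scaled)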

Chapter 16, K-Means and DBSCAN Clustering, explores two popular clustering techniques, k-means and density-based spatial clustering of applications with noise (DBSCAN). We will discuss the strengths of each approach and develop a sense of when to choose one clustering algorithm over the other. We will also learn how to evaluate our clusters and how to change hyperparameters to improve our model.
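
The sketch below runs k-means and DBSCAN on synthetic data and evaluates the k-means solution with a silhouette score; the hyperparameter values are illustrative.

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)  # -1 marks noise points

print(silhouette_score(X_scaled, km_labels))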
