Python Feature Engineering Cookbook

Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Imputing Missing Data

Missing data refers to the absence of values for certain observations and is an unavoidable problem in most data sources. Most scikit-learn estimators do not accept missing values as input, so we need to remove observations with missing data or transform them into permitted values. The act of replacing missing data with statistical estimates of the missing values is called imputation. The goal of any imputation technique is to produce a complete dataset that can be used to train machine learning models. There are multiple imputation techniques we can apply to our data. The choice of imputation technique depends on whether the data is missing at random, the proportion of missing values, and the machine learning model we intend to use. In this chapter, we will discuss several missing data imputation techniques.

This chapter...

Technical requirements

In this chapter, we will use the Python libraries: pandas, NumPy and scikit-learn. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains all these packages.

For details on how to install the Python Anaconda distribution, visit the Technical requirements section in Chapter 1, Foreseeing Variable Problems When Building ML Models.

We will also use the open source Python library Feature-engine, which I created and which can be installed using pip:

pip install feature-engine

To learn more about Feature-engine, visit the following sites:

Check that you have installed the right versions of the numerical Python libraries, which...

Removing observations with missing data

Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding those observations where the values in any of the variables are missing. CCA can be applied to categorical and numerical variables. CCA is quick and easy to implement and has the advantage that it preserves the distribution of the variables, provided the data is missing at random and only a small proportion of the data is missing. However, if data is missing across many variables, CCA may lead to the removal of a big portion of the dataset.
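
As a rough illustration of CCA, here is a minimal sketch using pandas. The toy DataFrame and its column names are hypothetical and are not the dataset used in the recipe; the point is only to show how much data is discarded when any variable is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["London", "Paris", np.nan, "Madrid"],
})

# Fraction of missing values per variable, to judge how much data CCA would discard.
missing_fraction = data.isnull().mean()

# Complete case analysis: keep only the rows with no missing value in any variable.
data_cca = data.dropna()

print(missing_fraction)
print(f"Rows before: {len(data)}, rows after: {len(data_cca)}")
```

In this toy case, half of the rows are lost even though each variable is missing only one value, which is the risk the paragraph above describes.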

How to do it...

Let's begin by loading pandas and the dataset:

  1. First, we'll import the pandas library:
import pandas...

Performing mean or median imputation

Mean or median imputation consists of replacing missing values with the variable mean or median. This can only be performed on numerical variables. The mean or the median is calculated using a train set, and these values are used to impute missing data in train and test sets, as well as in future data we intend to score with the machine learning model. Therefore, we need to store these mean and median values. Scikit-learn and Feature-engine transformers learn the parameters from the train set and store these parameters for future use. So, in this recipe, we will learn how to perform mean or median imputation using the scikit-learn and Feature-engine libraries, with pandas for comparison.
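
The learn-on-train, apply-everywhere workflow can be sketched with scikit-learn's SimpleImputer. This is a minimal sketch, not the recipe's code: the toy train and test sets and the column name income are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train and test sets with a single numerical variable.
X_train = pd.DataFrame({"income": [10.0, 20.0, np.nan, 30.0, 100.0]})
X_test = pd.DataFrame({"income": [np.nan, 50.0]})

# Learn the median from the train set only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# The learned value is stored in the statistics_ attribute for later use.
learned_median = imputer.statistics_[0]

# Apply the same stored value to both train and test sets.
X_train_t = imputer.transform(X_train)
X_test_t = imputer.transform(X_test)

print(learned_median)
```

Switching strategy to "mean" gives mean imputation with the same fit and transform steps.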

Use mean imputation if variables are normally distributed and median imputation otherwise. Mean and median imputation may distort the distribution of the...

Implementing mode or frequent category imputation

Mode imputation consists of replacing missing values with the mode. We normally use this procedure in categorical variables, hence the frequent category imputation name. Frequent categories are estimated using the train set and then used to impute values in train, test, and future datasets. Thus, we need to learn and store these parameters, which we can do using scikit-learn and Feature-engine's transformers; in the following recipe, we will learn how to do so.

If the percentage of missing values is high, frequent category imputation may distort the original distribution of categories.
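
A minimal pandas sketch of frequent category imputation follows. The toy data and the column name colour are hypothetical; the sketch only shows that the mode is learned from the train set and then reused on the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical train and test sets with a single categorical variable.
X_train = pd.DataFrame({"colour": ["blue", "red", "blue", np.nan, "green"]})
X_test = pd.DataFrame({"colour": [np.nan, "red"]})

# Learn the most frequent category from the train set (missing values are ignored).
frequent_category = X_train["colour"].mode()[0]

# Reuse the learned category to fill missing values in train and test sets.
X_train_t = X_train.fillna({"colour": frequent_category})
X_test_t = X_test.fillna({"colour": frequent_category})

print(frequent_category)
```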

How to do it...

To begin, let's make a few imports and prepare the data:

  1. Let's...

Replacing missing values with an arbitrary number

Arbitrary number imputation consists of replacing missing values with an arbitrary value. Some commonly used values include 999, 9999, or -1 for positive distributions. This method is suitable for numerical variables. A similar method for categorical variables will be discussed in the Capturing missing values in a bespoke category recipe.

When replacing missing values with an arbitrary number, we need to be careful not to select a value close to the mean or the median, or any other common value of the distribution.

Arbitrary number imputation can be used when data is not missing at random, when we are building non-linear models, and when the percentage of missing data is high. This imputation technique distorts the original variable distribution.
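
As an illustration, arbitrary number imputation can be sketched with scikit-learn's SimpleImputer and its constant strategy. The toy data, the column name balance, and the choice of 999 are assumptions for this example; in practice the value should sit well outside the observed range of the variable.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical positive-valued variable with missing entries.
X = pd.DataFrame({"balance": [12.5, np.nan, 40.0, np.nan]})

# Replace every missing value with the arbitrary number 999,
# chosen here because it is far from the observed values.
imputer = SimpleImputer(strategy="constant", fill_value=999)
X_t = imputer.fit_transform(X)

print(X_t)
```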

In this recipe, we will impute missing data by arbitrary numbers using pandas, scikit...

Capturing missing values in a bespoke category

Missing data in categorical variables can be treated as a category of its own, so it is common to replace missing values with the string "Missing". In this recipe, we will learn how to do so using pandas, scikit-learn, and Feature-engine.
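
A minimal pandas sketch of the idea, assuming a hypothetical column named occupation rather than the dataset loaded in the recipe:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical variable with missing entries.
X = pd.DataFrame({"occupation": ["teacher", np.nan, "nurse", np.nan]})

# Treat missing values as their own category by filling them with a label.
X_t = X.fillna({"occupation": "Missing"})

print(X_t["occupation"].value_counts())
```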

How to do it...

To proceed with the recipe, let's import the required tools and prepare the dataset:

  1. Import pandas and the required functions and classes from scikit-learn and Feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from feature_engine.missing_data_imputers import CategoricalVariableImputer
  2. Let's load the dataset:
data = pd.read_csv('creditApprovalUCI...

Replacing missing values with a value at the end of the distribution

Replacing missing values with a value at the end of the variable distribution is equivalent to replacing them with an arbitrary value, but instead of identifying the arbitrary values manually, these values are automatically selected as those at the very end of the variable distribution. The values that are used to replace missing information are estimated using the mean plus or minus three times the standard deviation if the variable is normally distributed, or the inter-quartile range (IQR) proximity rule otherwise. According to the IQR proximity rule, missing values will be replaced with the 75th percentile + (IQR * 1.5) at the right tail or with the 25th percentile - (IQR * 1.5) at the left tail. The IQR is given by the 75th percentile - the 25th percentile.
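
The two rules described above can be worked through on a small example. This is a sketch with a hypothetical pandas Series; it assumes pandas' default quantile interpolation and the sample standard deviation, and in a real project the values would be learned from the train set.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical variable with one missing value.
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan], name="x")

# IQR proximity rule (missing values are ignored when computing quantiles).
q1 = train.quantile(0.25)
q3 = train.quantile(0.75)
iqr = q3 - q1
right_tail = q3 + 1.5 * iqr
left_tail = q1 - 1.5 * iqr

# Gaussian rule: mean plus three times the standard deviation.
gaussian_right = train.mean() + 3 * train.std()

# Impute with the value at the right end of the distribution.
imputed = train.fillna(right_tail)

print(left_tail, right_tail, gaussian_right)
```

Here the quartiles are 2 and 4, so the IQR is 2 and the replacement values are 7 at the right tail and -1 at the left tail.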

Some users will also identify the minimum...

Implementing random sample imputation

Random sampling imputation consists of extracting random observations from the pool of available values in the variable. Random sampling imputation preserves the original distribution, which differs from the other imputation techniques we've discussed in this chapter and is suitable for numerical and categorical variables alike. In this recipe, we will implement random sample imputation with pandas and Feature-engine.
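
A minimal pandas sketch of random sample imputation is shown below. The toy data, the column name age, and the fixed random_state are assumptions for illustration; sampling with replacement is one possible choice, made here so that the sketch also works when missing values outnumber observed ones.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with two missing values.
X_train = pd.DataFrame({"age": [22.0, 35.0, 41.0, 58.0, np.nan, np.nan]})

# Pool of observed values to draw from.
observed = X_train["age"].dropna()
missing_mask = X_train["age"].isnull()
n_missing = int(missing_mask.sum())

# Draw one observed value per missing entry (seeded for reproducibility).
sampled = observed.sample(n=n_missing, replace=True, random_state=0)

# Re-index the sampled values so they align with the missing rows.
sampled.index = X_train.index[missing_mask]

X_train_t = X_train.copy()
X_train_t.loc[missing_mask, "age"] = sampled

print(X_train_t)
```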

How to do it...

Let's begin by importing the required libraries and tools and preparing the dataset:

  1. Let's import pandas, the train_test_split function from scikit-learn, and RandomSampleImputer from Feature-engine:
import pandas as pd
from...

Adding a missing value indicator variable

A missing indicator is a binary variable that specifies whether a value was missing for an observation (1) or not (0). It is common practice to replace missing observations with the mean, median, or mode while flagging those missing observations with a missing indicator, thus covering two angles: if the data was missing at random, this would be accounted for by the mean, median, or mode imputation, and if it wasn't, this would be captured by the missing indicator. In this recipe, we will learn how to add missing indicators using NumPy, scikit-learn, and Feature-engine.
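
The combination of a missing indicator and median imputation can be sketched with NumPy and pandas. The toy data, the column name income, and the indicator name income_NA are hypothetical; for brevity the median is computed on the same data, whereas in practice it would come from the train set.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical variable with missing entries.
X = pd.DataFrame({"income": [10.0, np.nan, 30.0, np.nan]})

# Add the binary indicator first: 1 where the value was missing, 0 otherwise.
X["income_NA"] = np.where(X["income"].isnull(), 1, 0)

# Then impute the original variable with its median.
X["income"] = X["income"].fillna(X["income"].median())

print(X)
```

The indicator has to be created before imputing, otherwise the information about which rows were missing is lost.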

Getting ready

For an example of the implementation of missing indicators, along with mean imputation...

Performing multivariate imputation by chained equations

Multivariate imputation methods, as opposed to univariate imputation, use the entire set of variables to estimate the missing values. In other words, the missing values of a variable are modeled based on the other variables in the dataset. Multivariate imputation by chained equations (MICE) is a multiple imputation technique that models each variable with missing values as a function of the remaining variables and uses that estimate for imputation. MICE has the following basic steps:

  1. A simple univariate imputation is performed for every variable with missing data, for example, median imputation.
  2. One specific variable is selected, say, var_1, and the missing values are set back to missing.
  3. A model that's used to predict var_1 is built based on the remaining variables in the dataset.
  4. The missing values...
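
The chained-equation idea outlined in the steps above can be tried with scikit-learn's IterativeImputer. This is a sketch under stated assumptions: the synthetic data is made up for the example, Bayesian ridge regression is just one possible estimator, and IterativeImputer returns a single imputed dataset rather than the multiple datasets produced by a full multiple imputation procedure. The class is still marked experimental in scikit-learn, hence the extra import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: the second variable depends strongly on the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Remove some values of the second variable.
X_missing = X.copy()
X_missing[:20, 1] = np.nan

# Start from median imputation, then model each variable with missing
# values as a function of the other variables, repeating up to max_iter rounds.
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    initial_strategy="median",
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

print(X_imputed[:5])
```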

Assembling an imputation pipeline with scikit-learn

Datasets often contain a mix of numerical and categorical variables. In addition, some variables may contain a few missing data points, while others will contain quite a big proportion. The mechanisms by which data is missing may also vary among variables. Thus, we may wish to perform different imputation procedures for different variables. In this recipe, we will learn how to perform different imputation procedures for different feature subsets using scikit-learn.
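
One way to apply different imputation procedures to different feature subsets is scikit-learn's ColumnTransformer. The sketch below is illustrative only: the toy data, the column names, and the choice of median imputation for numerical variables and a "Missing" label for the categorical variable are assumptions, not the recipe's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical dataset mixing numerical and categorical variables.
data = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [1000.0, 2000.0, np.nan],
    "city": ["London", np.nan, "Paris"],
})

numeric = ["age", "income"]
categorical = ["city"]

# Each tuple is (name, transformer, columns the transformer applies to).
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", SimpleImputer(strategy="constant", fill_value="Missing"), categorical),
])

# The output columns follow the order of the transformers listed above.
imputed = preprocessor.fit_transform(data)
imputed_df = pd.DataFrame(imputed, columns=numeric + categorical)

print(imputed_df)
```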

How to do it...

To proceed with the recipe, let's import the required libraries and classes and prepare the dataset:

  1. Let's import pandas and the required classes from scikit-learn:
import...

Assembling an imputation pipeline with Feature-engine

Feature-engine is an open source Python library that allows us to easily implement different imputation techniques for different feature subsets. Often, our datasets contain a mix of numerical and categorical variables, with few or many missing values. Therefore, we normally perform different imputation techniques on different variables, depending on the nature of the variable and the machine learning algorithm we want to build. With Feature-engine, we can assemble multiple imputation techniques in a single step, and in this recipe, we will learn how to do this.

How to do it...

Let's begin by importing the necessary Python libraries and preparing the data:

  1. Let's...

Key benefits

  • Discover solutions for feature generation, feature extraction, and feature selection
  • Uncover the end-to-end feature engineering process across continuous, discrete, and unstructured datasets
  • Implement modern feature extraction techniques using Python's pandas, scikit-learn, SciPy and NumPy libraries

Description

Feature engineering is invaluable for developing and enriching your machine learning models. In this cookbook, you will work with the best tools to streamline your feature engineering pipelines and techniques and to simplify and improve the quality of your code. Using Python libraries such as pandas, scikit-learn, Featuretools, and Feature-engine, you’ll learn how to work with both continuous and discrete datasets and be able to transform features from unstructured datasets. You will develop the skills necessary to select the best features as well as the most suitable extraction techniques. This book will cover Python recipes that will help you automate feature engineering to simplify complex processes. You’ll also get to grips with different feature engineering strategies, such as the Box-Cox transform, power transform, and log transform across machine learning, reinforcement learning, and natural language processing (NLP) domains. By the end of this book, you’ll have discovered tips and practical solutions to all of your feature engineering problems.

Who is this book for?

This book is for machine learning professionals, AI engineers, data scientists, and NLP and reinforcement learning engineers who want to optimize and enrich their machine learning models with the best features. Knowledge of machine learning and Python coding will assist you with understanding the concepts covered in this book.

What you will learn

  • Simplify your feature engineering pipelines with powerful Python packages
  • Get to grips with imputing missing values
  • Encode categorical variables with a wide set of techniques
  • Extract insights from text quickly and effortlessly
  • Develop features from transactional data and time series data
  • Derive new features by combining existing variables
  • Understand how to transform, discretize, and scale your variables
  • Create informative variables from date and time

Product Details

Publication date: Jan 22, 2020
Length: 372 pages
Edition: 1st
Language: English
ISBN-13: 9781789806311

Table of Contents

12 Chapters
Foreseeing Variable Problems When Building ML Models
Imputing Missing Data
Encoding Categorical Variables
Transforming Numerical Variables
Performing Variable Discretization
Working with Outliers
Deriving Features from Dates and Time Variables
Performing Feature Scaling
Applying Mathematical Computations to Features
Creating Features with Transactional and Time Series Data
Extracting Features from Text Variables
Other Books You May Enjoy

Customer reviews

Rating: 3.6 out of 5 (9 ratings)
5 star 44.4%
4 star 22.2%
3 star 0%
2 star 11.1%
1 star 22.2%

Top Reviews

Amazon Customer Nov 14, 2020
Rating: 5 out of 5
Thorough recollection of feature transformations to tackle multiple aspects of data quality and to extract features from different data formats, like text, time series and transactions. Great resource to have at hand when in front of a new dataset.
Verified Amazon review
Omar Pasha Mar 26, 2021
Rating: 5 out of 5
I was exactly what I needed to know!
Verified Amazon review
Shorsh Nov 11, 2020
Rating: 5 out of 5
This book contains all the recipes that are needed for any aspiring data scientist. It contains very good examples that are easy to follow with a good theory explanation on what you are doing. Some basic python knowledge is needed before hand as it wont start from scratch, it is assumed that you have already faced issues with your feature engineering pipelines. The author of this book has created a master piece of art with the feature engineering library, very easy to use and with awesome results. This book became one of my favorite ones very fast!! A must read if you are pursuing a DS/ML/AI position
Verified Amazon review
Kevin Nov 29, 2022
Rating: 5 out of 5
As other reviews have stated the book delivers what it says it will; Python code that generates a lot of feature-engineering. I find this book to be fantastic, and Sole's work overall, as it gives life to new feature-engineering possibilities and does it fast. Long gone are the days of writing your own custom transformers or unique time-series features. This book automates a lot of that headache and will absolutely be the first reference I go to when I need to handle a new feature. I personally hadn't dealt with tsfresh prior to reading through and it brought to life instantaneous time-series features I no longer have to write scripts for. A very happy customer on that knowledge alone! Per usual, Sole continues to advance the ML community for the betterment of all.
Verified Amazon review
jml Sep 23, 2020
Rating: 4 out of 5
The Python Feature Engineering Cookbook (PFEC) delivers exactly what the name implies. It’s a collection of recipes targeted at specific tasks; if you’re working in an AI or ML environment and have a need to massage variable data, handle math functions, or normalize data strings, this book will quickly earn a place on your shelf. Each recipe is presented in a standardized format that walks you through the theory and implementation of the code performing the function. Short introductions and appropriate external references provide background for every task, and as long as you have a reasonable familiarity with pandas, scikit-learn, Numpy, Python, and Jupyter, you’ll find a number of uses for the techniques covered. It’s not designed to be a tutorial for those just starting out with machine learning, and isn’t written in a style that invites casual reading. The material tends toward the dry side. While the author does an admirable job of distilling the necessary information into the basic framework of prepare-perform-review, PFEC definitely falls into the reference book category as opposed to being a guide for the uninitiated. In short, you’ll want to have PFEC around if you’re involved in a project that requires hands-on data manipulation in a Python machine-learning environment. Paired with a good guide to ML basics and implementation, it’ll keep you from reinventing quite a few wheels.
Verified Amazon review