You're reading from Python Feature Engineering Cookbook: Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type Paperback

Published in Oct 2022

Publisher Packt

ISBN-13 9781804611302

Length 386 pages

Edition 2nd Edition

Languages

Python

Tools

Combine

Concepts

Machine Learning

Author (1):

Soledad Galli

Table of Contents (14) Chapters

Preface

1. Chapter 1: Imputing Missing Data

2. Chapter 2: Encoding Categorical Variables

3. Chapter 3: Transforming Numerical Variables

4. Chapter 4: Performing Variable Discretization

5. Chapter 5: Working with Outliers

6. Chapter 6: Extracting Features from Date and Time Variables

7. Chapter 7: Performing Feature Scaling

8. Chapter 8: Creating New Features

9. Chapter 9: Extracting Features from Relational Data with Featuretools

10. Chapter 10: Creating Features from a Time Series with tsfresh

11. Chapter 11: Extracting Features from Text Variables

12. Index

13. Other Books You May Enjoy

Technical requirements

In this chapter, we will use the pandas, NumPy, and Matplotlib Python libraries, as well as scikit-learn and Feature-engine. For guidelines on how to obtain these libraries, see the Technical requirements section of Chapter 1, Imputing Missing Data.

We will also use the open-source Category Encoders Python library, which can be installed using pip:

```
pip install category_encoders
```

To learn more about Category Encoders, visit the following link: https://contrib.scikit-learn.org/category_encoders/.

We will also use the Credit Approval dataset, which is available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/credit+approval.

To prepare the dataset, follow these steps:

Visit http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/ and click on crx.data to download the data:

Figure 2.1 – The index directory for the Credit Approval dataset

Save crx.data to the folder where you will run the following commands.

After downloading the data, open up a Jupyter Notebook and run the following commands.

Import the required libraries:

```
import random
import numpy as np
import pandas as pd
```

Load the data:

```
data = pd.read_csv("crx.data", header=None)
```

Create a list containing the variable names:
```
varnames = [f"A{s}" for s in range(1, 17)]
```
Add the variable names to the DataFrame:
```
data.columns = varnames
```
Replace the question marks in the dataset with NumPy NaN values:
```
data = data.replace("?", np.nan)
```
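
As an alternative to replacing the question marks after loading, the `na_values` parameter of `pd.read_csv` can flag them as missing at read time. The sketch below is not the book's method; it assumes three made-up rows in the comma-separated layout of crx.data rather than the real file:

```python
import io

import pandas as pd

# Three made-up rows that mimic the layout of crx.data (illustrative only).
sample = io.StringIO(
    "b,30.83,0,u\n"
    "a,?,4.46,u\n"
    "?,24.50,0.5,y\n"
)

# Treat "?" as missing while parsing, instead of replacing it afterwards.
sample_df = pd.read_csv(sample, header=None, na_values="?")

print(sample_df.isna().sum().tolist())  # [1, 1, 0, 0]
print(sample_df[1].dtype)               # float64
```

Because the question marks never enter the DataFrame as strings, a numerical column that contains them is parsed as float directly, which may make a later cast unnecessary.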

Cast the numerical variables A2 and A14, which are not stored as numbers after loading, to the float data type:

```
data["A2"] = data["A2"].astype("float")
data["A14"] = data["A14"].astype("float")
```

Encode the target variable as binary:

```
data["A16"] = data["A16"].map({"+": 1, "-": 0})
```
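
Note that `Series.map` with a dictionary returns NaN for any value that is not a key of the dictionary, so it can be worth checking the result. The following is a minimal sketch that uses toy Series as hypothetical stand-ins for column A16:

```python
import pandas as pd

mapping = {"+": 1, "-": 0}

# Toy stand-in for the target column (illustrative values only).
labels = pd.Series(["+", "-", "+", "-"])
encoded = labels.map(mapping)
print(encoded.tolist())  # [1, 0, 1, 0]

# A label with stray whitespace is not a key of the mapping,
# so it silently becomes NaN.
with_stray_label = pd.Series(["+", "-", " +"]).map(mapping)
print(int(with_stray_label.isna().sum()))  # 1
```

A quick `isna().sum()` on the encoded target shows whether every label was covered by the mapping.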

Rename the target variable:

```
data.rename(columns={"A16": "target"}, inplace=True)
```

Make lists that contain the names of the categorical and numerical variables:

```
cat_cols = [c for c in data.columns if data[c].dtypes == "O"]
num_cols = [c for c in data.columns if data[c].dtypes != "O"]
```
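
The same split can be sketched with `DataFrame.select_dtypes`, which selects columns by data type. This is an alternative to the list comprehensions, not the book's code, and the small DataFrame below is a hypothetical stand-in for the prepared data:

```python
import pandas as pd

# Toy stand-in for the prepared DataFrame (illustrative values only).
toy = pd.DataFrame(
    {
        "A1": ["a", "b"],
        "A2": [30.83, 58.67],
        "target": [1, 0],
    }
)

# Numerical columns are those with a numeric dtype; the rest are categorical.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(cat_cols)  # ['A1']
print(num_cols)  # ['A2', 'target']
```

As in the original lists, the binary-encoded target is an integer column and therefore lands among the numerical variables.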

Fill in the missing data, using 0 for the numerical variables and the string "Missing" for the categorical variables:

```
data[num_cols] = data[num_cols].fillna(0)
data[cat_cols] = data[cat_cols].fillna("Missing")
```
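
To confirm that no missing values remain after filling, the NaN count can be checked across the whole DataFrame. A minimal sketch on a toy DataFrame (a hypothetical stand-in for the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing value per column (illustrative only).
toy = pd.DataFrame(
    {
        "A1": ["a", np.nan, "b"],
        "A2": [30.83, np.nan, 24.5],
    }
)

toy[["A2"]] = toy[["A2"]].fillna(0)
toy[["A1"]] = toy[["A1"]].fillna("Missing")

# Total number of NaN values left in the DataFrame.
remaining = int(toy.isna().sum().sum())
print(remaining)  # 0
```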

Save the prepared data:

```
data.to_csv("credit_approval_uci.csv", index=False)
```
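
Passing `index=False` keeps the row index out of the file, so reloading the CSV yields the same columns that were saved. The round trip can be sketched as follows, assuming a toy DataFrame and a temporary file with the hypothetical name toy_credit.csv:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the prepared data (illustrative values only).
toy = pd.DataFrame(
    {"A1": ["a", "Missing"], "A2": [30.83, 0.0], "target": [1, 0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_credit.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['A1', 'A2', 'target']
print(reloaded["A1"].tolist())    # ['a', 'Missing']
```

The "Missing" placeholder is read back as an ordinary string, so the categorical columns contain no NaN values after reloading.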

You can find a Jupyter Notebook that contains these commands in this book’s GitHub repository at https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook-Second-Edition/blob/main/ch02-categorical-encoding/donwload-prepare-store-credit-approval-dataset.ipynb.

Note

Some libraries require that you have already imputed missing data, for which you can use any of the recipes from Chapter 1, Imputing Missing Data.
