Python Feature Engineering Cookbook

Over 70 recipes for creating, engineering, and transforming features to build machine learning models

Product type: Paperback
Published: Jan 2020
Publisher: Packt
ISBN-13: 9781789806311
Length: 372 pages
Edition: 1st Edition
Author: Soledad Galli

Table of Contents

  Preface
  1. Foreseeing Variable Problems When Building ML Models
  2. Imputing Missing Data
  3. Encoding Categorical Variables
  4. Transforming Numerical Variables
  5. Performing Variable Discretization
  6. Working with Outliers
  7. Deriving Features from Dates and Time Variables
  8. Performing Feature Scaling
  9. Applying Mathematical Computations to Features
  10. Creating Features with Transactional and Time Series Data
  11. Extracting Features from Text Variables
  12. Other Books You May Enjoy

Technical requirements

Throughout this book, we will use many open source Python libraries for numerical computing. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains most of these packages. To install the Anaconda distribution, follow these steps:

  1. Visit the Anaconda website: https://www.anaconda.com/distribution/.
  2. Click the Download button.
  3. Download the latest Python 3 distribution that's appropriate for your operating system.
  4. Double-click the downloaded installer and follow the instructions that are provided.

The recipes in this book were written in Python 3.7. However, they should work in Python 3.5 and above. Check that you are using similar or higher versions of the numerical libraries we'll be using, that is, NumPy, pandas, scikit-learn, and others. The versions of these libraries are indicated in the requirement.txt file in the accompanying GitHub repository (https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook).
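One way to check which versions are installed is sketched below. It is a minimal example, not part of the book's code: the package list mirrors the libraries named in this section, the helper name installed_versions is hypothetical, and importlib.metadata requires Python 3.8 or later, so on older interpreters you would inspect each library's __version__ attribute instead.

```python
import sys
from importlib import metadata

# Distribution names as published on PyPI. The list mirrors the libraries
# named in the text; it is an assumption, not the book's requirements file.
PACKAGES = ["numpy", "pandas", "scikit-learn", "scipy", "matplotlib", "seaborn"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("Python", sys.version.split()[0])
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Compare the printed versions with those listed in the repository before running the recipes.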

In this chapter, we will use pandas, NumPy, Matplotlib, seaborn, SciPy, and scikit-learn. pandas provides high-performance data structures and analysis tools for tabular data. NumPy provides support for large, multi-dimensional arrays and matrices and contains a large collection of mathematical functions that operate over these arrays and over pandas DataFrames. Matplotlib and seaborn are widely used libraries for plotting and visualization. SciPy is a standard library for statistics and scientific computing, while scikit-learn is a standard library for machine learning.

To run the recipes in this chapter, I used Jupyter Notebooks since they are great for visualization and data analysis and make it easy to examine the output of each line of code. I recommend that you follow along with Jupyter Notebooks as well, although you can execute the recipes in other interfaces.

The recipe commands can be run as a .py script from a command prompt (such as the Anaconda Prompt or the Mac Terminal), from an IDE such as Spyder or PyCharm, or from Jupyter Notebooks, as in the accompanying GitHub repository (https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook).

In this chapter, we will use two public datasets: the KDD-CUP-98 dataset and the Car Evaluation dataset. Both of these are available at the UCI Machine Learning Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

To download the KDD-CUP-98 dataset, follow these steps:

  1. Visit the following website: https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/.
  2. Click the cup98lrn.zip link to begin the download.
  3. Unzip the file and save cup98LRN.txt in the same folder where you'll run the commands of the recipes.
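If you prefer to unzip the archive from Python rather than by hand, the extraction step could look like the following sketch. The helper name extract_member is hypothetical, and the example assumes the archive has already been downloaded into the working folder. The match on the file name ignores case because the archive and the text file are capitalized differently.

```python
import zipfile
from pathlib import Path


def extract_member(zip_path, member_name, dest_dir="."):
    """Extract the archive member whose file name matches member_name.

    The comparison ignores case. Returns the path of the extracted file.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if Path(name).name.lower() == member_name.lower():
                return Path(archive.extract(name, path=dest_dir))
    raise FileNotFoundError(f"{member_name} not found in {zip_path}")


# Hypothetical usage, once cup98lrn.zip is in the working folder:
# extract_member("cup98lrn.zip", "cup98LRN.txt")
```

The function raises FileNotFoundError if the archive does not contain the requested file, which makes a wrong download easy to spot.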

To download the Car Evaluation dataset, follow these steps:

  1. Go to the UCI website: https://archive.ics.uci.edu/ml/machine-learning-databases/car/.
  2. Download the car.data file.
  3. Save the file in the same folder where you'll run the commands of the recipes.
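To confirm that the file was saved correctly, a quick sanity check such as the one below may help. It is only a sketch that uses the standard library: car.data has no header row, so the column names are an assumption taken from the dataset's documentation, and the helper names read_car_data and summarize are hypothetical.

```python
import csv
from collections import Counter

# Assumed column names; car.data itself does not contain a header row.
COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]


def read_car_data(path="car.data"):
    """Read car.data into a list of dicts, one per non-empty row."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, fieldnames=COLUMNS)
        return list(reader)


def summarize(rows):
    """Count how often each value of the 'class' column occurs."""
    return Counter(row["class"] for row in rows)


# Hypothetical usage, once car.data is in the working folder:
# rows = read_car_data()
# print(len(rows), summarize(rows))
```

With pandas installed, pd.read_csv("car.data", header=None, names=COLUMNS) loads the same file into a DataFrame.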

We will also use the Titanic dataset that's available at http://www.openML.org. To download and prepare the Titanic dataset, open a Jupyter Notebook and run the following commands:

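import numpy as np
import pandas as pd


def get_first_cabin(row):
    # Keep only the first cabin when several are listed in one field.
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        # Missing values are floats (NaN) and have no split() method.
        return np.nan


# Download the raw Titanic data from OpenML.
url = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
data = pd.read_csv(url)

# The raw file marks missing values with '?'.
data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].apply(get_first_cabin)
data.to_csv('titanic.csv', index=False)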

The preceding code block downloads a copy of the data from http://www.openML.org, replaces the '?' placeholders with NaN, keeps only the first cabin listed for each passenger, and saves the result as titanic.csv in the directory from which you run the commands.
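The cabin-cleaning step can be tried in isolation, without downloading anything. The sketch below reproduces the same logic with the standard library only. The sample values are made up for illustration, and the name first_cabin is hypothetical.

```python
import math


def first_cabin(value):
    """Return the first whitespace-separated token, or NaN if there is none."""
    try:
        return value.split()[0]
    except (AttributeError, IndexError):
        # Non-string input (such as a float NaN) or an empty string.
        return math.nan


# Illustrative values: several cabins, a single cabin, and a missing entry.
samples = ["C23 C25 C27", "B5", math.nan]
cleaned = [first_cabin(value) for value in samples]
print(cleaned)  # ['C23', 'B5', nan]
```

An entry that lists several cabins is reduced to the first one, and missing entries stay missing.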

There is a Jupyter Notebook with instructions on how to download and prepare the Titanic dataset in the accompanying GitHub repository: https://github.com/PacktPublishing/Python-Feature-Engineering-Cookbook/blob/master/Chapter01/DataPrep_Titanic.ipynb.