Throughout this book, we will use many open source Python libraries for numerical computing. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains most of these packages. To install the Anaconda distribution, follow these steps:
- Visit the Anaconda website: https://www.anaconda.com/distribution/.
- Click the Download button.
- Download the latest Python 3 distribution that's appropriate for your operating system.
- Double-click the downloaded installer and follow the instructions that are provided.
In this chapter, we will use pandas, NumPy, Matplotlib, seaborn, SciPy, and scikit-learn. pandas provides high-performance data structures and data analysis tools. NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions that operate on these arrays and on pandas dataframes. Matplotlib and seaborn are the standard libraries for plotting and visualization. SciPy is the standard library for statistics and scientific computing, while scikit-learn is the standard library for machine learning.
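If you want to confirm that these libraries are installed before working through the recipes, a minimal check such as the following can be run in any Python interpreter (the versions printed in your environment will differ from mine):
import matplotlib
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import sklearn

# Print the name and installed version of each library to confirm the setup
for library in (np, pd, matplotlib, sns, scipy, sklearn):
    print(library.__name__, library.__version__)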
To run the recipes in this chapter, I used Jupyter Notebooks since they are great for visualization and data analysis and make it easy to examine the output of each line of code. I recommend that you follow along with Jupyter Notebooks as well, although you can execute the recipes in other interfaces.
In this chapter, we will use two public datasets: the KDD-CUP-98 dataset and the Car Evaluation dataset. Both of these are available at the UCI Machine Learning Repository.
To download the KDD-CUP-98 dataset, follow these steps:
- Visit the following website: https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/.
- Click the cup98lrn.zip link to begin the download.
- Unzip the file and save cup98LRN.txt in the same folder where you'll run the commands of the recipes.
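Once the file is unzipped, you can verify that it loads correctly with pandas. This is a minimal sketch, assuming cup98LRN.txt sits in the folder from which you run the recipes and is comma-separated with a header row, as in the UCI mirror; low_memory=False is optional and only silences mixed-type warnings for this very wide file:
import pandas as pd

# Load the learning dataset and inspect its dimensions
kdd = pd.read_csv('cup98LRN.txt', low_memory=False)
print(kdd.shape)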
To download the Car Evaluation dataset, follow these steps:
- Go to the UCI website: https://archive.ics.uci.edu/ml/machine-learning-databases/car/.
- Download the car.data file.
- Save the file in the same folder where you'll run the commands of the recipes.
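The car.data file contains no header row, so you need to supply the column names when loading it. Here is a minimal sketch, using the attribute names listed in the dataset's documentation (car.names) on the UCI repository:
import pandas as pd

# car.data has no header row, so we pass the attribute names explicitly
col_names = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car = pd.read_csv('car.data', header=None, names=col_names)
print(car.head())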
We will also use the Titanic dataset that's available at http://www.openML.org. To download and prepare the Titanic dataset, open a Jupyter Notebook and run the following commands:
import numpy as np
import pandas as pd

def get_first_cabin(row):
    # Keep only the first cabin when a passenger has several listed
    try:
        return row.split()[0]
    except:
        # Missing cabin values are not strings, so splitting fails
        return np.nan

# Download the data from openML and replace the '?' placeholders with NaN
url = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
data = pd.read_csv(url)
data = data.replace('?', np.nan)

# Retain the first cabin only, then save the prepared dataset
data['cabin'] = data['cabin'].apply(get_first_cabin)
data.to_csv('titanic.csv', index=False)
The preceding code block will download a copy of the data from http://www.openML.org and store it as a titanic.csv file in the same directory from where you execute the commands.
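Once the file has been created, you can load it back at the beginning of any recipe; a minimal sketch:
import pandas as pd

# Load the prepared Titanic data from the current working directory
data = pd.read_csv('titanic.csv')
print(data.head())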