Throughout this book, we will use many open source Python libraries for numerical computing. I recommend installing the free Anaconda Python distribution (https://www.anaconda.com/distribution/), which contains most of these packages. To install the Anaconda distribution, follow these steps:
- Visit the Anaconda website: https://www.anaconda.com/distribution/.
- Click the Download button.
- Download the latest Python 3 distribution that's appropriate for your operating system.
- Double-click the downloaded installer and follow the instructions that are provided.
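Once the installer finishes, a quick check such as the following can confirm that the libraries used in this chapter are available. This is a minimal sketch that relies only on the standard library (Python 3.8 or later). The helper name installed_versions is hypothetical, and the package names are assumed to be the distribution names under which each library is published.

```python
from importlib import metadata

# Distribution names of the libraries used in this chapter.
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "scikit-learn"]


def installed_versions(packages=PACKAGES):
    """Return a dict mapping each package to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can usually be added from the Anaconda prompt before you start the recipes.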
In this chapter, we will use pandas, NumPy, Matplotlib, seaborn, SciPy, and scikit-learn. pandas provides high-performance data structures and data analysis tools. NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions that operate on these arrays and on pandas DataFrames. Matplotlib and seaborn are the standard libraries for plotting and visualization. SciPy is the standard library for statistics and scientific computing, while scikit-learn is the standard library for machine learning.
To run the recipes in this chapter, I used Jupyter Notebooks since they are great for visualization and data analysis and make it easy to examine the output of each line of code. I recommend that you follow along with Jupyter Notebooks as well, although you can execute the recipes in other interfaces.
In this chapter, we will use two public datasets: the KDD-CUP-98 dataset and the Car Evaluation dataset. Both of these are available at the UCI Machine Learning Repository.
To download the KDD-CUP-98 dataset, follow these steps:
- Visit the following website: https://archive.ics.uci.edu/ml/machine-learning-databases/kddcup98-mld/epsilon_mirror/.
- Click the cup98lrn.zip link to begin the download.
- Unzip the file and save cup98LRN.txt in the same folder where you'll run the commands of the recipes.
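If you prefer to unzip the file from Python rather than with your operating system's tools, the step above can be sketched with the standard library as follows. The helper name extract_member is hypothetical, and the example assumes that cup98lrn.zip was saved in the working directory and contains cup98LRN.txt.

```python
import shutil
import zipfile
from pathlib import Path


def extract_member(zip_path, member_name, dest_dir="."):
    """Extract one file from a zip archive, matching its name case-insensitively."""
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Compare only the file name, ignoring any folder inside the archive.
            if Path(info.filename).name.lower() == member_name.lower():
                target = Path(dest_dir) / Path(info.filename).name
                # Stream the contents to disk so large files are not held in memory.
                with archive.open(info) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError(f"{member_name} not found in {zip_path}")


# Example (assumes cup98lrn.zip is in the working directory):
# extract_member("cup98lrn.zip", "cup98LRN.txt")
```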
To download the Car Evaluation dataset, follow these steps:
- Go to the UCI website: https://archive.ics.uci.edu/ml/machine-learning-databases/car/.
- Download the car.data file.
- Save the file in the same folder where you'll run the commands of the recipes.
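The car.data file has no header row, so column names need to be supplied when loading it. The following is a minimal sketch of one way to do so with pandas. The helper name load_car_data is hypothetical, the attribute names are assumed from the car.names description file in the same UCI folder, and "class" is simply the label chosen here for the target column.

```python
import pandas as pd

# Attribute names as described in car.names; "class" is our label for the target.
CAR_COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]


def load_car_data(path_or_buffer="car.data"):
    """Load the Car Evaluation data, assigning column names explicitly."""
    # header=None tells pandas that the first line is data, not column names.
    return pd.read_csv(path_or_buffer, header=None, names=CAR_COLUMNS)


# Example (assumes car.data is in the working directory):
# data = load_car_data()
# data.head()
```

Note that columns such as doors and persons contain values like 5more and more, so pandas loads them as strings rather than numbers.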
We will also use the Titanic dataset that's available at http://www.openML.org. To download and prepare the Titanic dataset, open a Jupyter Notebook and run the following commands:
import numpy as np
import pandas as pd


def get_first_cabin(row):
    # Keep only the first cabin when several are listed; return NaN if missing
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        return np.nan


url = "https://www.openml.org/data/get_csv/16826755/phpMYEkMl"
data = pd.read_csv(url)
# Missing values are encoded as '?' in this file
data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].apply(get_first_cabin)
data.to_csv('titanic.csv', index=False)
The preceding code block downloads a copy of the data from http://www.openML.org, replaces the '?' placeholders with missing values, keeps only the first cabin for each passenger, and stores the result as titanic.csv in the directory from which you run the commands.
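To see what the cabin step does without downloading anything, the following sketch applies a self-contained version of the get_first_cabin helper to a few hypothetical values. The sample strings are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def get_first_cabin(value):
    """Return the first cabin in a space-separated string, or NaN if unavailable."""
    try:
        return value.split()[0]
    except (AttributeError, IndexError):
        # AttributeError: value is NaN (a float); IndexError: empty string.
        return np.nan


# Hypothetical sample values: several cabins, a single cabin, and a missing value.
cabins = pd.Series(["C23 C25 C27", "B5", np.nan])
first = cabins.apply(get_first_cabin)
print(first.tolist())  # ['C23', 'B5', nan]
```

Entries that list several cabins are reduced to the first one, while missing entries remain missing.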