You're reading from Python Machine Learning Cookbook 100 recipes that teach you how to perform various machine learning tasks in the real world

Product type Paperback

Published in Jun 2016

Publisher Packt

ISBN-13 9781786464477

Length 304 pages

Edition 1st Edition

Languages

Python

Concepts

Machine Learning

Authors (2):

Vahid Mirjalili

Prateek Joshi

View More author details

Table of Contents (14) Chapters

Preface

1. The Realm of Supervised Learning FREE CHAPTER

2. Constructing a Classifier

3. Predictive Modeling

4. Clustering with Unsupervised Learning

5. Building Recommendation Engines

6. Analyzing Text Data

7. Speech Recognition

8. Dissecting Time Series and Sequential Data

9. Image Content Analysis

10. Biometric Face Recognition

11. Deep Neural Networks

12. Visualizing Data

Index

Estimating housing prices

It's time to apply our knowledge to a real world problem. Let's apply all these principles to estimate the housing prices. This is one of the most popular examples that is used to understand regression, and it serves as a good entry point. This is intuitive and relatable, hence making it easier to understand concepts before we perform more complex things in machine learning. We will use a decision tree regressor with AdaBoost to solve this problem.

Getting ready

A decision tree is a tree where each node makes a simple decision that contributes to the final output. The leaf nodes represent the output values, and the branches represent the intermediate decisions that were made, based on input features. AdaBoost stands for Adaptive Boosting, and this is a technique that is used to boost the accuracy of the results from another system. This combines the outputs from different versions of the algorithms, called weak learners, using a weighted summation to get the final output. The information that's collected at each stage of the AdaBoost algorithm is fed back into the system so that the learners at the latter stages focus on training samples that are difficult to classify. This is the way it increases the accuracy of the system.

Using AdaBoost, we fit a regressor on the dataset. We compute the error and then fit the regressor on the same dataset again, based on this error estimate. We can think of this as fine-tuning of the regressor until the desired accuracy is achieved. You are given a dataset that contains various parameters that affect the price of a house. Our goal is to estimate the relationship between these parameters and the house price so that we can use this to estimate the price given unknown input parameters.

How to do it…

Create a new file called housing.py, and add the following lines:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt

There is a standard housing dataset that people tend to use to get started with machine learning. You can download it at https://archive.ics.uci.edu/ml/datasets/Housing. The good thing is that scikit-learn provides a function to directly load this dataset:
```
housing_data = datasets.load_boston() 
```
Each datapoint has 13 input parameters that affect the price of the house. You can access the input data using housing_data.data and the corresponding price using housing_data.target.
Let's separate this into input and output. To make this independent of the ordering of the data, let's shuffle it as well:
```
X, y = shuffle(housing_data.data, housing_data.target, random_state=7)
```
The random_state parameter controls how we shuffle the data so that we can have reproducible results. Let's divide the data into training and testing. We'll allocate 80% for training and 20% for testing:
```
num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
```
We are now ready to fit a decision tree regression model. Let's pick a tree with a maximum depth of 4, which means that we are not letting the tree become arbitrarily deep:
```
dt_regressor = DecisionTreeRegressor(max_depth=4)
dt_regressor.fit(X_train, y_train)
```
Let's also fit decision tree regression model with AdaBoost:
```
ab_regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
ab_regressor.fit(X_train, y_train)
```
This will help us compare the results and see how AdaBoost really boosts the performance of a decision tree regressor.

Let's evaluate the performance of decision tree regressor:

y_pred_dt = dt_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_dt)
evs = explained_variance_score(y_test, y_pred_dt) 
print "\n#### Decision Tree performance ####"
print "Mean squared error =", round(mse, 2)
print "Explained variance score =", round(evs, 2)

Now, let's evaluate the performance of AdaBoost:

y_pred_ab = ab_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_ab)
evs = explained_variance_score(y_test, y_pred_ab) 
print "\n#### AdaBoost performance ####"
print "Mean squared error =", round(mse, 2)
print "Explained variance score =", round(evs, 2)

Here is the output on the Terminal:

#### Decision Tree performance ####
Mean squared error = 14.79
Explained variance score = 0.82

#### AdaBoost performance ####
Mean squared error = 7.54
Explained variance score = 0.91

The error is lower and the variance score is closer to 1 when we use AdaBoost as shown in the preceding output.

You're reading from Python Machine Learning Cookbook 100 recipes that teach you how to perform various machine learning tasks in the real world

Table of Contents (14) Chapters

Estimating housing prices

Getting ready

How to do it…

Authors (2)

Personalised recommendations for you