Estimating housing prices
It's time to apply our knowledge to a real-world problem. Let's apply these principles to estimate housing prices. This is one of the most popular examples used to understand regression, and it serves as a good entry point. It is intuitive and relatable, which makes it easier to grasp the concepts before we move on to more complex techniques in machine learning. We will use a decision tree regressor with AdaBoost to solve this problem.
Getting ready
A decision tree is a tree where each node makes a simple decision that contributes to the final output. The leaf nodes represent the output values, and the branches represent the intermediate decisions that were made based on the input features. AdaBoost stands for Adaptive Boosting, a technique used to boost the accuracy of the results from another system. It combines the outputs of multiple versions of an algorithm, called weak learners, using a weighted summation to get the final output. The information collected at each stage of the AdaBoost algorithm is fed back into the system so that learners at later stages focus on the training samples that are difficult to classify. This is how it increases the accuracy of the system.
Using AdaBoost, we fit a regressor on the dataset. We compute the error and then fit the regressor on the same dataset again, based on this error estimate. We can think of this as fine-tuning the regressor until the desired accuracy is achieved. You are given a dataset that contains various parameters that affect the price of a house. Our goal is to estimate the relationship between these parameters and the house price so that we can use it to estimate the price for new, unseen input parameters.
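To make the boosting idea concrete, here is a minimal sketch of an AdaBoost-style regression loop. Treat it as illustrative only: the function name is made up, and the AdaBoostRegressor we use later in this recipe implements the full AdaBoost.R2 algorithm, which combines the weak learners with a weighted median rather than the weighted average used here:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_sketch(X, y, n_rounds=10, seed=7):
    # Start with uniform weights over the training samples
    rng = np.random.RandomState(seed)
    n = len(X)
    sample_weights = np.full(n, 1.0 / n)
    learners, learner_weights = [], []
    for _ in range(n_rounds):
        # Fit a weak learner on a resample drawn from the current weights,
        # so hard-to-predict samples appear more often in later rounds
        idx = rng.choice(n, size=n, replace=True, p=sample_weights)
        learner = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
        # Relative error of each training sample, scaled into [0, 1]
        error = np.abs(learner.predict(X) - y)
        error = error / max(error.max(), 1e-12)
        avg_loss = np.sum(sample_weights * error)
        if avg_loss >= 0.5:
            break  # learner is no better than chance; stop boosting
        beta = avg_loss / (1.0 - avg_loss)
        # Down-weight the samples this learner already predicts well
        sample_weights *= beta ** (1.0 - error)
        sample_weights /= sample_weights.sum()
        learners.append(learner)
        learner_weights.append(np.log(1.0 / beta))
    def predict(X_new):
        # Weighted combination of the weak learners' outputs
        preds = np.array([l.predict(X_new) for l in learners])
        w = np.array(learner_weights)
        return (w[:, None] * preds).sum(axis=0) / w.sum()
    return predict

The detail to take away is the reweighting step: samples with a large relative error keep their weight, so learners in later rounds concentrate on exactly those samples.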
How to do it…
- Create a new file called housing.py, and add the following lines:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
- There is a standard housing dataset that people tend to use to get started with machine learning. You can download it at https://archive.ics.uci.edu/ml/datasets/Housing. The good thing is that scikit-learn provides a function to directly load this dataset:
housing_data = datasets.load_boston()
Each datapoint has 13 input parameters that affect the price of the house. You can access the input data using housing_data.data and the corresponding price using housing_data.target.
- Let's separate this into input and output. To make this independent of the ordering of the data, let's shuffle it as well:
X, y = shuffle(housing_data.data, housing_data.target, random_state=7)
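Before moving on, it's worth a quick sanity check on what we just loaded and shuffled. The shapes below are what this dataset is documented to contain, and feature_names is an attribute scikit-learn attaches to it:

print(X.shape)                     # (506, 13): 506 samples, 13 features
print(y.shape)                     # (506,): one price per sample
print(housing_data.feature_names)  # names such as 'CRIM', 'ZN', ..., 'LSTAT'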
- The random_state parameter controls how we shuffle the data so that we can have reproducible results. Let's divide the data into training and testing datasets. We'll allocate 80% for training and 20% for testing:
num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]
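As a side note, recent versions of scikit-learn provide a helper that performs the shuffle and the 80/20 split in one call. The following is equivalent to the two steps above, although the rest of this recipe sticks with the manual split:

from sklearn.model_selection import train_test_split

# Shuffles and splits in one call; 20% of the data is held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    housing_data.data, housing_data.target, test_size=0.2, random_state=7)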
- We are now ready to fit a decision tree regression model. Let's pick a tree with a maximum depth of 4, which means that we are not letting the tree become arbitrarily deep:
dt_regressor = DecisionTreeRegressor(max_depth=4)
dt_regressor.fit(X_train, y_train)
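If you are curious how large the fitted tree actually is, recent scikit-learn versions expose accessor methods for this; this check is an aside and not required by the recipe:

print(dt_regressor.get_depth())     # at most 4, because of max_depth
print(dt_regressor.get_n_leaves())  # number of leaf nodes in the tree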
- Let's also fit a decision tree regression model with AdaBoost:
ab_regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
ab_regressor.fit(X_train, y_train)
This will help us compare the results and see how AdaBoost really boosts the performance of a decision tree regressor.
- Let's evaluate the performance of the decision tree regressor:
y_pred_dt = dt_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_dt)
evs = explained_variance_score(y_test, y_pred_dt)
print("\n#### Decision Tree performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))
- Now, let's evaluate the performance of AdaBoost:
y_pred_ab = ab_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_ab)
evs = explained_variance_score(y_test, y_pred_ab)
print("\n#### AdaBoost performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))
Here is the output on the Terminal:
#### Decision Tree performance ####
Mean squared error = 14.79
Explained variance score = 0.82

#### AdaBoost performance ####
Mean squared error = 7.54
Explained variance score = 0.91
As the preceding output shows, the error is lower and the explained variance score is closer to 1 when we use AdaBoost.
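The matplotlib import at the top of housing.py has not been used yet; one quick way to put it to work and see the improvement visually is to plot predicted against actual prices for both models. This plot is a suggestion, not part of the original recipe's output:

# Predicted vs. actual prices: points closer to the diagonal are better
plt.scatter(y_test, y_pred_dt, alpha=0.6, label='Decision Tree')
plt.scatter(y_test, y_pred_ab, alpha=0.6, label='AdaBoost')
lims = [min(y_test), max(y_test)]
plt.plot(lims, lims, 'k--', label='Perfect prediction')
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.legend()
plt.show()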