Regressing with neural networks

We will again use our diamonds dataset. Although this is a small dataset and an MLP is perhaps a more complicated model than this problem needs, there is no reason we could not use one to solve it. In addition, remember that back when we defined the hypothetical problem, we established that the stakeholders wanted a model whose predictions were as accurate as possible, so let's see how accurate we can get with an MLP. As always, let's import the libraries we will use:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline

Now, since we are beginning from scratch, load and prepare the dataset:

DATA_DIR = '../data'
FILE_NAME = 'diamonds.csv'
data_path = os.path.join(DATA_DIR, FILE_NAME)
diamonds = pd.read_csv(data_path)
## Preparation from Chapter 2
# Keep only the rows where at least one of the x and y dimensions is positive
diamonds = diamonds.loc[(diamonds['x']>0) | (diamonds['y']>0)]
# Impute the x and z values of observation 11182 with the column medians
diamonds.loc[11182, 'x'] = diamonds['x'].median()
diamonds.loc[11182, 'z'] = diamonds['z'].median()
# Remove the extreme outliers in y and z
diamonds = diamonds.loc[~((diamonds['y'] > 30) | (diamonds['z'] > 30))]
# One-hot encode the categorical features, dropping the first level of each
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['cut'], prefix='cut', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['color'], prefix='color', drop_first=True)], axis=1)
diamonds = pd.concat([diamonds, pd.get_dummies(diamonds['clarity'], prefix='clarity', drop_first=True)], axis=1)
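
If you want to check what the preparation produced, you can take a quick look at the resulting DataFrame; this is just an optional sanity check, not part of the original recipe:

# Optional check: shape of the prepared data and the new dummy columns
print(diamonds.shape)
print([col for col in diamonds.columns if col.startswith(('cut_', 'color_', 'clarity_'))])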

Now, let's apply the transformations we did to this dataset before modeling:

First, split the data into training and testing sets:

X = diamonds.drop(['cut','color','clarity','price'], axis=1)
y = diamonds['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)

Perform dimensionality reduction on x, y, and z with PCA:

from sklearn.decomposition import PCA
pca = PCA(n_components=1, random_state=123)
pca.fit(X_train[['x','y','z']])
X_train['dim_index'] = pca.transform(X_train[['x','y','z']]).flatten()
X_train.drop(['x','y','z'], axis=1, inplace=True)
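
If you are curious about how much of the variation in x, y, and z is captured by the single principal component, you can inspect the fitted PCA object; this is just an optional check (the exact value depends on the data split):

# Proportion of the variance in x, y, and z explained by the single component
print(pca.explained_variance_ratio_)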

And here is the last step—standardize the numerical features:

numerical_features = ['carat', 'depth', 'table', 'dim_index']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train[numerical_features])
X_train.loc[:, numerical_features] = scaler.transform(X_train[numerical_features])

We are ready to build our neural network model.

Building the MLP for predicting diamond prices

As we said before, neural network models consist of a number of sequential layers, which is why Keras has a class called Sequential that we can use to instantiate a neural network model:

from keras.models import Sequential
nn_reg = Sequential()

Good! We have created an empty neural network called nn_reg. Now, we have to add layers to it. We will use what is known as fully connected or dense layers—these are layers made of neurons that are connected with all neurons from the previous layer. In other words, every neuron in a dense layer receives the output of all the neurons from the previous layer. Our MLP will be made of dense layers. Let's import the Dense class:

from keras.layers import Dense

As we discussed in our conceptual section, the first layer in an MLP is always the input layer, the one that receives the feature values and passes them to the first hidden layer. However, in Keras, there is no need to create an input layer, because this layer is basically the features themselves. So, you will not explicitly see the input layer in the code, but conceptually it is there. With that point clear, the first layer we will add to our empty neural network is the first hidden layer. This is a special layer, because we need to specify (as a tuple) the shape of the input: the Keras documentation tells us that the first layer in a sequential model (and only the first, because the following layers can do automatic shape inference) needs to receive information about its input shape. Now, we will add this first layer:

n_input = X_train.shape[1]
n_hidden1 = 32
# adding first hidden layer
nn_reg.add(Dense(units=n_hidden1, activation='relu', input_shape=(n_input,)))

Let's understand each of the parameters:

  • units: This is the number of neurons in the layer; we are using 32
  • activation: This is the activation function that will be used in each of the neurons; we are using ReLU
  • input_shape: This is the shape of the inputs the network will receive, given as a tuple whose single element is the number of predictive features in our dataset; we don't need to specify how many samples the network will receive, as it can deal with any number of them (see the note after this list)
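
As a side note, for the first layer Keras also accepts an input_dim argument, which takes the number of features as a plain integer instead of a tuple; the following commented line would be an equivalent way of writing the first hidden layer (we will keep the input_shape version):

# Equivalent alternative, not executed here to avoid adding a duplicate layer:
# nn_reg.add(Dense(units=n_hidden1, activation='relu', input_dim=n_input))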

Our neural network now has one hidden layer; since this is a relatively simple problem and we have a relatively small dataset, we will add only two more hidden layers, for a total of three. Few people would call this a deep learning model, since we have only three layers, but the process of building and training is essentially the same with three or with 300 hidden layers. This is our first neural network model, so I consider it to be a good start. Let's add the next two hidden layers:

n_hidden2 = 16
n_hidden3 = 8
# add second hidden layer
nn_reg.add(Dense(units=n_hidden2, activation='relu'))
# add third hidden layer
nn_reg.add(Dense(units=n_hidden3, activation='relu'))

Notice the number of units we are using in our successive layers—32, 16, and 8. First, we are using powers of 2, which is a common practice in the field, and second we are shaping our network as a funnel—we are going from 32 units to 8 units; there is nothing special about this shape but, empirically, sometimes it works very well. Another common approach would be to use the same number of neurons in each hidden layer.

To finish our neural network, we need to add the final layer: the output layer. Since this is a regression problem, for each sample we want only one output, the prediction for the price, so we need to add a layer that connects the 8 outputs from the previous layer to a single output that will give us the price prediction. In this last layer, there is no need for an activation function, since we are getting the final prediction:

# output layer
nn_reg.add(Dense(units=1, activation=None))

Great! Our model architecture has been defined—our neural network, just like the other models we built before, is a function that will take the values of 21 features and will produce one number as output—the predicted price.
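
By the way, the same architecture can also be defined in a single step by passing the list of layers to the Sequential constructor; the following sketch is equivalent to what we just built layer by layer (it creates a separate model object, here called nn_reg_alt, purely for illustration):

# Equivalent, more compact definition of the same architecture
nn_reg_alt = Sequential([
    Dense(units=n_hidden1, activation='relu', input_shape=(n_input,)),
    Dense(units=n_hidden2, activation='relu'),
    Dense(units=n_hidden3, activation='relu'),
    Dense(units=1, activation=None)
])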

Our neural network has been built. In fact, if you feed it data, you will get price predictions; here, you have the predictions for the first 5 diamonds in the training set:

nn_reg.predict(X_train.iloc[:5,:])

The output will be as follows:

These are the price predictions, and they are, of course, very bad predictions. Why is this? Because every neuron in our network has randomly initialized weights, and the biases are all initialized as zeros. Keras, by default, initializes the weights with a procedure called the Glorot uniform initializer, also called the Xavier uniform initializer (Glorot & Bengio, 2010), which is one of the most popular ways of initializing neural networks and has proven very useful in practice. There are other initialization schemes, but a discussion about those is outside our scope. We are going to trust the good and smart developers of Keras and use their defaults.
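
If you ever want to make these defaults explicit, or try a different scheme, the Dense layer accepts kernel_initializer and bias_initializer arguments; as a sketch, the first hidden layer with the defaults spelled out would look like this (we don't need to change anything in our model):

# The defaults made explicit: Glorot uniform for the weights, zeros for the biases
# Dense(units=n_hidden1, activation='relu', input_shape=(n_input,),
#       kernel_initializer='glorot_uniform', bias_initializer='zeros')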

Now, it is time to start modifying these random weights and biases, little by little, using our training data; this is where we enter the training loop.

Training the MLP

Now, we will use data for training our neural network so it learns how to map the values of the features to predict the prices. I will repeat some of the things I have already said in the conceptual section, so excuse me for being a bit redundant, but the goal is to present the concepts as clearly as possible.

There are four decisions we have to make at this stage:

  • Batch size: The number of observations the network will see at each step of the training loop. This decision is actually not very complicated; for a problem such as ours, we can try a batch size of 32, 64, or 128. There is good evidence that it is better to use small numbers, no bigger than 512 (Shirish et al., 2017).
  • Number of epochs: How many times the network will see the entire training dataset to adjust the weights. Here, we need to be a bit more careful, because if there are too few epochs the network will not learn well; if there are too many, the network will overfit to our training data. Let's try 50. Why? Well, it is just my first guess, because it is a relatively simple problem. We can try other values of course, and we'll do this later.
  • Loss function: As we said before, the loss function produces the signal that will tell the network how good the predictions are. In the case of regression problems, the most commonly used loss function is the MSE, the one that we have used before in other models to measure their performance. Of course, there are other loss functions that we can use, but for the moment we will stick to MSE.
  • Optimizer: This is the element by which the network will use the signal produced by the loss function and update the weights and biases of the network. There are many choices of optimizers, and researchers keep making progress in this area, but essentially all optimizers are variations of the gradient descent (https://en.wikipedia.org/wiki/Gradient_descent) optimization algorithm. Again, this is a very technical issue, and we will use the Adam optimizer, which has become very popular because it has been shown to work well for a variety of problems. Please refer to the Further reading section for more resources about optimizers.

Once we have made these four decisions, we can compile our model—this is telling Keras both the loss function and the optimizer we want to use:

nn_reg.compile(loss='mean_squared_error', optimizer='adam')
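
Passing the string 'adam' uses the optimizer with its default settings. If you later want finer control, for example over the learning rate, you can pass an optimizer instance instead of the string; this is just an optional alternative, shown commented out because we don't need it here:

# Optional: compile with an explicit Adam instance to control the learning rate
# from keras.optimizers import Adam
# nn_reg.compile(loss='mean_squared_error', optimizer=Adam(lr=0.001))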

If you want to take a look at the architecture and the number of parameters in your model, you can use the summary method:

nn_reg.summary()

The output will be as follows:

We have a total of 1,377 weights and biases in our model.
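
As a quick sanity check on that number, each dense layer has one weight per input-unit pair plus one bias per unit, so we can reproduce the total by hand:

# (21 inputs x 32 units + 32 biases) + (32 x 16 + 16) + (16 x 8 + 8) + (8 x 1 + 1)
print((21*32 + 32) + (32*16 + 16) + (16*8 + 8) + (8*1 + 1))  # 1377

Now, we are ready to train our model using the fit method: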

batch_size = 64
n_epochs = 50
nn_reg.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size)

In the training process, you will see something such as this:

This mainly shows how the training loss is reduced with every epoch. Remember that the training loop we talked about previously is run in every epoch, in our case, 50 times. This is great! We have trained our first neural network!
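
If you would like to plot how the loss decreases, fit returns a History object whose history attribute records the training loss for every epoch. The following sketch captures that object; note that it is meant to replace the plain fit call above, because calling fit again continues training from the current weights:

# Capture the training history when fitting (instead of the plain fit call above)
history = nn_reg.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size)
# Plot the training loss recorded at each epoch
plt.plot(history.history['loss'])
plt.xlabel('Epoch')
plt.ylabel('Training loss (MSE)')
plt.show()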

Making predictions with the neural network

It is time to evaluate how good the predictions made by the network are. We will compare the training and testing performances using the MSE. But first, remember that we have to apply to our testing set the same transformations we performed on our training set:

## PCA for dimensionality reduction, using the PCA fitted on the training set
X_test['dim_index'] = pca.transform(X_test[['x','y','z']]).flatten()
X_test.drop(['x','y','z'], axis=1, inplace=True)
## Scale the numerical features with the scaler fitted on the training set
X_test.loc[:, numerical_features] = scaler.transform(X_test[numerical_features])

Now, make predictions using the predict method and calculate the MSE:

from sklearn.metrics import mean_squared_error
y_pred_train = nn_reg.predict(X_train)
y_pred_test = nn_reg.predict(X_test)
train_mse = mean_squared_error(y_true=y_train, y_pred=y_pred_train)
test_mse = mean_squared_error(y_true=y_test, y_pred=y_pred_test)
# Divide by 1e6 to report the MSE values in millions, which is easier to read
print("Train MSE: {:0.3f} \nTest MSE: {:0.3f}".format(train_mse/1e6, test_mse/1e6))

This will give us the following output:

Train MSE: 0.320 
Test MSE: 0.331

You may get different values due to the randomness involved in the training process, but they shouldn't be too far off.

This is very impressive! Here, we have the results of the models we introduced in the previous chapter:

We reduced the best testing MSE by more than half using a relatively small network!

Neural networks are very powerful indeed; although there has been a lot of hype around them, there is something truly remarkable about these models.
