You're reading from Neural Networks with Keras Cookbook Over 70 recipes leveraging deep learning techniques across image, text, audio, and game bots

Product type Paperback

Published in Feb 2019

Publisher Packt

ISBN-13 9781789346640

Length 568 pages

Edition 1st Edition

Languages

Python

Tools

Keras

Concepts

Deep Learning

Authors (2):

V Kishore Ayyadevara

Srinivas Pradeep

View More author details

Feed-forward propagation from scratch in Python

In order to build a strong foundation of how feed-forward propagation works, we'll go through a toy example of training a neural network where the input to the neural network is (1, 1) and the corresponding output is 0.

Getting ready

The strategy that we'll adopt is as follows: our neural network will have one hidden layer (with neurons) connecting the input layer to the output layer. Note that we have more neurons in the hidden layer than in the input layer, as we want to enable the input layer to be represented in more dimensions:

Calculating the hidden layer unit values

We now assign weights to all of the connections. Note that these weights are selected randomly (based on Gaussian distribution) since it is the first time we're forward-propagating. In this specific case, let's start with initial weights that are between 0 and 1, but note that the final weights after the training process of a neural network don't need to be between a specific set of values:

In the next step, we perform the multiplication of the input with weights to calculate the values of hidden units in the hidden layer.

The hidden layer's unit values are obtained as follows:

The hidden layer's unit values are also shown in the following diagram:

Note that in the preceding output we calculated the hidden values. For simplicity, we excluded the bias terms that need to be added at each unit of a hidden layer.

Now, we will pass the hidden layer values through an activation function so that we attain non-linearity in our output.

If we do not apply the activation function in the hidden layer, the neural network becomes a giant linear connection from input to output.

Applying the activation function

Activation functions are applied at multiple layers of a network. They are used so that we achieve high non-linearity in input, which can be useful in modeling complex relations between the input and output.

The different activation functions are as follows:

For our example, let’s use the sigmoid function for activation. The sigmoid function looks like this, graphically:

By applying sigmoid activation, S(x), to the three hidden=layer sums, we get the following:

final_h₁ = S(1.0) = 0.73

final_h₂ = S(1.3) = 0.78

final_h₃ = S(0.8) = 0.69

Calculating the output layered values

Now that we have calculated the hidden layer values, we will be calculating the output layer value. In the following diagram, we have the hidden layer values connected to the output through the randomly-initialized weight values. Using the hidden layer values and the weight values, we will calculate the output values for the following network:

We perform the sum product of the hidden layer values and weight values to calculate the output value. For simplicity, we excluded the bias terms that need to be added at each unit of the hidden layer:

0.73 * 0.3 + 0.79 * 0.5 + 0.69 * 0.9 = 1.235

The values are shown in the following diagram:

Because we started with a random set of weights, the value of the output neuron is very different from the target, in this case by +1.235 (since the target is 0).

Calculating the loss values

Loss values (alternatively called cost functions) are values that we optimize in a neural network. In order to understand how loss values get calculated, let's look at two scenarios:

Continuous variable prediction
Categorical variable prediction

Calculating loss during continuous variable prediction

Typically, when the variable is a continuous one, the loss value is calculated as the squared error, that is, we try to minimize the mean squared error by varying the weight values associated with the neural network:

In the preceding equation, y(i) is the actual value of output, h(x) is the transformation that we apply on the input (x) to obtain a predicted value of y, and m is the number of rows in the dataset.

Calculating loss during categorical variable prediction

When the variable to predict is a discrete one (that is, there are only a few categories in the variable), we typically use a categorical cross-entropy loss function. When the variable to predict has two distinct values within it, the loss function is binary cross-entropy, and when the variable to predict has multiple distinct values within it, the loss function is a categorical cross-entropy.

Here is binary cross-entropy:

(ylog(p)+(1−y)log(1−p))

Here is categorical cross-entropy:

y is the actual value of output p, is the predicted value of the output and n is the total number of data points. For now, let's assume that the outcome that we are predicting in our toy example is continuous. In that case, the loss function value is the mean squared error, which is calculated as follows:

error = 1.235² = 1.52

In the next step, we will try to minimize the loss function value using back-propagation (which we'll learn about in the next section), where we update the weight values (which were initialized randomly earlier) to minimize the loss (error).

How to do it...

In the previous section, we learned about performing the following steps on top of the input data to come up with error values in forward-propagation (the code file is available as Neural_network_working_details.ipynb in GitHub):

Initialize weights randomly
Calculate the hidden layer unit values by multiplying input values with weights
Perform activation on the hidden layer values
Connect the hidden layer values to the output layer
Calculate the squared error loss

A function to calculate the squared error loss values across all data points is as follows:

import numpy as np
def feed_forward(inputs, outputs, weights):
     pre_hidden = np.dot(inputs,weights[0])+ weights[1]
     hidden = 1/(1+np.exp(-pre_hidden))
     out = np.dot(hidden, weights[2]) + weights[3]
     squared_error = (np.square(pred_out - outputs))
     return squared_error

In the preceding function, we take the input variable values, weights (randomly initialized if this is the first iteration), and the actual output in the provided dataset as the input to the feed-forward function.

We calculate the hidden layer values by performing the matrix multiplication (dot product) of the input and weights. Additionally, we add the bias values in the hidden layer, as follows:

pre_hidden = np.dot(inputs,weights[0])+ weights[1]

The preceding scenario is valid when weights[0] is the weight value and weights[1] is the bias value, where the weight and bias are connecting the input layer to the hidden layer.

Once we calculate the hidden layer values, we perform activation on top of the hidden layer values, as follows:

hidden = 1/(1+np.exp(-pre_hidden))

We now calculate the output at the hidden layer by multiplying the output of the hidden layer with weights that connect the hidden layer to the output, and then adding the bias term at the output, as follows:

pred_out = np.dot(hidden, weights[2]) + weights[3]

Once the output is calculated, we calculate the squared error loss at each row, as follows:

squared_error = (np.square(pred_out - outputs))

In the preceding code, pred_out is the predicted output and outputs is the actual output.

We are then in a position to obtain the loss value as we forward-pass through the network.

While we considered the sigmoid activation on top of the hidden layer values in the preceding code, let's examine other activation functions that are commonly used.

Tanh

The tanh activation of a value (the hidden layer unit value) is calculated as follows:

def tanh(x):
    return (exp(x)-exp(-x))/(exp(x)+exp(-x))

ReLu

The Rectified Linear Unit (ReLU) of a value (the hidden layer unit value) is calculated as follows:

def relu(x):
    return np.where(x>0,x,0)

Linear

The linear activation of a value is the value itself.

Softmax

Typically, softmax is performed on top of a vector of values. This is generally done to determine the probability of an input belonging to one of the n number of the possible output classes in a given scenario. Let's say we are trying to classify an image of a digit into one of the possible 10 classes (numbers from 0 to 9). In this case, there are 10 output values, where each output value should represent the probability of an input image belonging to one of the 10 classes.

The softmax activation is used to provide a probability value for each class in the output and is calculated explained in the following sections:

def softmax(x):
    return np.exp(x)/np.sum(np.exp(x))

Apart from the preceding activation functions, the loss functions that are generally used while building a neural network are as follows.

Mean squared error

The error is the difference between the actual and predicted values of the output. We take a square of the error, as the error can be positive or negative (when the predicted value is greater than the actual value and vice versa). Squaring ensures that positive and negative errors do not offset each other. We calculate the mean squared error so that the error over two different datasets is comparable when the datasets are not the same size.

The mean squared error between predicted values (p) and actual values (y) is calculated as follows:

def mse(p, y):
    return np.mean(np.square(p - y))

The mean squared error is typically used when trying to predict a value that is continuous in nature.

Mean absolute error

The mean absolute error works in a manner that is very similar to the mean squared error. The mean absolute error ensures that positive and negative errors do not offset each other by taking an average of the absolute difference between the actual and predicted values across all data points.

The mean absolute error between the predicted values (p) and actual values (y) is implemented as follows:

def mae(p, y):
    return np.mean(np.abs(p-y))

Similar to the mean squared error, the mean absolute error is generally employed on continuous variables.

Categorical cross-entropy

Cross-entropy is a measure of the difference between two different distributions: actual and predicted. It is applied to categorical output data, unlike the previous two loss functions that we discussed.

Cross-entropy between two distributions is calculated as follows:

y is the actual outcome of the event and p is the predicted outcome of the event.

Categorical cross-entropy between the predicted values (p) and actual values (y) is implemented as follows:

def cat_cross_entropy(p, y):
     return -np.sum((y*np.log2(p)+(1-y)*np.log2(1-p)))

Note that categorical cross-entropy loss has a high value when the predicted value is far away from the actual value and a low value when the values are close.

You're reading from Neural Networks with Keras Cookbook Over 70 recipes leveraging deep learning techniques across image, text, audio, and game bots

Table of Contents (18) Chapters

Feed-forward propagation from scratch in Python

Getting ready

How to do it...

Authors (1)

Other recommended products

Personalised recommendations for you