Implementing backpropagation
In feedforward propagation, we connected the input layer to the hidden layer, which was then connected to the output layer. In the first iteration, we initialized weights randomly and then calculated the loss resulting from those weight values. In backpropagation, we take the reverse approach. We start with the loss value obtained in feedforward propagation and update the weights of the network in such a way that the loss value is minimized as much as possible.
The loss value is reduced as we perform the following steps:
- Change each weight within the neural network by a small amount – one at a time.
- Measure the change in loss () when the weight value is changed ().
- Update the weight by , where k is a positive value and is a hyperparameter known as the learning rate.
Note that the update made to a particular weight is proportional to the amount of loss that is reduced by changing it by a small amount. Intuitively, if changing a weight reduces the loss by a large value, then we can update the weight by a large amount. However, if the loss reduction is small by changing the weight, then we update it only by a small amount.
If the preceding steps are performed n number of times on the entire dataset (where we have done both the feedforward propagation and backpropagation), it essentially results in training for n epochs.
As a typical neural network contains thousands/millions of weights, changing the value of each weight and checking whether the loss increased or decreased is not optimal. The core step in the preceding list is the measurement of change of loss when the weight is changed. As you might have studied in calculus, measuring this is the same as computing the gradient of loss concerning the weight. There’s more on leveraging partial derivatives from calculus to calculate the gradient of the loss concerning the weight in the next section, on the chain rule for backpropagation. In this section though, we will implement gradient descent from scratch by updating one weight at a time by a small amount, as detailed at the start of this section. However, before implementing backpropagation, let’s understand one additional detail of neural networks: the learning rate.
Intuitively, the learning rate helps in building trust in the algorithm. For example, when deciding on the magnitude of the weight update, we would potentially not change the weight value by a big amount in one go but update it more slowly.
This results in obtaining stability in our model; we will look at how the learning rate helps with stability in the Understanding the impact of the learning rate section.
This whole process by which we update weights to reduce errors is called gradient descent. Stochastic gradient descent is how errors are minimized in the preceding scenario. As mentioned, gradient stands for the difference (which is the difference in loss values when the weight value is updated by a small amount) and descent means to reduce. Alternatively, gradient stands for the slope (direction of loss drop) and descent means to move toward lower loss. Stochastic stands for the selection of random samples based on which a decision is taken.
Apart from stochastic gradient descent, many other similar optimizers help to minimize loss values; the different optimizers will be discussed in the next chapter.
In the next two sections, we will learn about coding backpropagation from scratch in Python, and will also discuss, in brief, how backpropagation works using the chain rule.
Gradient descent in code
Gradient descent is implemented in Python as follows:
The following code is available as Gradient_descent.ipynb
in the Chapter01
folder of this book’s GitHub repository – https://bit.ly/mcvp-2e.
- Define the feedforward network and calculate the mean squared error loss value, as we did in the Feedforward propagation in code section:
from copy import deepcopy import numpy as np def feed_forward(inputs, outputs, weights): pre_hidden = np.dot(inputs,weights[0])+ weights[1] hidden = 1/(1+np.exp(-pre_hidden)) pred_out = np.dot(hidden, weights[2]) + weights[3] mean_squared_error = np.mean(np.square(pred_out - outputs)) retur mean_squared_error
- Increase each weight and bias value by a very small amount (0.0001) and calculate the overall squared error loss value one at a time for each of the weight and bias updates.
- In the following code, we are creating a function named
update_weights
, which performs the gradient descent process to update weights. The inputs to the function are the input variables to the network –inputs
, expectedoutputs
,weights
(which are randomly initialized at the start of training the model), and the learning rate of the model,lr
(more on the learning rate in a later section):
def update_weights(inputs, outputs, weights, lr):
- Ensure that you deepcopy the list of weights. As the weights will be manipulated in later steps,
deepcopy
ensures we can work with multiple copies of weights without disturbing the original weight values. We will create three copies of the original set of weights that were passed as an input to the function –original_weights
,temp_weights
, andupdated_weights
:
original_weights = deepcopy(weights) temp_weights = deepcopy(weights) updated_weights = deepcopy(weights)
- Calculate the loss value (original_loss) with the original set of weights by passing inputs, outputs, and original_weights through the feed_forward function:
original_loss = feed_forward(inputs, outputs, original_weights)
- We will loop through all the layers of the network:
for i, layer in enumerate(original_weights):
- There are a total of four lists of parameters within our neural network – two lists for the weight and bias parameters that connect the input to the hidden layer and another two lists for the weight and bias parameters that connect the hidden layer to the output layer. Now, we loop through all the individual parameters and, because each list has a different shape, we leverage
np.ndenumerate
to loop through each parameter within a given list:
for index, weight in np.ndenumerate(layer):
- Now we store the original set of weights in
temp_weights
. We select its index weight present in the ith layer and increase it by a small value. Finally, we compute the new loss with the new set of weights for the neural network:
temp_weights = deepcopy(weights) temp_weights[i][index] += 0.0001 _loss_plus = feed_forward(inputs, outputs, temp_weights)
-
In the first line of the preceding code, we reset
temp_weights
to the original set of weights as, in each iteration, we update a different parameter to calculate the loss when a parameter is updated by a small amount within a given epoch.
- We calculate the gradient (change in loss value) due to the weight change:
grad = (_loss_plus - original_loss)/(0.0001)
- In the following code, we are creating a function named
This process of updating a parameter by a very small amount and then calculating the gradient is equivalent to the process of differentiation.
- Finally, we update the parameter present in the corresponding ith layer and
index
ofupdated_
weights. The updated weight value will be reduced in proportion to the value of the gradient. Furthermore, instead of completely reducing it by a value equal to the gradient value, we bring in a mechanism to build trust slowly by using the learning rate –lr
(more on learning rate in the Understanding the impact of the learning rate section):
-
updated_weights[i][index] -= grad*lr
- Once the parameter values across all layers and indices within layers are updated, we return the updated weight values –
updated_weights
:
-
return updated_weights, original_loss
One of the other parameters in a neural network is the batch size considered in calculating the loss values.
In the preceding scenario, we considered all the data points to calculate the loss (mean squared error) value. However, in practice, when we have thousands (or in some cases, millions) of data points, the incremental contribution of a greater number of data points while calculating the loss value would follow the law of diminishing returns, and hence we would be using a batch size that is much smaller compared to the total number of data points we have. We will apply gradient descent (after feedforward propagation) using one batch at a time until we exhaust all data points within one epoch of training. The typical batch size considered in building a model is anywhere between 32 and 1,024. It’s usually a power of 2, and for very, very large models, depending on the scenario, the batch size can be less than 32.
Implementing backpropagation using the chain rule
So far, we have calculated gradients of loss concerning weight by updating the weight by a small amount and then calculating the difference between the feedforward loss in the original scenario (when the weight was unchanged) and the feedforward loss after updating weights. One drawback of updating weight values in this manner is that when the network is large (with more weights to update), a large number of computations are needed to calculate loss values (and in fact, the computations are to be done twice – once where weight values are unchanged, and again, where weight values are updated by a small amount). This results in more computations and hence requires more resources and time. In this section, we will learn about leveraging the chain rule, which does not require us to manually compute loss values to come up with the gradient of the loss concerning the weight value.
In the first iteration (where we initialized weights randomly), the predicted value of the output is 1.235. To get the theoretical formulation, let’s denote the weights and hidden layer values and hidden layer activations as w, h, and a, respectively, as follows:
Figure 1.15: Generalizing the weight initialization process
Note that, in the preceding diagrams, we have taken each component value of the left diagram and generalized it in the diagram on the right.
To keep it easy to comprehend, in this section, we will understand how to use the chain rule to compute the gradient of loss value with respect to only w11. The same learning can be extended to all the weights and biases of the neural network. We encourage you to practice and apply the chain rule calculation to the rest of the weights and bias values. Additionally, to keep this simple for our learning purposes, we will be working on only one data point, where the input is {1,1} and the expected output is {0}.
The chain_rule.ipynb
notebook in the Chapter01
folder of this book’s GitHub repository at https://bit.ly/mcvp-2e contains the way to calculate gradients with respect to changes in weights and biases for all the parameters in a network using the chain rule.
Given that we are calculating the gradient of loss value with w11, let’s understand all the intermediate components that are to be included while calculating the gradient through the following diagram (the components that do not connect the output to w11 are grayed out in the following diagram):
Figure 1.16: Highlighting the values (h11, a11, ŷ) that are needed to calculate the gradient of loss w.r.t w11
From the preceding diagram, we can see that w11 is contributing to the loss value through the path that is highlighted, – , , and . Let’s formulate how , , and are obtained individually.
The loss value of the network is represented as follows:
The predicted output value is calculated as follows:
The hidden layer activation value (sigmoid activation) is calculated as follows:
The hidden layer value is calculated as follows:
Now that we have formulated all the equations, let’s calculate the impact of the change in the loss value (C) with respect to the change in weight , as follows:
This is called a chain rule. Essentially, we are performing a chain of differentiations to fetch the differentiation of our interest.
Note that, in the preceding equation, we have built a chain of partial differential equations in such a way that we are now able to perform partial differentiation on each of the four components individually and, ultimately, calculate the derivative of the loss value with respect to the weight value .
The individual partial derivatives in the preceding equation are computed as follows:
- The partial derivative of the loss value with respect to the predicted output value is as follows:
- The partial derivative of the predicted output value with respect to the hidden layer activation value is as follows:
- The partial derivative of the hidden layer activation value with respect to the hidden layer value prior to activation is as follows:
Note that the preceding equation comes from the fact that the derivative of the sigmoid function is as follows:
- The partial derivative of the hidden layer value prior to activation with respect to the weight value is as follows:
With the calculation of individual partial derivatives in place, the gradient of the loss value with respect to is calculated by replacing each of the partial differentiation terms with the corresponding value, as calculated in the previous steps, as follows:
From the preceding formula, we can see that we are now able to calculate the impact on the loss value of a small change in the weight value (the gradient of the loss with respect to weight) without brute-forcing our way by recomputing the feedforward propagation again.
Next, we will go ahead and update the weight value as follows:
Working versions of the two methods 1) identifying gradients using the chain rule and then updating weights, and 2) updating weight values by learning the impact a small change in weight value can have on the loss values, resulting in the same values for updated weight values, are provided in the notebook Chain_rule.ipynb
in the Chapter01
folder of this book’s GitHub repository – https://bit.ly/mcvp-2e.
In gradient descent, we performed the weight update process sequentially (one weight at a time). By leveraging the chain rule, we learned that there is an alternative way to calculate the impact of a change in weight by a small amount on the loss value, however, with an opportunity to perform computations in parallel.
Because we are updating parameters across all layers, the whole process of updating parameters can be parallelized. Furthermore, given that in a realistic scenario, there can exist millions of parameters across layers, performing the calculation for each parameter on a different core of GPU results in the time taken to update weights is a much faster exercise than looping through each weight one at a time.
Now that we have a solid understanding of backpropagation, both from an intuition perspective and also by leveraging the chain rule, let’s learn about how feedforward and backpropagation work together to arrive at the optimal weight values.
Putting feedforward propagation and backpropagation together
In this section, we will build a simple neural network with a hidden layer that connects the input to the output on the same toy dataset that we worked on in the Feedforward propagation in code section and also leverage the update_weights
function that we defined in the previous section to perform backpropagation to obtain the optimal weight and bias values.
Note that we are not leveraging the chain rule, only to give you a solid understanding of the basics of forward and back-propagation. Starting in the next chapter, you will not be performing neural network training in this manner.
We define the model as follows:
- The input is connected to a hidden layer that has three units/ nodes.
- The hidden layer is connected to the output, which has one unit in the output layer.
The following code is available as Back_propagation.ipynb
in the Chapter01
folder of this book’s GitHub repository – https://bit.ly/mcvp-2e.
We will create the network as follows:
- Import the relevant packages and define the dataset:
from copy import deepcopy import numpy as np x = np.array([[1,1]]) y = np.array([[0]])
- Initialize the weight and bias values randomly.
The hidden layer has three units in it and each input node is connected to each of the hidden layer units. Hence, there are a total of six weight values and three bias values – one bias and two weights (two weights coming from two input nodes) corresponding to each of the hidden units. Additionally, the final layer has one unit that is connected to the three units of the hidden layer. Hence, a total of three weights and one bias dictate the value of the output layer. The randomly initialized weights are as follows:
W = [ np.array([[-0.0053, 0.3793], [-0.5820, -0.5204], [-0.2723, 0.1896]], dtype=np.float32).T, np.array([-0.0140, 0.5607, -0.0628], dtype=np.float32), np.array([[ 0.1528,-0.1745,-0.1135]],dtype=np.float32).T, np.array([-0.5516], dtype=np.float32) ]
In the preceding code, the first array of parameters corresponds to the 2 x 3 matrix of weights that connect the input layer to the hidden layer. The second array of parameters represents the bias values associated with each node of the hidden layer. The third array of parameters corresponds to the 3 x 1 matrix of weights joining the hidden layer to the output layer, and the final array of parameters represents the bias associated with the output layer.
- Run the neural network through 100 epochs of feedforward propagation and backpropagation – the functions of which were already learned and defined as
feed_forward
andupdate_weights
functions in the previous sections:- Define the
feed_forward
function:
def feed_forward(inputs, outputs, weights): pre_hidden = np.dot(inputs,weights[0])+ weights[1] hidden = 1/(1+np.exp(-pre_hidden)) pred_out = np.dot(hidden, weights[2]) + weights[3] mean_squared_error = np.mean(np.square(pred_out - outputs)) return mean_squared_error
- Define the
update_weights
function (we will learn more about the learning rate lr in the next section):
def update_weights(inputs, outputs, weights, lr): original_weights = deepcopy(weights) temp_weights = deepcopy(weights) updated_weights = deepcopy(weights) original_loss = feed_forward(inputs, outputs, original_weights) for i, layer in enumerate(original_weights): for index, weight in np.ndenumerate(layer): temp_weights = deepcopy(weights) temp_weights[i][index] += 0.0001 _loss_plus = feed_forward(inputs, outputs, temp_weights) grad = (_loss_plus - original_loss)/(0.0001) updated_weights[i][index] -= grad*lr return updated_weights, original_loss
- Update weights over 100 epochs and fetch the loss value and the updated weight values:
losses = [] for epoch in range(100): W, loss = update_weights(x,y,W,0.01) losses.append(loss)
- Define the
- Plot the loss values:
import matplotlib.pyplot as plt %matplotlib inline plt.plot(losses) plt.title('Loss over increasing number of epochs') plt.xlabel('Epochs') plt.ylabel('Loss value')
The preceding code generates the following plot:
Figure 1.17: Loss value over increasing epochs
As you can see, the loss started at around 0.33 and steadily dropped to around 0.0001. This is an indication that weights are adjusted according to the input-output data, and when an input is given, we can expect it to predict the output that we have been comparing it against in the loss function. The output weights are as follows:
[array([[ 0.01424004, -0.5907864 , -0.27549535], [ 0.39883757, -0.52918637, 0.18640439]], dtype=float32), array([ 0.00554004, 0.5519136 , -0.06599568], dtype=float32), array([[ 0.3475135 ], [-0.05529078], [ 0.03760847]], dtype=float32), array([-0.22443289], dtype=float32)]
The PyTorch version of the same code with the same weights is demonstrated in the file
Auto_gradient_of_tensors.ipynb
in theChapter02
folder in the GitHub repository at https://bit.ly/mcvp-2e. Revisit this section after understanding the core PyTorch concepts in the next chapter. Verify for yourself that the input and output are indeed the same regardless of whether the network is written in NumPy or PyTorch.Building a network from scratch using NumPy arrays, while sub-optimal, is done in this chapter to give you a solid foundation in the working details of neural networks.
- Once we have the updated weights, make the predictions for the input by passing the input through the network and calculate the output value:
pre_hidden = np.dot(x,W[0]) + W[1] hidden = 1/(1+np.exp(-pre_hidden)) pred_out = np.dot(hidden, W[2]) + W[3] # -0.017
The output of the preceding code is the value of -0.017
, which is a value that is very close to the expected output of 0. As we train for more epochs, the pred_out
value gets even closer to 0.
So far, we have learned about feedforward propagation and backpropagation. The key piece in the update_weights
function that we defined here is the learning rate, which we will learn about in the next section.