In the previous section, we learned how to normalize the data and initialize the weights, along with choosing a good activation function that can dramatically speed up neural network learning time.
In this section, we'll take a step further by optimizing our algorithm and the way we update the weights in a backward pass.
To do this, we need to revisit how a neural network learns. We begin with training data of size m, and each of the examples depicted in this section has n features, and for each of the examples, we also have its corresponding prediction value:
What we want is a neural network to learn from these examples to predict the results for its own examples. The first step is to do the forward pass with just a multiplication with these m examples. This is done using matrices for efficiency, because all this multiplication can be done in parallel and thus we can make good use of the GPU and CPU power. This will produce the m hypothesis. After that, it's time to see how good our hypothesis is doing compared to real values and hypothetical ones. This is done by using the cost function, which is the averaging difference between the hypothesis and all the examples.
After this, we will do the backward pass, where we simply update the weight in such a way that the next time we run this network, we'll have a better hypothesis. This is done by calculating the derivative of the cost function by all the weights, and subtracting this from the current weights value. If we normalize our data and initialize it with Xavier, as depicted in the following diagram, we notice the progress to the minimum value where the hypothesis is almost equal to the real value:
One forward pass and its simultaneous backward pass through the data is called an epoch. We need approximately 100 to 150 epochs to go to the minimum value.
Previously, we had a lower number of examples and it was alright to run the network for 100-150 epochs. But these days, with 1 million examples in a dataset, the amount of time consumed to run the network or even to move from one step to the other will be ridiculously long. Of course, the reason for this is that we are multiplying each weight with a matrix consisting of 1 million examples here. This will obviously slow down the learning drastically. To make it worse, this would happen for 100-150 epochs, which makes this practically impossible.
One way to improve this is to update the weights for each example, instead of waiting until the network has seen all 1 million of examples. The network will now look as follows:
When initialized on Xavier, this is what it looks as in following diagram:
The positive aspect of this is that we get to the minimum point really quick. The downside of this is that the progress is really noisy. We have a lot of oscillating values and sometimes the point does not move toward the minimum point, and instead moves backward.
This is not a huge obstacle as it can be resolved to a certain extent by reducing the learning rate. Thus, we can attain better results.
The greatest disadvantage of this method is that we actually do not make appropriate use of the parallelism and the great processing power of the CPUs and GPUs, as we are multiplying the weights with just one example.
In order to reap the benefits of both the methods, we can use a mini-batch gradient descent, which, instead of using 1 or 1 million examples, uses k-number of examples. The value of the k-number can be more or less; it could be 500, 1,000, or 32. It is a value that would make the matrix big enough to use the maximum processing power of GPUs or even CPUs. If we initialize Xavier, it looks something like the following:
Even here, we observe that the progress to the center is not as noisy as in the stochastic gradient distant with the one example we have seen so far.
If we take k as the size of the example or the mini-batch gradient before updating the weights, then it will take us m/k iterations to complete sifting through all the data; that is, to do one epoch. The normally-used k-values are 16, 32, 64,...1,024. Basically, just a power of two, as it provides good results.
In general cases, we need several epochs to go to the minimum point, which means for one epoch, we need m/k iterations to go through all the data. The number of epochs varies from case to case; we just need to see how our neural network progresses.
The mini-batch gradient descent is the default choice and is really efficient in production. Even if the oscillating values were improved, they were quite evident, so let's see how we can improve this.
Let's suppose that we have the data with oscillating values, as depicted in the following graph:
The aim is to have a smooth curve to our data, instead of these oscillating values. To do this, we could trust a new common example, but it would be better to take the average of the new example and an older example, and instead trust that average. Of course, we don't want a simple average here, we just want more recent values to have a greater impact on our output. The following mathematical function, which is the equation for expanded weighted average, helps us do this:
In order to understand why this works, let us consider the example for . The equation for , , and , and will be as follows:
Substitute the values of and in the equation of , which is mathematically shown as follows:
Substituting these values, we get the following:
Observe that the weights here are different from each of these examples. If we replace the value of with 0.5, we can see the gradual decrease in the weights across the equation. This gives us the possibility to have a weighted average, such that values that are not important have smaller weights, leading it to contribute less to the overall average.
It just so happens that this greatly affects the value of the result. Let us look at a comparative analysis of various values of . If = 0.2, 0.5, 0.7, and 0.9, here is what your graph will look like:
When the value is is 0.2, the orange line in the graph depicting the output of is almost identical to the blue line. As the value of is increased, we can see that the output has fewer oscillations and a smoother curve.
Let's understand how to use this technique to smooth out the updates of the weights in the network. Instead of taking on the form of oscillating values, we want a more linear update:
The first step is to modify the way we update the weights, instead of immediately trusting the derivative cost function with a weight with the help of the equation. Using all the equations described previously, we make our results more linear in nature.
There are two more ways to smooth out oscillations:
- RSMprop: This has the same core concept and intuition as the method explained previously. It is mathematically expressed as follows:
In this case, instead of the derivative, we have to use a squared derivative. Similarly, we do not take the derivative of the cost value but instead divide it with the square root of . This only works because we have with a large value, thus, automatically increasing the value of . We divide the cost function by a large value to get smaller values and smooth out the oscillations.
- ADAM: The ADAM method is simply a merge of what we've seen so far. We calculate the value of and using the following formulas:
Instead of trusting a new example, we update the weights using this formula:
Here, is divided by the square root of .
ADAM is a really helpful technique and is used widely across neural networks.
Fortunately, most of the time, we don't need really to vary the and parameters. The default parameters work just fine. varies rarely from 0.85 to 0.9. almost never changes and stays constant at 0.999.