Training the network
Stochastic gradient descent (SGD) is an effective way of training deep neural networks. SGD finds the parameters Θ of the network that minimize the loss function ℒ:
![](https://static.packt-cdn.com/products/9781787121515/graphics/2b8a4b89-e055-4a9b-826e-102d1339b79f.png)
where

![](https://static.packt-cdn.com/products/9781787121515/graphics/c73d3c3b-1a05-4bb7-aa0b-42a3ea79b41b.png)

is the training dataset.
Training proceeds in steps. At each step, we choose a subset of the training set of size m (a mini-batch) and use it to approximate the gradient of the loss function with respect to the parameters Θ:
![](https://static.packt-cdn.com/products/9781787121515/graphics/312af679-64c7-43a3-9254-44b6d40215aa.png)
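The following is a minimal sketch of this mini-batch update, using an illustrative linear model with a mean-squared-error loss written in NumPy; the dataset, model, and learning rate are placeholders chosen for the example, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m = 1000, 5, 32            # dataset size, feature dimension, mini-batch size
X = rng.normal(size=(N, d))      # training inputs x_1 ... x_N
true_theta = rng.normal(size=d)  # "ground truth" used only to generate toy targets
y = X @ true_theta + 0.1 * rng.normal(size=N)

theta = np.zeros(d)              # parameters Θ to be learned
lr = 0.1                         # learning rate

for step in range(500):
    # Choose a mini-batch of size m from the training set
    idx = rng.choice(N, size=m, replace=False)
    Xb, yb = X[idx], y[idx]

    # Gradient of the mean-squared-error loss over the mini-batch --
    # an approximation of the gradient over the whole training set
    grad = 2.0 / m * Xb.T @ (Xb @ theta - yb)

    # One SGD step: move the parameters against the gradient
    theta -= lr * grad
```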
The advantages of mini-batch training are as follows:
- The gradient of the loss function over a mini-batch is a better approximation of the gradient over the whole training set than one calculated from a single sample
- Thanks to GPUs, you can perform computations in parallel on every sample in the batch, which is faster than processing them one by one (see the sketch after this list)
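To illustrate the second point, here is a small NumPy sketch (the layer weights and batch are made up for the example) showing that processing a whole mini-batch as a single matrix product gives the same result as looping over samples one by one; it is this batched form that GPUs and vectorized libraries execute in parallel:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 3))        # illustrative layer weights
batch = rng.normal(size=(32, 5))   # mini-batch of 32 samples

# One-by-one: a separate matrix-vector product per sample
one_by_one = np.stack([x @ W for x in batch])

# Batched: a single matrix-matrix product over the whole mini-batch
batched = batch @ W

assert np.allclose(one_by_one, batched)
```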