Training the network
Stochastic gradient descent (SGD) is an effective way of training deep neural networks. SGD seeks the network parameters Θ that minimize the loss function ℒ:

$$\Theta = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x_i, \Theta),$$

where $x_{1 \ldots N}$ is the training dataset.
Training proceeds in steps. At each step we pick a subset of the training set of size m (a mini-batch) and use it to approximate the gradient of the loss function with respect to the parameters Θ:

$$\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \mathcal{L}(x_i, \Theta)}{\partial \Theta}.$$
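As a rough illustration, here is a single mini-batch SGD step for a toy linear model with a mean squared error loss. The model, learning rate, and synthetic data are assumptions made for this sketch and are not part of the original text.

```python
# Minimal sketch of mini-batch SGD on a toy linear model (assumed example).
import numpy as np

rng = np.random.default_rng(0)

# Toy training set x_{1..N} with targets y_{1..N} (synthetic, for illustration).
N, d = 1000, 5
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=N)

theta = np.zeros(d)   # parameters Θ
m = 32                # mini-batch size
lr = 0.1              # learning rate (illustrative value)

for step in range(200):
    # Choose a mini-batch of size m from the training set.
    idx = rng.choice(N, size=m, replace=False)
    Xb, yb = X[idx], y[idx]

    # Gradient of the mean squared error over the mini-batch:
    # (1/m) * sum_i d/dΘ (x_i·Θ - y_i)^2 = (2/m) * Xb^T (Xb Θ - yb)
    grad = (2.0 / m) * Xb.T @ (Xb @ theta - yb)

    # SGD update: move the parameters against the gradient.
    theta -= lr * grad
```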
Mini-batch training has the following advantages:
- The gradient of the loss over a mini-batch is a better approximation of the gradient over the whole training set than one computed from a single sample, and the approximation improves as the batch size grows.
- Thanks to the GPU, the computation can run in parallel over every sample in the batch, which is much faster than processing the samples one by one (see the sketch after this list).
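To make the parallelism point concrete, the sketch below computes the same mini-batch gradient twice for the toy linear model used above: once in a Python loop over individual samples, and once as a single batched matrix operation, which is the form that maps well onto GPU hardware. The model and data are, again, illustrative assumptions.

```python
# Per-sample loop vs. batched computation of the same mini-batch gradient
# for the assumed toy linear model with a mean squared error loss.
import numpy as np

rng = np.random.default_rng(1)
m, d = 32, 5
Xb = rng.normal(size=(m, d))   # mini-batch of m samples
yb = rng.normal(size=m)
theta = rng.normal(size=d)

# Per-sample: accumulate each sample's gradient one by one.
grad_loop = np.zeros(d)
for i in range(m):
    grad_loop += 2.0 * (Xb[i] @ theta - yb[i]) * Xb[i]
grad_loop /= m

# Batched: one matrix operation over the whole mini-batch; this is the
# computation that parallelizes across samples.
grad_batched = (2.0 / m) * Xb.T @ (Xb @ theta - yb)

# Both forms give the same gradient.
assert np.allclose(grad_loop, grad_batched)
```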