Parameters in an LSTM
LSTMs are built on plain RNNs. If you simplified the LSTM and removed all the gates, retaining only the tanh function for the hidden state update, you would have a plain RNN. The information – the new input data at time t and the previous hidden state at time t-1 (x_t and h_t-1) – passes through four times as many activations in an LSTM as it does in a plain RNN: the activations are applied once in the forget gate, twice in the update gate, and once in the output gate. The number of weights/parameters in an LSTM is, therefore, four times the number of parameters in a plain RNN.
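One way to sanity-check the four-times claim is to count the parameters of equivalent layers directly. The sketch below assumes Keras (tf.keras), which is not necessarily the framework used elsewhere in this book, and uses hypothetical layer sizes; note that Keras also counts bias vectors, which the formula recalled in the next paragraph leaves out.

```python
# Minimal sketch (assuming tf.keras is available) comparing the parameter
# counts of a plain RNN layer and an LSTM layer of the same size.
import tensorflow as tf

m, n = 10, 32  # m inputs per time step, n neurons in the hidden layer (hypothetical)

inputs = tf.keras.Input(shape=(None, m))
rnn_model = tf.keras.Model(inputs, tf.keras.layers.SimpleRNN(n)(inputs))
lstm_model = tf.keras.Model(inputs, tf.keras.layers.LSTM(n)(inputs))

rnn_params = rnn_model.count_params()    # n*n + n*m + n (Keras includes a bias vector) = 1376
lstm_params = lstm_model.count_params()  # four copies of the above = 5504

print(rnn_params, lstm_params, lstm_params / rnn_params)  # ratio is exactly 4.0
```

Because neither model includes an output projection, the ratio comes out as exactly four, matching the count of weight sets inside the LSTM cell.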
In Chapter 5, Deep Learning For Sequences, in the section titled Parameters in an RNN, we calculated the number of parameters in a plain RNN and saw that we already have quite a few parameters to work with (n² + nk + nm, where n is the number of neurons in the hidden layer, m is the number of inputs, and k is the dimension of the output...