We now know how the memory at time (t) is calculated, but what about the contender (c̃t) itself? After all, it is partially responsible for keeping the memory state relevant, by proposing possibly useful representations at each timestep.
This is the same idea that we saw in the GRU unit, where we allow the possibility for memory values to be updated using a contender value at each timestep. Earlier, with the GRU, we used a relevance gate to help compute this contender value. That gate is not necessary in the case of the LSTM, and we get a much simpler and arguably more elegant formulation, as follows:
- Contender memory value: c̃t = tanh(Wc [at-1, xt] + bc)
Here, Wc is a weight matrix that is initialized at the beginning of a training session and iteratively updated as the network trains. The dot product of Wc and the concatenation of the previous activation (at-1) with the current input (xt) is offset by the bias term bc, and the result is passed through the tanh activation, squashing the contender value into the range (-1, 1).
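To make this computation concrete, here is a minimal NumPy sketch of a single contender-memory step. The dimensions n_a and n_x, the random initialization, and the variable names are illustrative assumptions for this sketch, not part of the formulation itself:

```python
import numpy as np

# Illustrative, assumed dimensions: n_a hidden units, n_x input features
n_a, n_x = 4, 3

rng = np.random.default_rng(0)
Wc = rng.standard_normal((n_a, n_a + n_x)) * 0.01  # weight matrix, learned during training
bc = np.zeros((n_a, 1))                            # bias term

a_prev = rng.standard_normal((n_a, 1))  # activation from the previous timestep, at-1
x_t = rng.standard_normal((n_x, 1))     # input at the current timestep, xt

# Concatenate [at-1, xt], apply the affine transform, then squash with tanh
concat = np.vstack([a_prev, x_t])
c_tilde = np.tanh(Wc @ concat + bc)  # contender memory value, c̃t

print(c_tilde.shape)  # (n_a, 1): one contender value per memory unit
```

In a trained network, Wc and bc would of course come from gradient descent rather than random initialization; the sketch only shows the shape of the forward computation.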