Essentially, we are backpropagating our errors through several time steps, reflecting the length of a sequence. As we know, the first thing we need in order to backpropagate our errors is a loss function. We can use any variation of the cross-entropy loss, depending on whether we are performing a binary task per sequence element (that is, entity or not, per word → binary cross-entropy) or a categorical one (that is, predicting the next word out of all the words in our vocabulary → categorical cross-entropy). The loss function here computes the cross-entropy loss between a predicted output (ŷ) and the actual value (y) at time step t:
$$E_t(y_t, \hat{y}_t) = -\big[\, y_t \log(\hat{y}_t) + (1 - y_t)\log(1 - \hat{y}_t) \,\big]$$
This function essentially lets us perform an element-wise loss computation between each predicted and actual output, at each time step of our recurrent layer. Hence, we generate a loss value at each prediction...
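To make the per-time-step computation concrete, here is a minimal NumPy sketch of the binary cross-entropy case. The function name, the example arrays `y_true` and `y_pred`, and the `eps` clipping guard are illustrative assumptions, not part of any particular library's API; the point is simply that one loss value comes out for every time step, and their sum (or mean) is what gets backpropagated through time:

```python
import numpy as np

def binary_cross_entropy_per_step(y_true, y_pred, eps=1e-12):
    """Element-wise binary cross-entropy at every time step.

    y_true, y_pred: arrays of shape (timesteps,) holding the actual
    labels and predicted probabilities for one sequence.
    eps guards against taking log(0).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Loss at step t: -[y_t*log(y_hat_t) + (1 - y_t)*log(1 - y_hat_t)]
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical sequence: one label and one prediction per word/time step
y_true = np.array([1, 0, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.1, 0.7, 0.4])

per_step_loss = binary_cross_entropy_per_step(y_true, y_pred)  # one value per step
total_loss = per_step_loss.sum()  # sequence-level loss used for backpropagation
```

For the categorical case, the same idea applies, except that each time step's prediction is a probability distribution over the vocabulary and the loss picks out the log-probability assigned to the actual next word.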