All the magic in this model lies behind the RNN cells. In our simple example, each cell applies the same equations, just with a different set of variables. A detailed version of a single cell looks like this:
First, let's explain the new terms that appear in the preceding diagram:
- Weights (U, W, V): A weight is a matrix (or a number) that represents the strength of the value it is applied to. For example, U determines how much of the input x_t should be considered in the following equations. If U consists of high values, then x_t will have a significant influence on the end result. The weight values are often initialized randomly or drawn from a distribution (such as a normal/Gaussian distribution). It is important to note that U, W, and V are the same for each step (see the initialization sketch after this list). Using the backpropagation algorithm, they are modified with the aim of producing accurate predictions
- Biases (b, c): An offset vector (different for each layer), which adds a change to the value of the output
- Activation function (tanh): This determines the final value of the current memory state s_t and the output o_t. Basically, an activation function maps the result of equations similar to the following ones into a desired range: (-1, 1) if we are using the tanh function, (0, 1) if we are using the sigmoid function, and (0, +infinity) if we are using ReLU (https://ai.stackexchange.com/questions/5493/what-is-the-purpose-of-an-activation-function-in-neural-networks)
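To make the shapes of these terms concrete, here is a minimal NumPy sketch that initializes the weights and biases. The vocabulary size of 20,000 matches the example later in this section, while the 100-dimensional memory state is purely an illustrative assumption:

```python
import numpy as np

vocab_size = 20000   # size of the text corpus, as in the example below
hidden_size = 100    # dimension of the memory state s_t (an illustrative assumption)

# Weights, drawn from a normal (Gaussian) distribution and shared across all time steps
U = np.random.randn(hidden_size, vocab_size) * 0.01   # applied to the input x_t
W = np.random.randn(hidden_size, hidden_size) * 0.01  # applied to the previous state s_{t-1}
V = np.random.randn(vocab_size, hidden_size) * 0.01   # applied to s_t when computing the output o_t

# Biases, one per layer
b = np.zeros(hidden_size)  # added when computing the memory state s_t
c = np.zeros(vocab_size)   # added when computing the output o_t
```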
Now, let's go over the process of computing the variables. To calculate s_t and o_t, we can do the following:

s_t = tanh(U·x_t + W·s_{t-1} + b)

o_t = softmax(V·s_t + c)
As you can see, the memory state s_t is a result of the previous value s_{t-1} and the input x_t. Using this formula helps in retaining information about all the previous states.
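As a minimal sketch (with toy dimensions rather than a real 20,000-word vocabulary), the same update can be applied step by step, with each new state depending on the previous one:

```python
import numpy as np

def memory_state(x_t, s_prev, U, W, b):
    """s_t = tanh(U·x_t + W·s_{t-1} + b): the new state mixes the input with the previous state."""
    return np.tanh(U @ x_t + W @ s_prev + b)

# Toy dimensions: a 3-word vocabulary and a 2-dimensional memory state
U = np.array([[0.1, 0.2, 0.0],
              [0.0, 0.3, 0.1]])
W = np.array([[0.5, 0.0],
              [0.0, 0.5]])
b = np.zeros(2)

s = np.zeros(2)                       # initial memory state
for word_index in [0, 2, 1]:          # a short input sequence
    x_t = np.zeros(3)
    x_t[word_index] = 1.0             # one-hot input for this step
    s = memory_state(x_t, s, U, W, b)
print(s)                              # depends on every input seen so far
```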
The input x_t is a one-hot representation of the word volunteer. Recall from before that one-hot encoding is a type of word embedding. If the text corpus consists of 20,000 unique words and volunteer is the 19th word, then x_t is a 20,000-dimensional vector where all elements are 0 except the one at the 19th position, which has a value of 1, indicating that only this particular word is taken into account.
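For illustration, such a one-hot vector can be built in a couple of lines (index 18 corresponds to the 19th word, assuming zero-based indexing):

```python
import numpy as np

vocab_size = 20000
volunteer_index = 18            # the 19th word, assuming zero-based indexing

x_t = np.zeros(vocab_size)      # all elements start at 0
x_t[volunteer_index] = 1.0      # only the position of "volunteer" is set to 1
```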
The sum of U·x_t, W·s_{t-1}, and the bias b is passed to the tanh activation function, which squashes the result between -1 and 1 using the following formula:

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

In this, e ≈ 2.71828 (Euler's number) and z is any real number.
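A direct translation of this formula into code might look as follows; in practice you would simply call np.tanh, which gives the same result:

```python
import numpy as np

def tanh(z):
    """tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z)), which squashes z into (-1, 1)."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

print(tanh(0.0))                              # 0.0
print(np.allclose(tanh(2.5), np.tanh(2.5)))   # True -- matches NumPy's built-in tanh
```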
The output o_t at time step t is calculated using s_t and the softmax function. This function can be categorized as an activation function, with the exception that its primary usage is at the output layer, when a probability distribution is needed. For example, predicting the correct outcome in a classification problem can be achieved by picking the most probable value from a vector whose elements all sum up to 1. Softmax produces this vector, as follows:

softmax(z)_i = e^(z_i) / Σ_(j=1..K) e^(z_j)

In this, e ≈ 2.71828 (Euler's number) and z is a K-dimensional vector. The formula calculates the probability of the value at the ith position in the vector z.
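A minimal sketch of the softmax formula is shown below; subtracting the maximum value is only a numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = e^(z_i) / sum_j e^(z_j); the output elements sum to 1."""
    e = np.exp(z - np.max(z))   # subtracting the max keeps the exponentials from overflowing
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)         # approximately [0.66 0.24 0.10]
print(probs.sum())   # 1.0
```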
After applying the softmax function, o_t becomes a vector of the same dimension as x_t (the corpus size, 20,000), with all of its elements summing to 1. With that in mind, finding the predicted word from the text corpus becomes straightforward.
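Putting all of the pieces together, a single RNN step and the final word prediction can be sketched as follows; the five-word vocabulary and the random parameters are toy assumptions used only for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V, b, c):
    """One cell step: s_t = tanh(U·x_t + W·s_{t-1} + b), o_t = softmax(V·s_t + c)."""
    s_t = np.tanh(U @ x_t + W @ s_prev + b)
    o_t = softmax(V @ s_t + c)
    return s_t, o_t

# Toy setup: a 5-word "corpus" instead of 20,000 words, and a 4-dimensional memory state
vocab = ["i", "want", "to", "volunteer", "today"]
vocab_size, hidden_size = len(vocab), 4

rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.01, size=(vocab_size, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(vocab_size)

x_t = np.zeros(vocab_size)
x_t[vocab.index("volunteer")] = 1.0    # one-hot input for the word "volunteer"
s_prev = np.zeros(hidden_size)         # previous memory state

s_t, o_t = rnn_step(x_t, s_prev, U, W, V, b, c)
print(o_t.sum())                       # 1.0 -- o_t is a probability distribution
print(vocab[int(np.argmax(o_t))])      # the highest-probability index gives the predicted word
```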