Viterbi decoding
A straightforward way to predict the sequence of labels is to output, for each word, the label that has the highest activation from the previous layers of the network. However, this can be sub-optimal because it assumes that each label prediction is independent of the predictions before and after it. The Viterbi algorithm instead takes the predictions for each word in the sequence and finds the label sequence with the highest overall likelihood. In other words, Viterbi decoding maximizes over the entire sequence rather than optimizing at each word separately. In future chapters, we will see another way of accomplishing the same objective through beam search. To illustrate this algorithm and way of thinking, let's take an example of a sentence of 5 words and a set of 3 labels, for example O, B-geo, and I-geo.
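To make the contrast concrete, here is a minimal sketch of both decoding strategies for a 5-word sentence and the 3 labels mentioned above. The per-word scores and the transition scores are made-up numbers chosen purely for illustration (they are not the values produced by the model in this chapter), and the function names greedy_decode and viterbi_decode are hypothetical helpers written for this example. Scores are treated as additive, as log-probabilities would be.

```python
LABELS = ["O", "B-geo", "I-geo"]

# Hypothetical per-word label scores: one row per word, one column per label.
emissions = [
    [2.0, 0.5, 0.1],
    [1.0, 0.9, 1.2],
    [0.3, 0.4, 1.5],
    [2.0, 0.2, 0.3],
    [1.8, 0.3, 0.2],
]

# Hypothetical transition scores: transitions[i][j] scores moving from
# label i to label j. O -> I-geo is heavily penalised.
transitions = [
    [0.5, 0.0, -5.0],   # from O
    [0.0, -1.0, 1.0],   # from B-geo
    [0.0, -1.0, 0.5],   # from I-geo
]


def greedy_decode(emissions):
    """Pick the highest-scoring label for each word independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """Return the label path with the highest total score, and that score."""
    n_labels = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each label
    backpointers = []             # best previous label, per step and per label
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(n_labels):
            best_prev = max(
                range(n_labels),
                key=lambda i: scores[i] + transitions[i][j],
            )
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][j] + emit[j])
        scores = new_scores
        backpointers.append(pointers)

    # Start from the best final label and follow the backpointers to the start.
    last = max(range(n_labels), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


greedy_path = greedy_decode(emissions)
viterbi_path, best_score = viterbi_decode(emissions, transitions)

print([LABELS[i] for i in greedy_path])   # ['O', 'I-geo', 'I-geo', 'O', 'O']
print([LABELS[i] for i in viterbi_path])  # ['O', 'B-geo', 'I-geo', 'O', 'O']
```

With these numbers, the per-word choice produces an I-geo label directly after O, which is not a valid way to begin an entity. Decoding over the whole sequence accepts a slightly lower score for the second word in exchange for a sequence whose transitions are all plausible.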
This algorithm needs the transition matrix values between labels. Recall that this was generated and stored...