In the previous section, we learned about generating sequences of words from an input image. In this section, we will learn about generating a sequence of characters from an image. Furthermore, we will learn about the CTC loss function, which helps in transcribing images of handwritten text.
Before we learn about the CTC loss function, let's understand why the architecture we saw in the image captioning section might not apply to handwriting transcription. In image captioning, there is no straightforward correspondence between individual regions of the image and the output words. In a handwritten image, however, there is a direct correlation between the sequence of characters present in the image and the sequence of characters in the output. Thus, we will follow a different architecture from the one we designed in the previous section.
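To make the upcoming discussion concrete, here is a minimal sketch of how PyTorch's built-in `nn.CTCLoss` is invoked. The shapes and values below (20 timesteps, 28 classes, a random target of 8 characters) are illustrative assumptions, not taken from the original text; the key idea is that the model emits per-timestep log-probabilities over the character vocabulary plus a blank token, and CTC scores them against a target sequence that can be shorter than the number of timesteps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed shapes for illustration:
# T = timesteps (e.g., horizontal slices of the image),
# N = batch size, C = number of classes (characters + 1 blank token)
T, N, C = 20, 1, 28

# Per-timestep log-probabilities, as produced by log_softmax
# over the class dimension (here random, standing in for a model's output)
log_probs = torch.randn(T, N, C).log_softmax(2)

# A random target sequence of 8 character indices;
# index 0 is reserved for the blank token (nn.CTCLoss default)
targets = torch.randint(1, C, (N, 8))

# Every input has all T timesteps; every target has 8 characters
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Note that CTC only requires the input length to be at least the target length; it marginalizes over all alignments of the 8 target characters (interleaved with blanks and repeats) across the 20 timesteps, which is precisely why no per-timestep character labels are needed.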
In addition, assume a scenario where an image is divided into 20 portions (assuming a scenario of a maximum of 20 characters...