One limitation of applying supervised learning to handwritten text recognition or speech transcription is that, with a traditional approach, we would have to label which part of the image contains each character (in the case of handwriting recognition) or which segment of the audio contains each phoneme (multiple phonemes combine to form a word utterance).
However, providing the ground truth for every character in an image, or every phoneme in a recording, is prohibitively costly when building a dataset with thousands of words or hundreds of hours of speech to transcribe.
Connectionist temporal classification (CTC) comes in handy here: it removes the need to know the mapping from different parts of the input to individual characters. In this section, we will learn how the CTC loss works.