Creating the dataset
In this chapter, we will take on the role of the bad guy. We want to create a program that can beat CAPTCHAs, allowing our comment spam program to advertise on someone's website. It should be noted that our CAPTCHAs will be a little easier than those used on the web today, and that spamming isn't a very nice thing to do.
Our CAPTCHAs will be individual English words of four letters only, as shown in the following image:
Our goal will be to create a program that can recover the word from images like this. To do this, we will use four steps:
1. Break the image into individual letters.
2. Classify each individual letter.
3. Recombine the letters to form a word.
4. Rank words with a dictionary to try to fix errors.
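To see how these four steps fit together, here is a minimal sketch of the pipeline. It is an outline under assumptions, not the implementation built in this chapter: the names `solve_captcha` and `rank_words` are hypothetical, the `segment` and `classify` arguments are placeholders for the image-splitting and letter-classification code, and ranking by the number of mismatched positions is just one simple way to compare a prediction against dictionary words.

```python
def rank_words(predicted, dictionary):
    """Step 4: order dictionary words by how many positions differ from the prediction.

    Counting mismatched positions is an assumption made for this sketch;
    other distance measures could be swapped in.
    """
    def mismatches(word):
        return sum(a != b for a, b in zip(predicted, word))

    # Only words of the same length can be compared position by position.
    candidates = [w for w in dictionary if len(w) == len(predicted)]
    return sorted(candidates, key=mismatches)


def solve_captcha(image, segment, classify, dictionary):
    """Run the four steps; `segment` and `classify` are supplied by the caller."""
    letter_images = segment(image)                       # step 1: split into letters
    letters = [classify(img) for img in letter_images]   # step 2: classify each letter
    predicted = "".join(letters)                         # step 3: recombine into a word
    ranked = rank_words(predicted, dictionary)           # step 4: rank dictionary words
    return ranked[0] if ranked else predicted


# Stand-ins for illustration only: the "image" is a string, segmentation
# splits it into characters, and the classifier mistakes G for C.
toy_dictionary = ["BONE", "GENE", "KEEN"]
noisy_classifier = lambda ch: "C" if ch == "G" else ch
print(solve_captcha("GENE", list, noisy_classifier, toy_dictionary))  # GENE
```

In this toy run, the classifier produces CENE, which is not a word. The dictionary step recovers GENE because it differs from the prediction in only one position.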
Our CAPTCHA-busting algorithm will make the following assumptions. First, the word will be a whole and valid four-character English word (in fact, we use the same dictionary for creating and busting CAPTCHAs). Second, the word will only contain uppercase letters. No symbols, numbers...