A CAPTCHA is a system intended to prevent automated access or scraping. It does so by posing challenges designed to tell whether the user is a human or a program. You have probably seen countless variations of the following screenshot:
Sometimes the request is to type in a code; sometimes it is to select certain objects, such as storefronts or traffic lights, in a series of images; and sometimes the CAPTCHA is a math question. In this chapter, we are going to break a simple CAPTCHA system called Really Simple CAPTCHA:
Despite its simplicity, Really Simple CAPTCHA is still widely used. Most importantly, it will illustrate how to approach breaking other, more complicated, CAPTCHA systems.
The first step will be to process the CAPTCHA dataset so that it is convenient for machine learning. The most naive approach to the problem is likely...