Using optical character recognition to find sensitive information in images
When performing penetration tests, you might run across a large number of images. These could be images in S3 buckets or on a file share, or images uploaded to a helpdesk service ticket or JIRA, such as screenshots from an application where customers can submit bugs. Such images might contain Personally Identifiable Information (PII), and sometimes even passwords or access keys.
Important Note
In practice, you are more likely to find sensitive data such as credit card numbers and phone numbers in images than passwords.
A useful tool for performing optical character recognition is Tesseract, which was originally developed by HP and can be found here: https://github.com/tesseract-ocr/tesseract.
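To make the idea concrete before the setup steps, here is a minimal Python sketch of how OCR output could be searched for sensitive data. It assumes the tesseract binary is already installed and on the PATH, and it relies on the command-line form `tesseract <image> stdout`, which prints the recognized text to standard output. The function names and the two patterns (16-digit card numbers checked with the Luhn algorithm, and AWS access key IDs beginning with AKIA) are illustrative assumptions, not part of Tesseract, and would need tuning for a real engagement.

```python
import re
import subprocess
from pathlib import Path

# Illustrative patterns (assumptions): 16-digit card numbers, optionally
# grouped in fours by spaces or dashes, and AWS access key IDs ("AKIA" + 16 chars).
CARD_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")
AWS_KEY_RE = re.compile(r"\bAKIA[A-Z0-9]{16}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ocr_image(path: Path) -> str:
    """Run the tesseract CLI on one image and return the recognized text."""
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (kind, value) pairs for likely sensitive strings in the text."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_ok(digits):  # drop random 16-digit numbers
            hits.append(("card", digits))
    for match in AWS_KEY_RE.finditer(text):
        hits.append(("aws_access_key_id", match.group()))
    return hits
```

A directory of screenshots could then be scanned by calling `find_sensitive(ocr_image(p))` for each path `p` returned by `Path("loot").glob("*.png")`, where `loot` is a hypothetical directory name. OCR output is noisy, so treat hits as leads to verify by eye rather than confirmed findings.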
The following steps describe how to set up and use Tesseract:
- To get started on Ubuntu, just install it with apt:
  $ sudo apt install tesseract-ocr
- Afterward,...