Extracting text from an image
The process of extracting text from an image is called Optical Character Recognition (OCR). It is useful when the text that needs to be processed is embedded in an image, such as the characters on license plates, road signs, or printed directions.
We can perform OCR using Tess4j (http://tess4j.sourceforge.net/), a Java JNA wrapper for the Tesseract OCR API. We will demonstrate the API on an image captured from the Wikipedia article on OCR (https://en.wikipedia.org/wiki/Optical_character_recognition#Applications). The Javadoc for the API is found at http://tess4j.sourceforge.net/docs/docs-3.0/. The image we use is shown here:
Using Tess4j to extract text
The ITesseract interface contains numerous OCR methods. The doOCR method takes a file and returns a string containing the words found in the file, as shown here:
ITesseract instance = new Tesseract();
try {
    String result = instance.doOCR(new File...
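Because doOCR returns the recognized text as a single string, later processing steps often need it broken into individual words first. The following is a minimal sketch of one way to do that, assuming the words in the output are separated by spaces and line breaks. The OcrWords class and its toWords method are hypothetical names introduced for illustration and are not part of Tess4j, and the sample string is a stand-in for a real doOCR result rather than actual OCR output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tess4j): splits OCR output into words.
class OcrWords {

    // Splits the text on runs of whitespace (spaces, tabs, line breaks)
    // and drops any empty tokens.
    static List<String> toWords(String ocrText) {
        List<String> words = new ArrayList<>();
        for (String token : ocrText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by instance.doOCR(...);
        // this is an assumed sample, not real OCR output.
        String result = "Optical  character\nrecognition\n\n";
        System.out.println(toWords(result)); // prints [Optical, character, recognition]
    }
}
```

In practice, the string passed to toWords would be the result variable from the doOCR call. Splitting on whitespace is a deliberately simple choice: punctuation stays attached to the words, so a tokenizer may be a better fit when cleaner tokens are required.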