Extracting text from images with Tesseract OCR, OpenCV, and Python
It is easy for humans to understand the contents of an image by just looking at it. You can recognize the text on the image and can understand it without much difficulty. However, computers don’t function similarly. They only understand information that is organized. And this is exactly where Optical Character Recognition comes in the picture. In my previous blog, I explained the basics of OCR and 3 important things that you should be aware of about OCR. As promised to my readers, I am back with my second blog. This time I am going to elaborate more on OCR especially about extracting information from an image. And just like always, with automation, you can take this to the next level. Automating the task of extracting text from images will help you to maintain and to analyze records. This blog majorly focuses on the OCR’s application areas using Tesseract OCR, OpenCV, installation & environment setup, coding, and limitations of Tesseract. So, let’s begin.
Tesseract OCR
Tesseract is an open-source text recognition engine that is available under the Apache 2.0 license and its development has been sponsored by Google since 2006. In the year 2006, Tesseract was considered as one of the most accurate open-source OCR engines. You can use it directly or can use the API to extract the printed text from images. The best part is that it supports an extensive variety of languages. It is through wrappers that Tesseract can be made compatible with different programming languages and frameworks. In this blog, I’ll be using the Python wrapper named pytesseract. It is used to recognize text from a large document, or it can also be used to recognize text from an image of a single text line. Below is the visual representation of the Tesseract OCR architecture as represented in the Voting-Based OCR System research paper.
Talking about the Tesseract 4.00, it has a configured text line recognizer in its new neural network subsystem. These days people typically use a Convolutional Neural Network (CNN) to recognize an image that contains a single character. Text that has arbitrary length and a sequence of characters is solved using Recurrent Neural Network (RNNs) and Long short-term memory (LSTM) where LSTM is a popular form of RNN. The Tesseract input image in LSM is processed in boxes (rectangle) line by line that inserts into the LSTM model and gives the output.
By default, Tesseract considers the input image as a page of text in segments. You can configure Tesseract’s different segmentations if you are interested in capturing a small region of text from the image. You can do it by assigning –psm mode to it. Tesseract fully automates the page segmentation but it does not perform orientation and script detection. The different configuration parameters for Tesseract are mentioned below:
Page Segmentation Mode (–psm): By configuring this, you can assist Tesseract in how it should split an image in the form of texts. The command-line help has 11 modes. You can choose the one that works best for your requirement from the table given below…read more.