Text Extraction from Images using OCR and Python

December 21, 2020

saadhvi

Text Extraction from images with Tesseract OCR and Python

Humans can read the text in an image at a glance, but machines cannot. A program can only work with content that is in a structured, machine-readable form. Optical Character Recognition (OCR) bridges that gap by converting the text in an image into characters a program can process. In this post we use the Tesseract OCR engine with Python to recognize the text in an image.

Tesseract OCR

Tesseract extracts text from images. It is an open source text recognition engine, originally developed at Hewlett-Packard and later sponsored by Google. It is one of the most popular engines for text extraction and often gives highly accurate results. It can be called from many programming languages and frameworks through wrapper libraries.

Python

Python is one of the most frequently used programming languages for OCR work like this, and it is also open source.

Requirements

Download and install:

Python from https://python.org/
Tesseract from https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-setup-3.02.02.exe/ (for Windows)

Packages installation

Note: the 'python' command may be 'py' or similar if you have multiple versions of Python installed.

python -m pip install pytesseract
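After the install finishes, it can be useful to confirm that the packages are visible to the interpreter you plan to run. The following is a minimal sketch, not part of the original steps: the helper name missing_packages is hypothetical, and it assumes that Pillow (imported as PIL) was pulled in as a dependency of pytesseract.

```python
import importlib.util


def missing_packages(names=("pytesseract", "PIL")):
    """Return the names from `names` that cannot be found in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, re-run the pip command with the same interpreter ('python' or 'py') that you use to run the script.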

Implementation:

First, import the pytesseract package:

import pytesseract

Configure the path of the installed Tesseract engine:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
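The hard-coded path only works if Tesseract was installed in that exact folder. As a hedged alternative, the sketch below first looks for a tesseract executable on the system PATH and only then falls back to the Windows path used in this post. The function name find_tesseract and the constant DEFAULT_WINDOWS_PATH are hypothetical names introduced for illustration.

```python
import os
import shutil

# Default Windows install location used in this post; adjust if yours differs.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return a path to the tesseract executable, or None if none is found."""
    # Prefer whatever is on PATH (typical on Linux and macOS).
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise use the fallback, but only if the file really exists.
    return fallback if os.path.isfile(fallback) else None
```

When the function returns a path, it can be assigned to pytesseract.pytesseract.tesseract_cmd exactly as in the line above. When it returns None, Tesseract is probably not installed or is in a different folder.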

Assign the path of an image that contains text:

image_path = "bill1.jpg"

Call the image_to_string method to extract the text from the image. It takes three arguments here: the image path, the language of the text, and a configuration string. The value '--psm 6' sets the page segmentation mode to 6, which tells Tesseract to assume a single uniform block of text.

textContentFromImage = pytesseract.image_to_string(image_path, lang='eng', config='--psm 6')
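The configuration argument is just a string of Tesseract command-line options, so it can be built programmatically when you want to try different settings. Here is a small sketch under a few assumptions: the helper name build_config is hypothetical, the 0 to 13 range for the page segmentation mode reflects recent Tesseract releases (older versions may accept fewer modes), and the optional --oem engine mode flag is likewise only available in newer versions.

```python
def build_config(psm=6, oem=None):
    """Build a Tesseract config string such as '--psm 6'."""
    # Assumption: recent Tesseract releases accept page segmentation modes 0-13.
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    parts = [f"--psm {psm}"]
    # The OCR engine mode is optional; leave it out to use Tesseract's default.
    if oem is not None:
        parts.append(f"--oem {oem}")
    return " ".join(parts)
```

For example, build_config() gives the same '--psm 6' string used above, and the result can be passed as the config argument of image_to_string.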

Source code:

import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image_path = r"C:\Users\suresh\Desktop\image1.jpg"

textContentFromImage = pytesseract.image_to_string(image_path, lang='eng', config='--psm 6')

print(textContentFromImage)

Save the above code in a text file with the .py extension, for example imageExtraction.py.

Then run the Python script with:

python imageExtraction.py
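If the image path is wrong or Tesseract cannot be located, the script above stops with a traceback. One possible way to make it friendlier is to wrap the call in a function with a couple of checks. This is only a sketch: the function name extract_text is hypothetical, and the error messages are my own wording rather than anything produced by pytesseract.

```python
import os


def extract_text(image_path, lang="eng", config="--psm 6"):
    """Return the text Tesseract recognizes in the image at image_path."""
    # Fail early with a clear message if the image file does not exist.
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")

    # Imported here so the file check above works even without pytesseract.
    import pytesseract

    try:
        return pytesseract.image_to_string(image_path, lang=lang, config=config)
    except pytesseract.TesseractNotFoundError as exc:
        # Raised by pytesseract when the tesseract executable cannot be run.
        raise RuntimeError(
            "Tesseract executable not found; check tesseract_cmd or your PATH"
        ) from exc
```

On Windows you would still set pytesseract.pytesseract.tesseract_cmd before calling extract_text, as shown in the source code above.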

Sample input image (image1.jpg):

Result:

DON’ T

STOP

UNTIL

YOU’RE

PROUD
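Raw OCR output often contains blank lines and uneven spacing. If you want tidier text, a small cleanup step can be applied to the returned string. The helper below is an illustrative sketch with a hypothetical name, clean_ocr_text, and it only normalizes whitespace. It does not repair misrecognized characters, such as the stray space inside the first word of the result above.

```python
def clean_ocr_text(text):
    """Drop blank lines and collapse repeated whitespace in OCR output."""
    cleaned = []
    for line in text.splitlines():
        # split() with no arguments removes leading, trailing and repeated spaces.
        line = " ".join(line.split())
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

In the script, this would be called as print(clean_ocr_text(textContentFromImage)).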

Some other OCR tools:

Amazon Textract
Google Vision API
