Text Extraction from Images using OCR and Python
Humans can read the text in an image at a glance, but machines cannot. A program can only work with text once it is in a structured, machine-readable form. That is where OCR (Optical Character Recognition) comes in. In this post we use the Tesseract OCR engine with Python to recognize the text in an image.
- Tesseract OCR
Tesseract extracts text from images. It is an open-source text recognition engine, originally developed at Hewlett-Packard and later sponsored by Google. It is one of the most popular engines for text extraction and can give accurate results on clean, well-scanned text. Wrappers are available for many programming languages, so you can call it from the language or framework of your choice.
- Python
Python is one of the most widely used programming languages for OCR work like this, and it is also open source.
Requirements
Download and Install:
- Python from https://python.org/
- Tesseract from https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-setup-3.02.02.exe/ (for Windows)
Packages installation
Note: the 'python' command may be 'py' (or similar) if you have multiple versions of Python installed.
python -m pip install pytesseract
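After installing, it can help to confirm that the Tesseract executable is where Python will look for it. The sketch below is one possible way to do that using only the standard library. The helper name find_tesseract and the list of candidate paths are assumptions made for illustration, not part of pytesseract, and the paths are common default install locations rather than an exhaustive list.

```python
import shutil
from pathlib import Path

# Assumed candidate locations (hypothetical list for illustration);
# adjust these to match your own installation.
COMMON_WINDOWS_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=COMMON_WINDOWS_PATHS):
    """Return the path to the tesseract executable, or None if not found."""
    # First, check whether tesseract is reachable through the PATH variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise, fall back to the candidate install locations.
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None
```

If find_tesseract() returns a path, that value can be assigned to pytesseract.pytesseract.tesseract_cmd. If it returns None, the engine is probably not installed or lives in a different folder.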
Implementation:
- First, import the pytesseract package
import pytesseract
- Configure the path to the installed Tesseract engine
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
- Assign the path of an image that contains text
image_path = "bill1.jpg"
- Call the image_to_string method to extract the text from the image. It takes three arguments here: the image path, the language of the text, and a configuration string (--psm 6 selects page segmentation mode 6, which treats the image as a single uniform block of text)
textContentFromImage = pytesseract.image_to_string(image_path, lang='eng', config='--psm 6')
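The steps above can also be wrapped in a small function so that a missing image is reported clearly before Tesseract is invoked. This is a minimal sketch, not the only way to structure it. The names build_config and extract_text are hypothetical helpers introduced here. Only pytesseract.image_to_string and its lang and config parameters come from the library.

```python
from pathlib import Path


def build_config(psm=6):
    """Return a Tesseract config string such as '--psm 6'."""
    return f"--psm {int(psm)}"


def extract_text(image_path, lang="eng", psm=6):
    """Run OCR on the image at image_path and return the recognized text."""
    # Fail early with a clear message if the image file does not exist.
    if not Path(image_path).is_file():
        raise FileNotFoundError(f"Image not found: {image_path}")
    # Imported lazily so the path check above works even when
    # pytesseract is not installed yet.
    import pytesseract
    return pytesseract.image_to_string(
        image_path, lang=lang, config=build_config(psm)
    )
```

Calling extract_text("bill1.jpg") would then behave like the single image_to_string call shown in the last step, assuming the Tesseract path has already been configured.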
Source code:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image_path = r"C:\Users\suresh\Desktop\image1.jpg"
textContentFromImage = pytesseract.image_to_string(image_path, lang='eng', config='--psm 6')
print(textContentFromImage)
Save the above code in a text file with the .py extension, for example imageExtraction.py
Then run the Python script with:
python imageExtraction.py
Sample input image (image1.jpg):
Result:
DON’ T
STOP
UNTIL
YOU’RE
PROUD
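Tesseract returns one line of text per line it detects in the image, often with trailing whitespace. If a single line is more convenient, the raw string can be tidied with plain Python. The helper tidy_ocr_text below is a hypothetical name used for illustration, and the sample string simply repeats the result shown above.

```python
def tidy_ocr_text(raw_text):
    """Collapse multi-line OCR output into one line of space-separated words."""
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in raw_text.splitlines()]
    non_empty = [line for line in lines if line]
    # Collapse repeated whitespace inside each line, then join the lines.
    return " ".join(" ".join(line.split()) for line in non_empty)


sample = "DON’ T\nSTOP\nUNTIL\nYOU’RE\nPROUD\n"
print(tidy_ocr_text(sample))  # DON’ T STOP UNTIL YOU’RE PROUD
```

Note that this only normalizes whitespace between lines and words. It does not try to correct recognition errors such as the stray space in "DON’ T".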
Some other OCR tools:
- Amazon Textract
- Google Vision API