Text recognition (Optical Character Recognition)

What is text recognition?

Optical Character Recognition (OCR) converts analogue text into editable digital text. For example, a printed form is scanned and converted by the OCR software into a text document on the computer, which can then be searched, edited and saved.

Under good conditions, such as a clean scan of printed text, modern OCR correctly recognises more than 99% of the characters. Words that cannot be recognised reliably are flagged by the program so that the user can correct them.

To further improve the results, OCR text recognition is often supplemented with methods of context analysis (Intelligent Character Recognition, ICR for short). For example, if the text recognition software has recognised "2one", the "2" is corrected to a "Z", resulting in the output of the word "Zone", which makes sense in context.
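A context check of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product works: the confusion table `CONFUSABLE`, the tiny `VOCABULARY` and the function `correct_word` are assumptions made up for this example. A real system would use a full dictionary or a language model.

```python
# Hypothetical confusion table: digits that are easily mistaken for letters.
CONFUSABLE = {"2": "Z", "0": "O", "1": "l", "5": "S", "8": "B"}

# Tiny stand-in vocabulary (assumption); real systems use full dictionaries.
VOCABULARY = {"zone", "room", "solo", "best"}


def correct_word(word, vocabulary=VOCABULARY):
    """Return a known word reachable by swapping confusable characters.

    If the word is already known, or no known word results from the swap,
    the word is returned unchanged.
    """
    if word.lower() in vocabulary:
        return word
    candidate = "".join(CONFUSABLE.get(ch, ch) for ch in word)
    if candidate.lower() in vocabulary:
        return candidate
    return word


print(correct_word("2one"))  # Zone
print(correct_word("room"))  # room (already a known word)
print(correct_word("x9z"))   # x9z (no known word found, left unchanged)
```

The sketch replaces a character only when the result is a known word, so unknown strings such as serial numbers are left alone.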

There is also Intelligent Word Recognition (IWR), which aims to recognise cursive handwriting by reading whole words rather than individual characters.

Some examples of free and paid optical character recognition software (in alphabetical order):

  • ABBYY FineReader PDF
  • ABBYY FlexiCapture
  • Adobe Acrobat Pro DC
  • Amazon Textract
  • Docparser
  • FineReader
  • Google Document AI
  • IBM Datacap
  • Klippa
  • Microsoft OneNote
  • Nanonets
  • OmniPage Ultimate
  • PDF Reader
  • Readiris
  • Rossum
  • SimpleOCR
  • Soda PDF
  • Softworks OCR
  • Veryfi

Write an OCR text recogniser yourself with Python or C#

You can also build text recognition into your own scripts using the programming languages Python or C#. This requires the free, open-source OCR engine Tesseract, which runs on Linux and Windows, among other systems.

This approach provides a customisable text recognition solution for both scans and photos.
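As a rough sketch, Tesseract can be called from Python through its command-line tool. The example assumes that the `tesseract` executable (version 4 or newer, which uses the `--psm` option) is installed and on the PATH. The function names `build_tesseract_command` and `recognise_text` and the file name `scan.png` are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build the argument list for the Tesseract command-line tool.

    "stdout" as the output base makes Tesseract print the recognised text
    instead of writing a file. "-l" selects the language data and "--psm"
    the page segmentation mode (3 = fully automatic).
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def recognise_text(image_path, lang="eng"):
    """Run Tesseract on an image file and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    return result.stdout


# Usage (requires an installed Tesseract and an existing image file):
# print(recognise_text("scan.png", lang="eng"))
```

Wrapper libraries exist for both Python and C#; calling the command-line tool directly simply keeps the sketch free of extra dependencies.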

How does Optical Character Recognition software work?

The starting point is a raster graphic (an image copy of the text), which is created from the physical text, for example a book page, with the help of a scanner or a camera. Recognising text in a photo is usually harder than in a scan, because a scanner delivers consistently good image conditions. With a photo, the exposure and the angle at which the document was captured can cause problems, but these can be corrected through the use of AI.

After that, the OCR software works in three steps:

1. Recognition of the page and layout structure

The scanned graphic is analysed for dark and light areas. Normally, the dark areas are identified as characters to be recognised and the light areas as background.
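The separation into dark and light areas is often called binarisation. A minimal sketch, assuming the image is given as rows of grayscale values from 0 (black) to 255 (white) and using the mean brightness as a simple global threshold (an assumption for this example; real software typically uses more robust thresholding):

```python
def binarise(gray, threshold=None):
    """Map a grayscale image to 1 = dark (ink) and 0 = light (background)."""
    pixels = [value for row in gray for value in row]
    if threshold is None:
        # Assumption: the mean brightness serves as a simple global threshold.
        threshold = sum(pixels) / len(pixels)
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# A tiny 3 x 4 "page" with a dark stroke on a light background.
page = [
    [250, 245, 30, 240],
    [248, 25, 20, 251],
    [252, 247, 35, 249],
]

for row in binarise(page):
    print(row)
# [0, 0, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 1, 0]
```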

2. Pattern or feature recognition

This is followed by further processing of the dark areas to find alphabetic letters or numeric digits. The approach of the various OCR solutions differs in whether only one character, one word or a text block is recognised at a time. The characters are identified using pattern or feature recognition:

Pattern recognition: The OCR program compares each character to be checked with its database of text examples in different fonts and formats and looks for the closest matching pattern.
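A very reduced sketch of pattern recognition: each character is a small black-and-white bitmap, and the candidate is compared pixel by pixel with stored templates. The three 3 x 5 templates in `TEMPLATES` and the function names are assumptions for this example. A real database would hold many fonts and sizes.

```python
# Hypothetical template database: 3 x 5 bitmaps, "1" = dark pixel.
TEMPLATES = {
    "I": ["010", "010", "010", "010", "010"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(glyph, template):
    """Fraction of pixels that agree between two equally sized bitmaps."""
    total = sum(len(row) for row in template)
    matches = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return matches / total


def match_character(glyph, templates=TEMPLATES):
    """Return the best-matching character and its similarity score."""
    return max(
        ((char, similarity(glyph, tpl)) for char, tpl in templates.items()),
        key=lambda pair: pair[1],
    )


# A "T" with one stray dark pixel in the bottom right corner.
noisy_t = ["111", "010", "010", "010", "011"]
char, score = match_character(noisy_t)
print(char, round(score, 2))  # T 0.93
```

Because the best match wins rather than only an exact match, a single stray pixel does not prevent the character from being recognised.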

Feature recognition: The OCR program applies rules about the features of a particular letter or number. Features can be, for example, the number of angled lines, crossed lines or curves in a character.

For example, the letter "F" is described as one long vertical line and two short horizontal lines at right angles to it.
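Feature rules of this kind can be illustrated with a toy example that counts long vertical and horizontal strokes in a small bitmap. The bitmaps, the thresholds and the rule table `RULES` are assumptions chosen for this sketch. With only these two features, many letters would be indistinguishable, so real software uses far richer feature sets.

```python
def longest_run(cells):
    """Length of the longest run of consecutive dark pixels ("1")."""
    best = current = 0
    for cell in cells:
        current = current + 1 if cell == "1" else 0
        best = max(best, current)
    return best


def extract_features(glyph, min_vertical=4, min_horizontal=3):
    """Count (vertical strokes, horizontal strokes) in a bitmap."""
    columns = ["".join(column) for column in zip(*glyph)]
    vertical = sum(longest_run(column) >= min_vertical for column in columns)
    horizontal = sum(longest_run(row) >= min_horizontal for row in glyph)
    return vertical, horizontal


# Hypothetical rule table: (vertical strokes, horizontal strokes) -> letter.
RULES = {(1, 2): "F", (1, 3): "E", (1, 1): "L"}


def classify(glyph):
    """Apply the rule table; "?" means no rule matched."""
    return RULES.get(extract_features(glyph), "?")


letter_f = ["1111", "1000", "1110", "1000", "1000"]
letter_e = ["1111", "1000", "1110", "1000", "1111"]

print(extract_features(letter_f), classify(letter_f))  # (1, 2) F
print(extract_features(letter_e), classify(letter_e))  # (1, 3) E
```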

3. Encoding in the output format and error checking

Depending on the area of application and the software used, the document is saved in different formats. For example, it is output as a Word or PDF file, or saved directly in a database.

In this last step, the user also checks the result for errors and manually corrects words or characters that were not recognised.
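How output and error checking fit together can be sketched as follows. The example assumes that the recognition step delivers each word together with a confidence value between 0 and 1. The sample words, the confidence values, the threshold of 0.90 and the JSON layout are made up for illustration.

```python
import json


def export_result(words, min_confidence=0.90):
    """Serialise recognised words to JSON and flag uncertain ones for review."""
    records = [
        {
            "text": text,
            "confidence": confidence,
            "needs_review": confidence < min_confidence,
        }
        for text, confidence in words
    ]
    return json.dumps({"words": records}, indent=2)


# Hypothetical recognition result: (word, confidence) pairs.
recognised = [("Invoice", 0.99), ("No.", 0.97), ("4Z17", 0.62)]
print(export_result(recognised))
```

Only the entries marked `needs_review` have to be shown to the user, which keeps the manual checking effort small.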

How does AI support text recognition?

On the one hand, Artificial Intelligence (AI) already supports text recognition when the raster graphic is being optimised, especially with photos. If the document to be read is bent or creased, the text may be slanted or distorted, which causes problems for the OCR software during processing. With photos, poor exposure and an unsuitable shooting angle can also create bad conditions for the OCR software.

With the help of AI, the structure of the document can be "smoothed", the lighting optimised and the angle corrected, so that the image once again offers good conditions for text recognition.
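AI models for this kind of image correction are beyond the scope of a short example. The aim of the lighting correction, however, can be illustrated with a classical, non-AI technique: a linear contrast stretch that spreads the brightness values of a dim photo over the full range from 0 to 255. The function name and the sample values are assumptions for this sketch.

```python
def stretch_contrast(gray):
    """Rescale brightness linearly so the darkest pixel is 0 and the brightest 255."""
    pixels = [value for row in gray for value in row]
    low, high = min(pixels), max(pixels)
    if high == low:
        # A completely flat image has no contrast to stretch.
        return [row[:] for row in gray]
    scale = 255 / (high - low)
    return [[round((value - low) * scale) for value in row] for row in gray]


# A dim, low-contrast photo: all values lie between 100 and 151.
dim_photo = [[100, 110, 151], [120, 151, 100]]
print(stretch_contrast(dim_photo))  # [[0, 50, 255], [100, 255, 0]]
```

After stretching, dark text and light background are much further apart, which makes the later separation into dark and light areas easier.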

On the other hand, AI improves the results of text recognition itself. AI-based systems can learn from every processed text and every corrected error. In this way, recognition errors are reduced over time and the OCR software delivers steadily better results.
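The feedback loop behind this can be hinted at with a deliberately simple stand-in. The sketch below is not machine learning in the proper sense: it merely remembers user corrections and reuses the most frequent one. The class `CorrectionMemory` and the sample words are assumptions for illustration.

```python
from collections import Counter, defaultdict


class CorrectionMemory:
    """Remember user corrections and reuse the most frequent one."""

    def __init__(self):
        # Maps a misrecognised word to a counter of corrections seen for it.
        self._seen = defaultdict(Counter)

    def record(self, recognised, corrected):
        """Store one correction made by the user."""
        self._seen[recognised][corrected] += 1

    def apply(self, recognised):
        """Return the most frequent known correction, else the word unchanged."""
        if recognised not in self._seen:
            return recognised
        return self._seen[recognised].most_common(1)[0][0]


memory = CorrectionMemory()
memory.record("lnvoice", "Invoice")
memory.record("lnvoice", "Invoice")
memory.record("lnvoice", "Involve")

print(memory.apply("lnvoice"))  # Invoice
print(memory.apply("Total"))    # Total (no correction known)
```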
