OCR/OCV – reading clear text with a camera

OCR is the abbreviation for Optical Character Recognition, and OCV stands for Optical Character Verification. In other words, OCR reads clear (human-readable) text, while OCV checks that printed text matches the expected content. Machine reading used to require a specific font. A good example is passports, whose machine-readable line had to be printed in exactly such a font. For general text, this is no longer necessary. OCR systems have been developed considerably in recent years, so that things are possible today that would have been unthinkable some time ago. For common, standardized fonts such as those shipped with Windows, OCR can often be used on documents without prior training. Even narrow proportional fonts can be read. A modern OCR system can also recognize the layout of a text, so that even multi-column documents can be processed automatically.

What actually is OCR?

Optical character recognition (OCR) is a technology that converts various kinds of documents into searchable and editable files. These can be, for example, PDF files, paper documents or digital images. If you want to extract relevant information from a brochure, a newspaper article or a contract, for example to reproduce it in Word format or to edit it in an Excel file, a scanner alone is not enough. The scanner only outputs a copy, that is, an image of the document. This image is a collection of pixels (picture elements), which can be white, black or colored. Tables and graphics in the document are likewise captured only as pixels.

OCR software is needed to read and process these documents. It turns scanned documents, PDFs or digital images into words and sentences, so that the information can be stored in readable, searchable form and processed further.

Text recognition in practice

Most optical input devices, such as digital cameras, scanners or fax machines, can only output raster graphics. That is, they deliver dots arranged in rows and columns, the so-called pixels, each with its own color. In text recognition, however, letters must be recognized as letters. Each one has to be identified so that it can be assigned the numerical value defined for it by a text encoding such as Unicode or ASCII.
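The last step, assigning a numerical value to each identified character, can be illustrated with a few lines of Python. This is only a minimal sketch of the encoding idea, not part of any particular OCR product; the sample strings are chosen purely for illustration.

```python
# Sketch: once a shape has been identified as a character,
# it is stored as a number defined by a text encoding.
text = "OCR"

# Unicode code point of each recognized character
code_points = [ord(ch) for ch in text]

# For these characters, ASCII and UTF-8 yield the same byte values
ascii_bytes = list(text.encode("ascii"))
utf8_bytes = list(text.encode("utf-8"))

# A character outside ASCII still has a Unicode code point,
# but it needs more than one byte in UTF-8
umlaut = "ä"
umlaut_code_point = ord(umlaut)
umlaut_utf8 = list(umlaut.encode("utf-8"))

print(code_points)                     # [79, 67, 82]
print(umlaut_code_point, umlaut_utf8)  # 228 [195, 164]
```

The pixels themselves carry no such numbers. Only after recognition does the letter O become the value 79, which is what makes the text searchable and editable.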

In German usage, the terms OCR and automatic text recognition are often used as synonyms. Strictly speaking, this is incorrect. From a technical point of view, OCR describes only the recognition of individual characters in separated parts of an image. It is preceded by structure recognition: first, text blocks are separated from graphic elements; then the line structure is detected and the individual characters are isolated. Which character a given shape represents is finally decided by algorithms that can also take the linguistic context into account.
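The line and character separation described above can be sketched with projection profiles: rows that contain ink form text lines, and within each line, columns that contain ink form characters. This is one simple, well-known approach and not necessarily what a given OCR product does. The tiny image format (a list of equal-length strings with '#' for ink) and the names runs, segment and PAGE are assumptions made for this illustration.

```python
def runs(flags):
    """Return (start, end) pairs for each consecutive run of True values."""
    result, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            result.append((start, i))
            start = None
    if start is not None:
        result.append((start, len(flags)))
    return result


def segment(image):
    """Split a binary image into text lines, then into character boxes.

    image: list of equal-length strings, '#' marks an ink pixel.
    Returns one list per text line; each entry is (top, bottom, left, right)
    with exclusive bottom and right bounds.
    """
    row_has_ink = ["#" in row for row in image]
    lines = []
    for top, bottom in runs(row_has_ink):
        band = image[top:bottom]
        width = len(band[0])
        col_has_ink = [any(row[x] == "#" for row in band) for x in range(width)]
        lines.append([(top, bottom, left, right)
                      for left, right in runs(col_has_ink)])
    return lines


# Two text lines: the first holds two shapes, the second holds one
PAGE = [
    "........",
    ".#..##..",
    ".#..#.#.",
    "........",
    ".##.....",
    ".##.....",
]

boxes = segment(PAGE)
print(boxes)
# [[(1, 3, 1, 2), (1, 3, 4, 7)], [(4, 6, 1, 3)]]
```

Each resulting box would then be handed to the actual character classifier. Touching or overlapping characters, skewed scans and multi-column layouts need more elaborate methods than this sketch.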

Previously, automatic text recognition required specially designed fonts. Many readers will remember the bottom line on a check form. Its font was designed so that a special OCR reader could distinguish and read the characters very quickly and without much computing effort. The font was called OCR-A, and its hallmark was that very similar characters, such as the zero and the capital O, were drawn so that they could no longer be confused. OCR-B, in contrast, resembles an ordinary non-proportional sans-serif font. OCR-H, on the other hand, was modeled on handwritten letters and numbers. Because modern computers are increasingly powerful and algorithms have improved, even ordinary printed fonts and handwriting can now be recognized.

This is what modern OCR software can do

Modern text recognition software can perform context analysis. With the help of ICR (Intelligent Character Recognition), the result can be corrected: a character originally recognized as the digit 8, for example, is automatically converted into a B because it sits inside a word. In this way, 8uchstaben becomes Buchstaben (German for "letters").
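A strongly simplified version of such a context correction might look like the following sketch. Real ICR systems typically rely on dictionaries or language models; the look-alike table, the 3:1 letter-to-digit threshold and the function names here are hypothetical and chosen only for illustration.

```python
# Hypothetical look-alike table: digits that are easily confused with letters
LOOKALIKES = {"8": "B", "0": "O", "1": "l", "5": "S"}


def correct_token(token):
    """Replace look-alike digits if the token otherwise reads as a word."""
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    # Treat the token as a word only when letters clearly dominate;
    # numbers such as "2024" or codes such as "A4" are left alone.
    if digits == 0 or letters < 3 * digits:
        return token
    return "".join(LOOKALIKES.get(ch, ch) for ch in token)


def correct_text(text):
    """Apply the correction to every whitespace-separated token."""
    return " ".join(correct_token(tok) for tok in text.split())


print(correct_text("8uchstaben und 88 Seiten"))
# Buchstaben und 88 Seiten
```

The digit inside the word is replaced, while the standalone number 88 is kept, because there the context suggests a real number.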

Text recognition is used mainly by larger companies, for example to process incoming mail automatically. Here, documents need to be sorted as they arrive. This task does not require analyzing the entire content. It is usually sufficient to distinguish documents by rough characteristics, such as a specific layout of invoices or forms, a company logo or other distinctive features. Classification then takes place via pattern recognition that is applied to defined regions rather than to the entire document.
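As a rough illustration of sorting by a defined passage, the sketch below classifies a document from the recognized text of its header region only. It assumes the header has already been read by an OCR step; the document types, keywords and the function name classify are hypothetical. Production systems would also use layout features or logos, which are not modeled here.

```python
# Hypothetical rules: keywords expected in the header region of each type
RULES = {
    "invoice": ("invoice", "invoice no", "vat id"),
    "order": ("purchase order", "order no"),
    "application": ("application", "curriculum vitae"),
}


def classify(header_text, rules=RULES, default="unsorted"):
    """Pick the document type whose keywords occur most often in the header."""
    text = header_text.lower()
    scores = {label: sum(kw in text for kw in keywords)
              for label, keywords in rules.items()}
    best = max(scores, key=scores.get)
    # Without any keyword hit, the document goes to manual sorting
    return best if scores[best] > 0 else default


print(classify("Invoice No. 4711, ACME GmbH"))  # invoice
print(classify("Dear Sir or Madam"))            # unsorted
```

Because only a small region is evaluated, such a rule set is cheap to run, which matches the point that the full content does not have to be analyzed for sorting.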

Advantages of OCR

OCR is used primarily to save time and costs when creating, processing and reusing a wide variety of documents. With OCR software, a scanned paper document can later be edited in a Word document or an Excel file, for example, and then forwarded. It is also possible to take passages from journals and books and use them in your own documents, working papers and studies without having to type them out.

Even on the road, a simple cell phone camera is now enough to capture text from timetables, posters or banners and use the resulting information in a document. The same applies to passages from books and paper documents when no scanner is available. In addition, the software can be used to create searchable archives. Modern programs work so quickly that the conversion takes only a few seconds.
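How recognized text makes an archive searchable can be sketched with a small inverted index, which maps every word to the documents containing it. This is a generic technique and not a description of a specific OCR program; the document names and texts below are made up for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical OCR results for two scanned pages
archive = {
    "scan_001": "Timetable for line 42",
    "scan_002": "Invoice for line rental",
}
index = build_index(archive)

print(sorted(search(index, "line")))            # ['scan_001', 'scan_002']
print(sorted(search(index, "timetable line")))  # ['scan_001']
```

Without the OCR step, the scans would remain plain images, and there would be no words to put into such an index.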
