Tesseract OCR on Identity Documents

Gulfaraz Rahman
5 min read · Feb 6, 2021

Last week, I received a request to transcribe 21,000 passports and national identity documents. My lack of patience and of any passion for reading identity cards for hours on end drove me to write a script to handle this tedious task.

The fun began immediately when I browsed the dataset:

  1. The documents are from multiple countries and so contain text from different languages.
  2. The dataset is made of photographs of the documents and not scans with consistent dimensions.

The Dataset

A photograph of a hand holding a Dutch driving license.
An example input image from which to extract text. Source

The photograph of a driving license from the Netherlands is a good sample from the dataset: the desired information is readable, but the image also has artefacts such as the background and the hand holding the card. The card itself is realistic, as it would be used in real life, and not an enhanced copy for digital use. The language of the card is Dutch rather than English, although this should not be a problem since the target texts from this datapoint are proper names, dates and numbers.

For the above input image, a solution is considered successful if it extracts the following information, which I refer to as the target text,

Mesters, Lambertus H, 05.01.1979, Eindhoven, 31.01.2017, 31.01.2027, Gemeente Waaire, 5740344641, AM-BE, D1NLD1574034464194NB44MK362D54

I cannot do a coordinate-based crop because the position of the card in the next input image could be different, so I have to build a more robust solution that works for reasonable variations. My weapon of choice for the transcription problem is OCR, or Optical Character Recognition.

What is OCR?

Optical character recognition or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo or from subtitle text superimposed on an image. Wikipedia

OCR is a technology which recognizes text in an image. If I apply OCR correctly, I should be able to extract the target text from the input image.

How to OCR?

To apply OCR, I need some useful libraries for two reasons,

  1. To read and manipulate the input image. For this, I use OpenCV and NumPy.
  2. I want to avoid building my own OCR model, so I use pytesseract.

I use Python but this solution can also be written in other programming languages like JavaScript.
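
To make the setup concrete, here is a minimal sketch of the imports and of reading the photograph; the file name is a placeholder, not a path from the original script.

```python
import cv2          # OpenCV: read and manipulate the input image
import numpy as np  # NumPy: array operations on the image pixels
import pytesseract  # thin wrapper around the Tesseract OCR engine

# Placeholder file name; substitute a photograph from your own dataset.
image = cv2.imread("driving_license_photo.jpg")
if image is None:
    raise FileNotFoundError("Could not read the input image")

height, width = image.shape[:2]
print(f"Loaded a {width}x{height} photograph, mean pixel value {np.mean(image):.1f}")
```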

My Algorithm

  1. Read input image
  2. Crop document from image photograph
  3. Perform OCR on the cropped image
  4. Output predicted text

I crop the document from the photograph to remove the pixels of the background and hand which are not needed to predict the target text. The document detection is done via OpenCV and the cropped image looks like,

A photograph of the Dutch driving license from the input image.
A cropped version of the input image.
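
The post does not show the detection code itself, so the function below is only a minimal sketch of one common OpenCV approach: threshold the photograph, find the largest contour and crop to its bounding box. The original script may detect the document differently (for example with perspective correction), and the file names are placeholders.

```python
import cv2

def crop_document(image):
    """Crop the photograph down to the document, assuming it is the
    largest bright region in the frame. A rough sketch only; real
    photographs may need perspective correction and better filtering."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Otsu's threshold separates the light card from the darker background.
    _, thresh = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return image  # nothing detected; fall back to the full photograph
    largest = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(largest)
    return image[y:y + h, x:x + w]

cropped = crop_document(cv2.imread("driving_license_photo.jpg"))  # placeholder path
cv2.imwrite("cropped_license.jpg", cropped)
```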

The OCR part is only one line of code because we use the pytesseract Python package.
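
That one line is essentially a call to pytesseract's image_to_string; the eng+nld language setting here anticipates the experiments described further down and is the only configuration I am assuming.

```python
import cv2
import pytesseract

cropped = cv2.imread("cropped_license.jpg")  # placeholder: the cropped document
predicted_text = pytesseract.image_to_string(cropped, lang="eng+nld")
print(predicted_text)
```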

And now for the moment of truth (drumroll, please). I present to you the predicted text,

WA RIBEWIS 222 Le Mesters i! Lam 05.01.1979 Eindhoven 4a 31.01.2017 4» 31.01.2027 hi ac Gemeente Waaire 5740344641 AM-BE Fe
Text recognized by pytesseract on the input image. Words in bold are those which match the target text.

The words in bold match the target text. The predicted text has some misspelt words; however, it has recognized all of the target text, at least partially.

Experiments

I ran the following experiments to close the gap between the predicted text and the target text. I'll spare you the details to avoid information overload and only present my conclusions,

  1. I converted the input image to grayscale because I believed the colour of the text did not matter. I was wrong: the predictions on the monochrome version of the input image had more mistakes.
  2. I applied OpenCV's simple thresholding to the input image. The predictions got worse for some variations and never better.
  3. Tesseract provides options for which OCR Engine Mode (OEM) to use when making predictions. I used the default OEM.
  4. Tesseract's language support is excellent: it can recognize more than 100 languages “out of the box”. I used eng+nld as the language setting.
  5. The page segmentation mode (PSM) setting gave me the “aha” moment of this project. Tesseract, by default, assumes that the input image contains a page of text. The PSM option allowed me to tell Tesseract to look for sparse words in no particular order instead of full sentences.
  6. Each prediction is accompanied by metadata which can help with post-processing. I filtered the predicted text on the prediction confidence to improve the prediction score, using the Jaccard index as the score. A confidence threshold of 70% yielded the highest score (a sketch of this filtering follows below).

Mesters Eindhoven 31.01.2017 31.01.2027 ac Gemeente Waaire AM-BE
Text recognized by pytesseract on the input image with confidence > 70%. Words in bold are those which match the target text.
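
Items 5 and 6 can be sketched with pytesseract's image_to_data call, which returns a per-word confidence value. The PSM value of 11 (sparse text in no particular order), the Jaccard implementation and the target-word set below are my assumptions for illustration, not code lifted from the original script.

```python
import cv2
import pytesseract
from pytesseract import Output

# A subset of the target text from this article, used only for scoring.
TARGET_WORDS = {
    "Mesters", "Lambertus", "05.01.1979", "Eindhoven", "31.01.2017",
    "31.01.2027", "Gemeente", "Waaire", "5740344641", "AM-BE",
}

def jaccard_index(predicted, target):
    """Jaccard index: intersection over union of two word sets."""
    predicted, target = set(predicted), set(target)
    if not predicted and not target:
        return 1.0
    return len(predicted & target) / len(predicted | target)

image = cv2.imread("cropped_license.jpg")  # placeholder path

# PSM 11 asks Tesseract for sparse text in no particular order;
# OEM 3 is the default engine mode.
config = "--oem 3 --psm 11"
data = pytesseract.image_to_data(image, lang="eng+nld", config=config,
                                 output_type=Output.DICT)

# Keep only words whose prediction confidence exceeds 70%.
confident_words = [
    word for word, conf in zip(data["text"], data["conf"])
    if word.strip() and float(conf) > 70
]

print(" ".join(confident_words))
print("Jaccard index:", jaccard_index(confident_words, TARGET_WORDS))
```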

At this point, I paused my effort to re-evaluate the problem I was trying to solve. Words were being detected, but only partially and without being identifiable: there was no way to tell which word is the first name or which date is the date of birth. When I ran my script on other data points, the performance was not consistent.

Despite the promising results, there was, and is, no way to use this solution with practical expectations.

What kept nagging me was that I have used services where I upload my passport scans and they detect all my information correctly. The scans I make for those services are more constrained, and the text in them is unambiguous.

Reduce Complexity

I found a control image to test the efficacy of my script.

A well-oriented and clear scan of a Dutch national identity card.
The Wikipedia image for its entry on the Dutch identity card. Source

Let’s do this one more time… drumroll, please.

A much more comprehensive list of words which appear to belong on a national identity document.
Predictions on the control image.

The misspelling issue persists, but much less so and not on the target text: an improvement in comparison to my first attempt.

Unfortunately, the dataset which needs to be transcribed is not of the same quality as the control image, which makes my script a failed attempt, though it was a lot of fun to try.

My Code

If you would like to give it a try, and I hope it works better on your dataset, here is my script,

The scan.py from my GitHub account. Source

Please do share if you find a more robust solution, and feel free to provide feedback.

Thank you for your time and attention.
