wasdac9/aadhaar-ocr

aadhaar-ocr

Extract Aadhaar card details such as Name, Date of Birth, Gender, Mobile No., Aadhaar No. (UID), and Address using Tesseract OCR.

Requirements

  1. opencv-python 4.5.3.56 or above
  2. pytesseract 0.3.8 or above
  3. spacy 3.2.1 or above
  4. numpy 1.20.0 or above
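
A quick way to see which of these packages are already installed is to query the package metadata. The snippet below is a minimal sketch that uses only the standard library; the helper names `installed_version` and `report_versions` are illustrative and are not part of this repository. It reports versions but does not compare them against the minimums listed above.

```python
from importlib import metadata

# Distribution names as they appear in the Requirements list.
REQUIRED = ["opencv-python", "pytesseract", "spacy", "numpy"]


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(dist_names=REQUIRED):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in dist_names}
```

Calling `report_versions()` returns a dictionary such as `{"numpy": "<version>", ...}`, with `None` for anything that still needs to be installed.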

Downloading Tesseract OCR

In addition to the requirements above, you also need the Tesseract OCR engine.

Download Tesseract OCR for Windows

  1. 32-bit version: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v5.0.1.20220118.exe
  2. 64-bit version: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.1.20220118.exe

More info at https://github.com/UB-Mannheim/tesseract/wiki

Note: Download the file and extract the contents of the file. Keep a note of the path to tesseract.exe (e.g. Desktop\Tesseract\tesseract.exe).

Goal of Project

Aadhaar card details are required in places like banks, motor training schools, and school and college admissions offices. Such places generally ask for a photocopy of the Aadhaar card and also ask you to fill in paperwork related to the visit (for example, an admission form at a motor training school). Many fields in these forms ask for details that are already printed on the Aadhaar card.

The goal of this project is to extract Aadhaar card details using computer vision and Optical Character Recognition (OCR). Instead of being filled in by hand again, these details can be acquired automatically from images of the front and back of the Aadhaar card. This can reduce spelling mistakes, mistyped numbers, and other invalid information.

Project Info

Extract details such as Name, Date of Birth, Gender, Mobile No., Aadhaar No., and Address directly from Aadhaar card images using OCR. You need two images of your Aadhaar card: the first should be the front side and the second should be the back side.
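
To illustrate how some fields could be pulled out of raw OCR text, here is a minimal sketch based on regular expressions. It is not necessarily how this repository does it. The patterns assume the Aadhaar number is printed as 12 digits in groups of four, the date of birth as DD/MM/YYYY, and the gender as a standalone word; the function name `parse_front_text` and the dictionary keys are hypothetical, and real OCR output is usually noisier than the sample.

```python
import re

# Assumed patterns; real cards and OCR output vary.
UID_RE = re.compile(r"\b(\d{4})\s?(\d{4})\s?(\d{4})\b")
DOB_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
GENDER_RE = re.compile(r"\b(male|female|transgender)\b", re.IGNORECASE)


def parse_front_text(text):
    """Pick the Aadhaar number, date of birth, and gender out of OCR text.

    Fields that cannot be found are returned as None.
    """
    uid = UID_RE.search(text)
    dob = DOB_RE.search(text)
    gender = GENDER_RE.search(text)
    return {
        "aadhaar_no": "".join(uid.groups()) if uid else None,
        "date_of_birth": dob.group(1) if dob else None,
        "gender": gender.group(1).capitalize() if gender else None,
    }


# Made-up sample text, not taken from a real card.
sample = "Government of India\nExample Name\nDOB: 01/01/1990\nMALE\n1234 5678 9012"
details = parse_front_text(sample)
```

Returning `None` for missing fields keeps the result easy to serialize later, since `None` becomes `null` in JSON.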

Code

Set the path to tesseract.exe (found in the location where you placed the Tesseract OCR engine, e.g. Desktop\Tesseract\tesseract.exe) and the paths to the Aadhaar front and back images.

main.py

In main.py set the following paths

tesseract_path = Path("<path/to/tesseract.exe>")  # set tesseract.exe path
aadhaar_front_img_path = Path("<path/to/aadhaar_front_image>")  # set aadhaar front image path
aadhaar_back_img_path = Path("<path/to/aadhaar_back_image>")  # set aadhaar back image path
pytesseract.pytesseract.tesseract_cmd = str(tesseract_path)  # pass the command as a plain string
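
A mistyped path is an easy mistake to make at this step, so it can help to check the paths before running OCR. The following is a minimal sketch using only `pathlib`; the helper name `missing_paths` is illustrative and does not come from main.py.

```python
from pathlib import Path


def missing_paths(*paths):
    """Return the given paths that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]
```

For example, `missing_paths(tesseract_path, aadhaar_front_img_path, aadhaar_back_img_path)` returns an empty list when all three files exist, and otherwise lists the ones that need fixing.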

After setting the paths, you can tweak optional values such as fx and fy to resize the original image.

# Resize image (fx=0.5, fy=0.5 halves the original size; fx=2, fy=2 doubles it)
img = cv2.resize(img, (0, 0), fx=0.5, fy=0.5)
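
If you would rather aim for a particular image width than guess a factor, the factor can be computed from the image dimensions. This is a small sketch under that assumption; `scale_for_width` is a hypothetical helper, not something main.py defines.

```python
def scale_for_width(original_width, target_width):
    """Return the factor to use for both fx and fy so the width becomes about target_width."""
    if original_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    return target_width / original_width
```

With an OpenCV image, the original width is `img.shape[1]`, so the call would look like `s = scale_for_width(img.shape[1], 1000)` followed by `cv2.resize(img, (0, 0), fx=s, fy=s)`. Using the same factor for fx and fy preserves the aspect ratio.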

Now you can run main.py

Running main.py

First image pop-up

When you run main.py, a grayscale image of the Aadhaar front side opens first. Choose four points to crop the image so that only the data part of the image is kept.

The order of choosing the points

Order : TopLeft(1)=>TopRight(2)=>BottomLeft(3)=>BottomRight(4)

In the example image below, the points are marked in red with their order (try to choose similar points; only the data part of the image is needed). To mark a point, move the cursor to the location and click the left mouse button.

Note: Since the image shown by main.py is grayscale, the points you mark appear white.

The four points need not form an exact rectangle. Any quadrilateral works, because the code applies a perspective transform to the image.
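
To show why any quadrilateral works, here is a minimal sketch of a four-point perspective transform with OpenCV. It assumes the points arrive in the order described above (TopLeft, TopRight, BottomLeft, BottomRight); the function names are illustrative, and main.py may size or order the destination rectangle differently.

```python
import math


def destination_size(points):
    """Width and height of the output rectangle for four clicked points.

    points are (x, y) tuples in the order TopLeft, TopRight, BottomLeft, BottomRight.
    The longer of each pair of opposite sides is used so no content is squeezed.
    """
    tl, tr, bl, br = points
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return int(round(width)), int(round(height))


def warp_to_rectangle(img, points):
    """Map the selected quadrilateral onto an upright rectangle."""
    import cv2  # imported here so destination_size works without OpenCV
    import numpy as np

    width, height = destination_size(points)
    src = np.float32(points)
    # Destination corners follow the same TL, TR, BL, BR order as the clicks.
    dst = np.float32([[0, 0], [width, 0], [0, height], [width, height]])
    matrix = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, matrix, (width, height))
```

The key detail is that the source and destination corners are listed in the same order; clicking the points in a different order would produce a flipped or twisted crop.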

[Image: example of the four points marked on the Aadhaar front image]

Second image pop-up

The second image window shows the Aadhaar back image.

Choose the points in the same order as before, and crop the image similar to the image below (only the address is needed, excluding the "Address :" part).

[Image: example of the four points marked on the Aadhaar back image]

Output

The details extracted using OCR are stored in a JSON file named aadhaar_info_.json in the same root directory. Values that OCR did not find are set to null in the JSON.
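
The null behaviour falls out naturally from Python's `json` module, which serializes `None` as `null`. The sketch below illustrates this; the key names in `FIELDS` and the helper `to_json` are hypothetical, and the actual keys written by main.py may differ.

```python
import json

# Hypothetical key names, one per detail listed in this README.
FIELDS = ["name", "date_of_birth", "gender", "mobile_no", "aadhaar_no", "address"]


def to_json(details):
    """Serialize extracted details, filling every missing field with null."""
    record = {field: details.get(field) for field in FIELDS}
    return json.dumps(record, indent=4, ensure_ascii=False)
```

For example, `to_json({"name": "Example Name"})` produces a JSON object in which `name` is set and the remaining five keys are `null`.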

Summary

  1. Tesseract OCR is an open source OCR engine, originally developed at Hewlett-Packard and later sponsored by Google.
  2. In this project, the accuracy of Tesseract OCR on Aadhaar card images was found to be unreliable.
  3. Tesseract OCR requires a lot of pre-processing of the image to get good results.
  4. This code uses Named Entity Recognition (NER) to find the name of the Aadhaar card holder from the image, but during experimentation NER was found to work poorly at detecting Indian names in the string generated by OCR.
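
For context on the NER step, this is a minimal sketch of looking for person names with spaCy. It is an illustration, not this repository's code: the model name `en_core_web_sm` is an assumption (it has to be downloaded separately with `python -m spacy download en_core_web_sm`), and the function names are hypothetical.

```python
def person_names(nlp, text):
    """Return the text of every entity the pipeline labels PERSON."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; the default model name is an assumed choice."""
    import spacy  # imported here so person_names can be used with any pipeline object

    return spacy.load(model_name)
```

A call such as `person_names(load_pipeline(), ocr_text)` returns whatever the model tags as a person. Because general-purpose English models are trained mostly on news and web text, short OCR fragments containing Indian names may be missed or mislabeled, which matches the observation in point 4.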

Future Work

  1. Using a different method or improving NER for detecting names from aadhaar card image.
  2. A better cropping system, such as an 8-edge image cropper, could be implemented; it would be more user-friendly than selecting four dots on the image.
  3. Similar projects can be made to extract details from other government cards like PAN card, Driving License, etc.
  4. A mobile app could be built on this project and deployed on a mobile device that has a camera, flash, and enough computation power to carry out OCR.