We developed a model that can recognize the collection of documents contained in a PDF or image made up of multiple documents. To accomplish this, the input PDF is split into individual pages, a CNN model classifies each page into the appropriate document category, and each document's data is then extracted using OCR (optical character recognition). This is currently supported for five document types, including voter ID, driver's license, PAN, and Aadhar. Except for the front and back of the same document, the input PDF must contain a single document per page. Initially, our classification model achieved an accuracy of 0.7342 on the training set and 0.7736 on the validation set, with a training loss of 0.6923 and a validation loss of 0.8340.
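For illustration, a minimal sketch of this flow, assuming pdf2image for page splitting and a saved Keras classifier; the model file name, input size, and label order here are placeholders rather than the project's exact values:

```python
import numpy as np
from pdf2image import convert_from_path        # pip install pdf2image (needs poppler)
from tensorflow.keras.models import load_model

# Assumed class order and model file; the project's actual values may differ.
CLASS_NAMES = ["aadhar", "driving_license", "pan", "passport", "voter_id"]
model = load_model("document_classifier.h5")   # hypothetical saved model

pages = convert_from_path("input.pdf", dpi=200)   # one PIL image per PDF page
for page_num, page in enumerate(pages, start=1):
    x = np.asarray(page.resize((224, 224)), dtype="float32") / 255.0
    probs = model.predict(x[np.newaxis, ...])[0]  # add batch dimension
    print(f"Page {page_num}: {CLASS_NAMES[int(np.argmax(probs))]}")
```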
In our ongoing efforts to improve performance, we explored transfer learning with the pre-trained VGG16 and VGG19 models, applying hyperparameter tuning and adding extra layers on top of the pre-trained networks. As a result, we achieved a validation loss of 0.3677 and a validation accuracy of 0.8769 with VGG16.
In addition to this, we incorporated two more features (a combined sketch follows the list):
- Text-to-speech for accessibility:
  - Translates extracted text into spoken words.
  - Supports auditory learners and users with visual impairments.
  - Enhances accessibility and consumability.
- Text summarization:
  - Aids time-constrained users by condensing lengthy documents.
  - Uses the Hugging Face Transformers library for NLP models.
  - Provides clear and instructive document synopses.
  - Maximizes time efficiency by distilling crucial insights.
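A minimal combined sketch of both features, assuming the gTTS package for Google text-to-speech and a default Hugging Face summarization pipeline; the exact models and parameters used in the project may differ:

```python
from gtts import gTTS              # pip install gTTS
from transformers import pipeline  # pip install transformers

# Text produced by the OCR stage (placeholder content).
extracted_text = (
    "This document certifies the identity of the holder. It lists the holder's "
    "name, date of birth, address, and a unique identification number issued by "
    "the government of India."
)

# Summarization: condense the document into a short synopsis.
summarizer = pipeline("summarization")  # downloads a default model on first use
summary = summarizer(extracted_text, max_length=40, min_length=10)[0]["summary_text"]

# Text-to-speech: read the synopsis aloud for auditory or visually impaired users.
gTTS(text=summary, lang="en").save("summary.mp3")
```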
- Techniques: hyperparameter tuning, regularization (early stopping, dropout; sketched below), document splitting
- Models: CNN, VGG16, VGG19, and the Tesseract OCR engine
- Google TTS (text-to-speech) and Hugging Face Transformers for text summarization
- Framework: Keras
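As a sketch of the early-stopping regularization listed above, using the standard Keras callback (the patience value and monitored metric are assumptions):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # stop when validation loss stops improving
    patience=5,                  # tolerate 5 stagnant epochs before stopping
    restore_best_weights=True,   # roll back to the best epoch's weights
)
# model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=[early_stop])
```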
When we began searching for an appropriate dataset, we observed that there is no publicly available dataset of identity documents, as they hold sensitive and personal information. However, we came across a dataset on Kaggle consisting of six folders: Aadhar Card, PAN Card, Voter ID, single-page Gas Bill, Passport, and Driver's License. We added a few more images to each folder; these were our own documents that we manually scanned, with the rest coming from Google Images. Thus, these are the five documents we are classifying and extracting information from.
Originally, we implemented horizontal and vertical data augmentation through random flips to increase dataset size and diversity. We have since transitioned to using image data generators for both the training and test sets, as sketched below.
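A minimal version of these generators, with the earlier random flips retained for the training set; directory paths, target size, and batch size are assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,   # the random flips used for augmentation
    vertical_flip=True,
)
test_datagen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation at test time

train_gen = train_datagen.flow_from_directory(
    "dataset/train", target_size=(224, 224), batch_size=32, class_mode="categorical")
test_gen = test_datagen.flow_from_directory(
    "dataset/test", target_size=(224, 224), batch_size=32, class_mode="categorical")
```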
Various hyperparameters, such as the number of layers, neurons per layer, number of filters, kernel size, the value of p in dropout layers, number of epochs, and batch size, were tuned until satisfactory training and validation accuracy was achieved.
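An illustrative Keras CNN that exposes these tunable hyperparameters; the specific values below are placeholders, not the final tuned configuration:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_cnn(num_classes=5, filters=32, kernel_size=3, p=0.5, dense_units=128):
    """Each argument is one of the hyperparameters that was tuned."""
    return Sequential([
        Conv2D(filters, kernel_size, activation="relu", input_shape=(224, 224, 3)),
        MaxPooling2D(),
        Conv2D(filters * 2, kernel_size, activation="relu"),
        MaxPooling2D(),
        Flatten(),
        Dense(dense_units, activation="relu"),
        Dropout(p),                       # the tuned value of p in dropout layers
        Dense(num_classes, activation="softmax"),
    ])

model = build_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```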
The VGG architecture uses small convolution filters and a deep structure, allowing it to capture fine details; this is crucial for distinguishing between ID documents that often differ only subtly.
Four additional layers were incorporated on top of the pre-trained model, as in the sketch below.
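A sketch of this transfer-learning setup with a frozen VGG16 base; the source specifies four added layers, while their exact types and sizes here are assumptions:

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # keep pre-trained weights fixed

x = Flatten()(base.output)                   # added layer 1
x = Dense(256, activation="relu")(x)         # added layer 2
x = Dropout(0.5)(x)                          # added layer 3
out = Dense(5, activation="softmax")(x)      # added layer 4: one unit per class

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```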
Before settling on our final chosen model shown above, we tweaked the pre-trained architecture until satisfactory results were achieved.
![Comparative results of identity document classification models]
Tesseract OCR is then run on each classified page image to extract its text (a minimal sketch follows the list below). As future scope, we plan to explore:
- Interactive Summarization and Query Answering
- Advanced Handwritten Text Extraction
- Global Accessibility with Multilingual Support
- Wider document classification systems covering legal documents
- Exploring advanced CNN architectures
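As mentioned above, text extraction relies on the Tesseract engine. A minimal sketch via pytesseract, with an illustrative file name and preprocessing step:

```python
import pytesseract       # pip install pytesseract (Tesseract itself must be installed)
from PIL import Image

page = Image.open("page_1.png").convert("L")  # grayscale often improves OCR
text = pytesseract.image_to_string(page)      # extract raw text from the page
print(text)
```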