
Our Document Interaction Assistant streamlines document tasks with machine learning and OCR, efficiently recognizing a variety of documents. The project prioritizes user-friendly interaction with PDFs and images, featuring "Read Aloud" for enhanced accessibility and "Document Summarization" for concise summaries.


Document-Interaction-Assistant

Table of Contents
  1. About the Project
  2. Salient Features
  3. Data Description
  4. Data Preprocessing
  5. Document Classification Model
  6. Results
  7. Information Extraction Model
  8. Team

About the Project

We propose a model that can recognize the set of documents contained in a PDF or image made up of multiple documents. To accomplish this, the input PDF is split into individual pages, and a CNN model classifies each page into the appropriate document category. Each document's data is then extracted using OCR (optical character recognition). This is supported for five document types, including voter ID, driver's license, PAN, and Aadhaar. Except for the front and back of the same document, the input PDF must contain a single document per page. Initially, our document classification model achieved an accuracy of 0.7342 on the training set and 0.7736 on the validation set, with corresponding losses of 0.6923 and 0.8340.
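The page-splitting step described above can be sketched as follows. This is a minimal sketch assuming the `pypdf` library; the function name `split_pdf_pages` is illustrative, not the repository's actual code:

```python
from pathlib import Path

def split_pdf_pages(pdf_path, out_dir):
    """Split a multi-document PDF into one single-page PDF per page,
    so each page can be classified and OCR'd independently."""
    # Deferred import: the sketch only needs pypdf when actually run.
    from pypdf import PdfReader, PdfWriter

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = PdfReader(pdf_path)
    page_files = []
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        page_file = out_dir / f"page_{i + 1}.pdf"
        with open(page_file, "wb") as f:
            writer.write(f)
        page_files.append(page_file)
    return page_files
```

Each returned file can then be rendered to an image and fed to the classifier.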

In our ongoing efforts to improve performance, we explored the pre-trained VGG16 and VGG19 models. We tuned hyperparameters and added extra layers on top of the pre-trained networks. As a result, VGG16 achieved a validation loss of 0.3677 and a validation accuracy of 0.8769.

In addition to this, we incorporated two more features:

1. Read Aloud:

  • Utilizes text-to-speech technology for accessibility.
  • Converts text into spoken words.
  • Supports auditory learners and users with visual impairments.
  • Makes content more accessible and easier to consume.

2. Document Summarization:

  • Aids time-constrained users by condensing lengthy documents.
  • Uses Hugging Face Transformers library for NLP models.
  • Provides clear and instructive document synopses.
  • Maximizes time efficiency by distilling crucial insights.
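Both features can be sketched with the libraries the project names (Google TTS via the `gTTS` package, and a Hugging Face summarization pipeline). The function names and defaults below are illustrative assumptions, not the repository's actual code; imports are deferred because both libraries pull in network resources:

```python
def read_aloud(text, mp3_path="speech.mp3", lang="en"):
    """Convert extracted document text into an MP3 using Google Text-to-Speech."""
    from gtts import gTTS  # deferred: requires the gTTS package and network access
    gTTS(text=text, lang=lang).save(mp3_path)
    return mp3_path

def summarize(text, max_length=130, min_length=30):
    """Condense a long document with a Hugging Face summarization pipeline."""
    from transformers import pipeline  # deferred: downloads a model on first use
    summarizer = pipeline("summarization")
    result = summarizer(text, max_length=max_length, min_length=min_length)
    return result[0]["summary_text"]
```

A typical flow would pass the OCR output through `summarize` and then `read_aloud`.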

Salient Features

Hyperparameter tuning, regularization (early stopping, dropout), and document splitting.

Tech stack used

  • Models: CNN, VGG16, VGG19, and the Tesseract OCR engine
  • Google TTS (text-to-speech) for "Read Aloud"; Hugging Face Transformers for text summarization
  • Framework: Keras

User Flow

(user flow diagram)

Data Description

When we began searching for a suitable dataset, we observed that there is no publicly available dataset of identity documents, as they hold sensitive personal information. However, we came across a dataset on Kaggle consisting of six folders: Aadhaar card, PAN card, voter ID, single-page gas bill, passport, and driver's license. We added a few more images to each folder; these were our own documents that we scanned manually, with the rest coming from Google Images. Five of these document types are the ones we classify and extract information from.

Data Preprocessing

Originally, we implemented horizontal and vertical data augmentation through random flips to increase dataset size and diversity. We have since transitioned to image data generators for both the training and test sets.
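The generator setup described above can be sketched with Keras's `ImageDataGenerator`. The image size, batch size, and 20% validation split below are assumptions for illustration; the folder-per-class layout matches the dataset description:

```python
def make_generators(data_dir, image_size=(224, 224), batch_size=32):
    """Build train/validation generators with flip augmentation and rescaling."""
    # Deferred import: requires TensorFlow/Keras when actually run.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rescale=1.0 / 255,     # normalize pixel values to [0, 1]
        horizontal_flip=True,  # the random flips described above
        vertical_flip=True,
        validation_split=0.2,  # hold-out fraction (assumed ratio)
    )
    train = datagen.flow_from_directory(
        data_dir, target_size=image_size, batch_size=batch_size,
        class_mode="categorical", subset="training",
    )
    val = datagen.flow_from_directory(
        data_dir, target_size=image_size, batch_size=batch_size,
        class_mode="categorical", subset="validation",
    )
    return train, val
```

`flow_from_directory` infers the class labels from the folder names, so the six Kaggle folders map directly to categories.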

Document Classification Model

CNN model

(CNN architecture diagram)

Various hyperparameters, such as the number of layers, neurons per layer, number of filters, kernel size, the dropout probability p, number of epochs, and batch size, were varied until satisfactory training and validation accuracy was achieved.
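A small CNN with those hyperparameters exposed as arguments can be sketched as below. The specific layer counts and sizes are assumptions for illustration, not the repository's tuned architecture:

```python
def build_cnn(input_shape=(224, 224, 3), n_classes=5,
              n_filters=32, kernel_size=3, dropout_p=0.5):
    """A small CNN whose hyperparameters (filters, kernel size,
    dropout p) are exposed for the kind of tuning described above."""
    # Deferred import: requires TensorFlow/Keras when actually run.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(n_filters, kernel_size, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(n_filters * 2, kernel_size, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(dropout_p),  # the tunable p mentioned above
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Tuning then amounts to calling `build_cnn` with different argument values and comparing validation metrics.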


CNN Model results

(training and validation plots)

VGG16

The VGG architecture uses small convolution filters and a deep structure, allowing it to capture fine details, which is crucial for distinguishing between ID documents that often differ only subtly.

Four additional layers were incorporated into the pre-trained model.
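A transfer-learning setup of this shape can be sketched as follows: a frozen VGG16 base with ImageNet weights plus four added layers. The exact head below (Flatten, Dense, Dropout, Dense) is an assumption for illustration, not the repository's tuned configuration:

```python
def build_vgg16_classifier(input_shape=(224, 224, 3), n_classes=5):
    """VGG16 base (frozen, ImageNet weights) with a small classification head."""
    # Deferred imports: require TensorFlow/Keras when actually run.
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False  # keep pre-trained convolutional features fixed

    # Four added layers (assumed composition):
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base means only the new head's weights are trained, which is why fine-tuning converges quickly on a small identity-document dataset.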


Before settling on the final model shown above, we tweaked the pre-trained architecture until satisfactory results were achieved. (comparative results of identity document classification models)

Information Extraction Model

Following are the steps of OCR done on images:

(figures: OCR steps)
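The extraction step can be sketched with the Tesseract engine the project names, via the `pytesseract` wrapper. The grayscale conversion is a common pre-processing assumption, and `extract_text` is an illustrative name:

```python
def extract_text(image_path):
    """Run Tesseract OCR on a document image and return the raw text."""
    # Deferred imports: require Pillow, pytesseract, and the tesseract binary.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path).convert("L")  # grayscale often improves OCR
    return pytesseract.image_to_string(image)
```

The raw text returned here would then be parsed for fields such as name, date of birth, and document number.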

Ongoing Improvements:

  1. Interactive Summarization and Query Answering
  2. Advanced Handwritten Text Extraction
  3. Global Accessibility with Multilingual Support
  4. Wider document classification systems covering legal documents
  5. Exploring advanced CNN architectures

Team
