Document Classification using BERT | 97% accuracy | FastAPI, Docker, Shell


Document Classification with OCR and BERT

Overview

Document Classification with OCR and BERT is a project aimed at automatically categorizing textual images into predefined classes. This repository contains the code and resources necessary to train a powerful document classification model leveraging Optical Character Recognition (OCR) and the Bidirectional Encoder Representations from Transformers (BERT) algorithm. The model is deployed using FastAPI and Docker.

Project Highlights

  • Automated Document Classification: Classify textual images into categories without manual intervention, enabling efficient sorting and organization of large document datasets.

  • OCR Integration: Utilize Tesseract OCR, a popular open-source text recognition engine, to extract textual content from images, enabling the model to work with image-based documents.

  • BERT-based Document Understanding: Leverage BERT, a state-of-the-art language model, to understand the context and semantics of extracted text, improving the accuracy of document classification.

  • Flexibility and Customization: Adapt the project to your specific use case by easily modifying the number of classes, training data, and model architecture.

How it Works

  • Text Extraction with Tesseract OCR:

    • Images containing textual content are processed with Tesseract OCR to extract the text (see the first sketch after this list).
    • The extracted text is preprocessed and tokenized for further analysis.
  • BERT Model Training:

    • The preprocessed text and its corresponding labels are used to fine-tune a BERT-based document classification model (second sketch below).
    • The model learns to classify documents into the predefined categories.
  • Inference and Classification:

    • The trained model is used to classify new textual images into the appropriate classes (third sketch below).
    • Predictions enable automated sorting and organization of documents based on their content.
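
Below are three minimal sketches of the pipeline in the project's Python stack. First, the OCR step, assuming Tesseract and pytesseract are installed; the image filename is a placeholder:

```python
# Extract text from a single image with Tesseract OCR.
from PIL import Image
import pytesseract

def extract_text(image_path: str) -> str:
    """Run Tesseract on one image and return the raw recognized text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)

text = extract_text("sample_document.png")  # placeholder filename
print(text[:200])                           # preview the first 200 characters
```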
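
Second, fine-tuning; the texts, labels, and hyperparameters here are illustrative placeholders, not the project's actual training configuration:

```python
# Fine-tune a BERT classifier on OCR-extracted text with Hugging Face transformers.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

texts = ["first document text ...", "second document text ..."]  # OCR output
labels = torch.tensor([0, 1])                                    # integer class ids

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    batch = tokenizer(texts, truncation=True, padding=True,
                      max_length=512, return_tensors="pt")
    outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```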
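
Third, inference, reusing extract_text, tokenizer, and model from the two sketches above; the class names are hypothetical:

```python
# Classify a new image by OCR-ing it and running the fine-tuned model.
import torch

CLASS_NAMES = ["invoice", "resume"]  # placeholder label order

model.eval()
text = extract_text("new_document.png")  # placeholder filename
batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(CLASS_NAMES[int(logits.argmax(dim=-1))])
```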

Prerequisites

  • Python 3.x (tested with Python 3.10.12 on Ubuntu 22.04)
  • Libraries: transformers, torch, pytesseract, Pillow (PIL), FastAPI, Gradio
  • Tesseract OCR installed
  • Docker

Usage

  • Clone the repository: git clone https://github.com/yesdeepakmittal/document-classification-using-bert.git
  • Create a virtual environment and install the libraries listed in the requirements.txt file
  • Train the model using the Jupyter Notebook
  • Serve the model using FastAPI and deploy it using Docker (a serving sketch follows this list)
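
A minimal sketch of the serving step, assuming the fine-tuned model and tokenizer were saved to a hypothetical saved_model/ directory; the endpoint name is illustrative:

```python
# main.py -- serve the classifier with FastAPI (run with: uvicorn main:app)
import io

import pytesseract
import torch
from fastapi import FastAPI, UploadFile
from PIL import Image
from transformers import BertForSequenceClassification, BertTokenizerFast

app = FastAPI()
tokenizer = BertTokenizerFast.from_pretrained("saved_model/")  # placeholder path
model = BertForSequenceClassification.from_pretrained("saved_model/")
model.eval()

@app.post("/classify")
async def classify(file: UploadFile):
    """OCR the uploaded image and return the predicted class id."""
    image = Image.open(io.BytesIO(await file.read()))
    text = pytesseract.image_to_string(image)
    batch = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return {"predicted_class": int(logits.argmax(dim=-1))}
```

The same app can then be packaged with a standard Python Docker image and run as a container.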

Challenges & Remedies

  1. Computational:

    • Training a BERT model is slow without a dedicated GPU.
      • Remedy: Utilized the GPU available in Google Colab.
    • Preprocessing the text of a single document takes at least 30 seconds, which is infeasible when working with thousands of documents.
      • Remedy: Run the preprocessing task in parallel on a multi-core processor and save the processed text as .txt files (see the multiprocessing sketch after this list).
  2. OCR Engine Performance:

    • The model's input is the text extracted by the OCR engine, so the more accurate the OCR engine, the better the fine-tuned model.
      • Remedy: A premium OCR engine such as Google Vision OCR is more accurate and faster than the Tesseract OCR engine used in this project.
  3. Data Quality & Quantity:

    • BERT models require large amounts of data for effective training, and obtaining a substantial, well-labeled dataset can be challenging, especially for specific domains.
      • Remedy:
        • Data Augmentation: Apply techniques such as synonym replacement to artificially increase the size of your dataset.
        • Domain-Specific Pretraining: Consider using domain-specific pretrained BERT models.
  4. Training Challenges:

    • Training large transformer models like BERT can be time-consuming, especially if the dataset is vast and the model architecture is complex.
      • Remedy:
        • Gradient Accumulation: Simulate training with larger batch sizes without significantly increasing GPU memory requirements (see the sketch after this list).
  5. Fine-Tuning Challenges:

    • Finding the optimal learning rate, batch size, and number of epochs for fine-tuning BERT can be challenging and time-consuming.
      • Remedy: Tune hyperparameters over multiple candidate values and use early stopping.
  6. Label Imbalance:

    • Classes might not be balanced, leading to biased models.
      • Remedy: Assign higher weights to minority classes during loss calculation so that misclassifying them is penalized more (see the weighted-loss sketch after this list).
  7. Python Version Difference:

    • An incompatible Python version can prevent loading the pretrained model for serving.
      • Remedy: Use the same Python version for model training and model loading.
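
For challenge 1, a sketch of the multi-core preprocessing remedy: each worker OCRs one image in parallel, and the results are cached as .txt files so the slow OCR step never has to be repeated. The folder names are placeholders:

```python
# Run OCR over many images in parallel and cache the extracted text.
from multiprocessing import Pool
from pathlib import Path

import pytesseract
from PIL import Image

def ocr_one(image_path: Path) -> str:
    return pytesseract.image_to_string(Image.open(image_path))

if __name__ == "__main__":
    image_paths = sorted(Path("images").glob("*.png"))  # placeholder folder
    with Pool() as pool:                 # one worker per CPU core by default
        texts = pool.map(ocr_one, image_paths)
    out_dir = Path("texts")              # placeholder output folder
    out_dir.mkdir(exist_ok=True)
    for path, text in zip(image_paths, texts):
        (out_dir / (path.stem + ".txt")).write_text(text)
```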
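
For challenge 4, a self-contained gradient-accumulation sketch: gradients from several small batches are summed before a single optimizer step, simulating a larger batch without extra GPU memory. A tiny linear model and random data stand in for BERT and the real dataset:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for the BERT classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
ACCUM_STEPS = 4                                # effective batch = 4 small batches

optimizer.zero_grad()
for step in range(16):                         # 16 placeholder mini-batches
    x, y = torch.randn(2, 10), torch.randint(0, 2, (2,))
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so summed grads average out
    loss.backward()                            # gradients accumulate in .grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # one update per ACCUM_STEPS batches
        optimizer.zero_grad()
```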
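
For challenge 6, a sketch of the class-weighting remedy: each class weight is inversely proportional to its frequency, so misclassifying a minority-class example costs more. The class counts, logits, and labels are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class_counts = torch.tensor([900.0, 100.0])   # hypothetical imbalance
weights = class_counts.sum() / (len(class_counts) * class_counts)  # -> [0.56, 5.0]

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 2)                    # stand-in for model outputs
labels = torch.randint(0, 2, (8,))            # stand-in for true labels
loss = loss_fn(logits, labels)
```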

Data Source

Kaggle