OCR engines have been developed into many kinds of domain-specific OCR applications, such as receipt OCR, invoice OCR, cheque OCR, and legal billing document OCR. They can be used for data entry from business documents (e.g. cheques, passports, invoices, bank statements and receipts) and for automatic number plate recognition. This project aims to build an OCR engine with the help of datasets and images.
Some of the available OCR engines are:
- tesseract-ocr by Google, GITHUB REPO | Documentation
- keras-ocr, GITHUB REPO | Documentation
- EasyOCR by Jaided AI, GITHUB REPO | Documentation
- Text detection with MSER and SWT by @azmiozgen, GITHUB REPO
- TrOCR by Microsoft, available through Hugging Face Transformers, Overview
- docTR by Mindee, GITHUB REPO | Documentation
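To compare the engines listed above on the same images, one option is to treat each engine as a plain function from an image path to a text string. The sketch below is only an illustration under that assumption: `OcrEngine`, `list_images` and `compare_engines` are hypothetical names introduced here, not part of any of the libraries, and the set of image extensions is a guess at what the test images will use.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Assumption for this sketch: an "engine" is any callable that takes an
# image path and returns the recognized text. This is not a library API.
OcrEngine = Callable[[str], str]

# Assumed set of image extensions; adjust to the files you actually have.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_images(folder: str) -> List[Path]:
    """Return the image files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def compare_engines(
    folder: str, engines: Dict[str, OcrEngine]
) -> Dict[str, Dict[str, str]]:
    """Run every engine on every image.

    Returns {image_name: {engine_name: recognized_text}}.
    """
    results: Dict[str, Dict[str, str]] = {}
    for image in list_images(folder):
        row: Dict[str, str] = {}
        for name, engine in engines.items():
            try:
                row[name] = engine(str(image)).strip()
            except Exception as exc:
                # Record the failure and keep going with the other engines.
                row[name] = f"<error: {exc}>"
        results[image.name] = row
    return results
```

With the libraries installed, an engine can be plugged in as a one-line wrapper. For example, `pytesseract.image_to_string` accepts a file path and returns a string, so it can be passed directly, and an EasyOCR reader can be wrapped as `lambda p: " ".join(reader.readtext(p, detail=0))`. Keeping the harness independent of any one library makes it easy to add or drop engines while comparing their output.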
Some of the available datasets for testing and training an OCR engine:
- List of open OCR datasets
- Keras OCR dataset
- IAM Handwriting
- ICDAR 2003
- TextOCR
- FUNSD (Form Understanding in Noisy Scanned Documents)
- ST-VQA
- SciTSR
- TextCaps
- DocBank
- Kannada-MNIST
- MLe2e
Drive Link of some datasets for testing and training an OCR engine.
Research papers:
- Attention-based Extraction of Structured Information from Street View Imagery
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
- Adapting the Tesseract Open Source OCR Engine for Multilingual OCR
- PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection
- PP-OCR: A Practical Ultra Lightweight OCR System
- Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks
- End-to-End Interpretation of the French Street Name Signs Dataset
- MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding
- Task 0: Both the custom model and the pretrained models need training data. Add dataset links to the datasets section of readme.md, or download the datasets to a drive and add a hyperlink to it there.
- Task 1: There are three folders: Newspaper, Posters and Sheets. Each contains one sample image. Find similar images and push them to the matching folder; a minimum of 50 images per folder is enough to make the dataset.
- Task 2: Create a Jupyter notebook that tries some of the libraries listed in this readme and tests their output on the test images. Contribute the notebook with a name of the form Name_of_the_contributer.ipynb.
- Task 3: This is the last step of the project. Having tried all the libraries, build a custom model using the datasets, with the help of the research papers and your project mentor, and submit it as a Jupyter notebook.
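For Task 1, it can help to check locally whether each folder has reached the 50-image minimum before opening a PR. The following is a minimal sketch, not part of the repo: the folder names and the minimum come from the task description, while the function names and the list of image extensions are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict

# Folder names and the 50-image minimum are taken from Task 1.
FOLDERS = ("Newspaper", "Posters", "Sheets")
MIN_IMAGES = 50

# Assumption: these extensions are what counts as an image here.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def count_images(root: str) -> Dict[str, int]:
    """Return {folder_name: number_of_image_files} for each expected folder."""
    counts: Dict[str, int] = {}
    for name in FOLDERS:
        folder = Path(root) / name
        if not folder.is_dir():
            # A missing folder is treated as holding zero images.
            counts[name] = 0
            continue
        counts[name] = sum(
            1 for p in folder.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def folders_below_minimum(root: str, minimum: int = MIN_IMAGES) -> Dict[str, int]:
    """Return {folder_name: images_still_needed} for folders under the minimum."""
    return {
        name: minimum - n
        for name, n in count_images(root).items()
        if n < minimum
    }
```

Calling `folders_below_minimum(".")` from the repository root returns an empty dictionary once every folder holds at least 50 images; otherwise it lists how many more each folder needs.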
Please don't push any commits to the main branch; PRs that do so will not be accepted. There are 4 tasks. To contribute, first join the Discord server, then comment under the respective issue, then fork the repo and start working. HAPPY CONTRIBUTING!!!