A collection of about 12k Marathi word images with corresponding labels, useful for Devanagari Optical Character Recognition.

sayalighodekar/Marathi-OCR-Dataset

Marathi-OCR-Dataset

A collection of about 12k Marathi word images with corresponding labels, useful for Devanagari Optical Character Recognition.

Description

Publicly available word- and line-level datasets for Devanagari character recognition are scarce. We created this dataset of ~12k Marathi word images and their corresponding text labels, encoded in UTF-8. Words are segmented from Marathi books in PDF and .epub format, available at http://www.esahity.com/ . We used 12 books from different genres to include diversity in vocabulary and font variation. This also removes the dependency on domain-specific words and redundant Marathi numerals. The dataset is based on the IAM Handwriting Dataset.
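Because the labels are UTF-8 encoded Devanagari text, the length of a word depends on what is being counted: bytes, Unicode code points, or visible units. The sketch below uses only the Python standard library to break a word into its code points. The helper names (`code_points`, `utf8_length`, `count_combining_marks`) are illustrative and not part of this repository, and the sample words are not taken from labels.txt.

```python
import unicodedata


def code_points(word):
    """Return (hex code point, Unicode name) for every character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]


def utf8_length(word):
    """Number of bytes the word occupies when encoded as UTF-8."""
    return len(word.encode("utf-8"))


def count_combining_marks(word):
    """Count vowel signs, viramas and other marks (Unicode categories M*)."""
    return sum(1 for ch in word if unicodedata.category(ch).startswith("M"))


if __name__ == "__main__":
    # Written with escapes so the exact code points are unambiguous.
    marathi = "\u092E\u0930\u093E\u0920\u0940"   # renders as: मराठी
    conjunct = "\u0915\u094D\u0937"              # renders as: क्ष

    for sample in (marathi, conjunct):
        print(sample, len(sample), "code points,", utf8_length(sample), "bytes")
        for hex_cp, name in code_points(sample):
            print("   ", hex_cp, name)
```

The second sample is drawn as a single glyph but is stored as three code points (two consonants joined by a virama), which is worth keeping in mind when computing label lengths or building a character vocabulary for a recognizer.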

We created this dataset using pytesseract, a wrapper for Google's Tesseract-OCR engine. The challenge with using Tesseract-OCR for Indic languages is that it is trained using the same approach as for European languages. It fails to recognize compound words (common in Devanagari script), which are consonant-vowel sequences represented as a single unit. Devanagari script also contains various diacritics written with the characters. To eliminate the resulting errors and inconsistencies in the predicted output, we manually corrected the text labels. The images are resized and stored in JPEG format at a resolution of 96 dpi horizontally and vertically, so that they can be fed directly to neural networks. Word length varies from 2 characters up to 15-20 characters. We also applied pre-processing techniques such as binarization and image thresholding, using OpenCV and PIL, to produce cleaner images.
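The README does not say which thresholding method was used. As an illustration of what binarization does to a word image, here is a dependency-free sketch of Otsu's method operating on a grayscale image stored as nested lists of 0-255 values. It is a stand-in for the OpenCV/PIL pipeline, not a reproduction of it; the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        # Background class = all pixels with gray level <= t.
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map every pixel to 0 or 255 using the image's own Otsu threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny synthetic image: dark ink (low values) on a light page (high values).
    tiny = [[10, 12, 200],
            [11, 210, 205]]
    for row in binarize(tiny):
        print(row)
```

With OpenCV installed, the same idea is available through `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`, which is faster on real images.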

Usage

The file images.zip contains the Marathi word images, and labels.txt contains the corresponding text. This dataset can be used to improve existing OCR systems; see Train Tesseract 4.0. More broadly, it can be used to train hybrid CNN-LSTM models from scratch; see Text Recognition System using TensorFlow.
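A minimal sketch of pairing images with labels follows. The layout of labels.txt is not documented here, so the code makes two assumptions that should be checked against the actual files: labels.txt holds one label per line, and line i belongs to the i-th JPEG in the archive's own order. If the file instead carries an image ID on each line (as the IAM format does), the pairing step needs to be adapted. All function names are hypothetical.

```python
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def read_labels(labels_path):
    """Read non-empty lines of a UTF-8 text file as labels."""
    text = Path(labels_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def list_images(zip_path):
    """List JPEG members of the archive, in the order they are stored."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist()
                if Path(name).suffix.lower() in IMAGE_SUFFIXES]


def pair_images_with_labels(zip_path, labels_path):
    """Pair images and labels by position (ASSUMPTION: same order, same count)."""
    images = list_images(zip_path)
    labels = read_labels(labels_path)
    if len(images) != len(labels):
        raise ValueError(
            f"{len(images)} images but {len(labels)} labels; "
            "the positional pairing assumption does not hold"
        )
    return list(zip(images, labels))


if __name__ == "__main__":
    pairs = pair_images_with_labels("images.zip", "labels.txt")
    print(len(pairs), "image/label pairs")
    print(pairs[:3])
```

The count check is deliberate: a mismatch is the cheapest early signal that the positional assumption is wrong, and it is better to fail there than to train on misaligned pairs.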
