Page segmentation

Detecting text and images on heritage documents

Keywords: page segmentation, document layout analysis, text line detection

Approaches: convolutional neural networks, synthetic data

Tools: docExtractor, dhSegment

Example

Goals

Page segmentation is the process by which a digital image of a document page is divided into columns and blocks, which are then classified as illustrations, text, tables, etc.

Automatic extraction of illustrations and text enables information retrieval scenarios, document analysis for digital humanities, and similar uses.
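To make the idea of dividing a page into blocks concrete, here is a minimal sketch of a classic projection-profile split on a binarized page. It is not the method used by the tools listed on this page. The function names `find_runs` and `segment_blocks` are hypothetical, and the classification step (text, illustration, table) is deliberately left out.

```python
import numpy as np


def find_runs(profile, min_ink=1):
    """Return (start, end) pairs of consecutive positions whose ink count is >= min_ink.

    `end` is exclusive, so the run covers profile[start:end].
    """
    active = profile >= min_ink
    # Pad with False on both sides so every run has a detectable start and end.
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]


def segment_blocks(page):
    """Split a binary page (1 = ink, 0 = background) into bounding boxes.

    Rows are first grouped into horizontal bands separated by blank rows,
    then each band is split into blocks separated by blank columns.
    Returns a list of (top, bottom, left, right) tuples with exclusive ends.
    """
    boxes = []
    for top, bottom in find_runs(page.sum(axis=1)):
        band = page[top:bottom]
        for left, right in find_runs(band.sum(axis=0)):
            boxes.append((top, bottom, left, right))
    return boxes


# Toy page: two side-by-side blocks (columns) above one wide block.
page = np.zeros((10, 12), dtype=np.uint8)
page[1:3, 1:5] = 1
page[1:3, 7:11] = 1
page[5:9, 2:10] = 1
boxes = segment_blocks(page)
```

A split like this only works on clean, well-aligned scans, which is one reason learned approaches are preferred for heritage documents.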

Educational resources

docExtractor

docExtractor is a generic approach for extracting visual elements such as text lines or illustrations from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and a fully convolutional network for extraction. See the project's GitHub repository.

EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France
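The synthetic-data idea mentioned for docExtractor can be illustrated with a toy generator that produces a page image together with its pixel-level label mask, which is the kind of training pair a fully convolutional network consumes. This is only a sketch under simplifying assumptions: it does not use or reproduce docExtractor's generator, and `make_synthetic_page`, the label constants, and all layout parameters are hypothetical.

```python
import numpy as np

# Hypothetical label ids for the per-pixel ground truth.
BACKGROUND, TEXT, ILLUSTRATION = 0, 1, 2


def make_synthetic_page(height=128, width=96, n_lines=6, seed=0):
    """Generate a grayscale page and a matching label mask.

    The page holds one noisy rectangle standing in for an illustration
    and a few dark horizontal bars standing in for text lines.
    """
    rng = np.random.default_rng(seed)
    image = np.full((height, width), 255, dtype=np.uint8)  # white paper
    labels = np.full((height, width), BACKGROUND, dtype=np.uint8)

    # Illustration: a noisy gray rectangle near the top of the page.
    top, left = int(rng.integers(4, 12)), int(rng.integers(4, 20))
    h, w = int(rng.integers(20, 30)), int(rng.integers(40, 60))
    image[top:top + h, left:left + w] = rng.integers(60, 200, size=(h, w), dtype=np.uint8)
    labels[top:top + h, left:left + w] = ILLUSTRATION

    # Text lines: dark bars of random length below the illustration.
    line_height, spacing, first_line = 6, 12, 48
    for i in range(n_lines):
        y = first_line + spacing * i
        if y + line_height > height:
            break  # stop once lines would run off the page
        x0, x1 = 8, int(rng.integers(width // 2, width - 8))
        image[y:y + line_height, x0:x1] = rng.integers(
            0, 80, size=(line_height, x1 - x0), dtype=np.uint8
        )
        labels[y:y + line_height, x0:x1] = TEXT

    return image, labels


image, labels = make_synthetic_page(seed=42)
```

Because the labels come for free with each generated page, any number of training pairs can be produced without manual annotation.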

dhSegment

dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See the project's GitHub repository.

DHLAB-EPFL, Switzerland

OCR

OCR engines typically incorporate pre-processing steps, one of which is page segmentation (e.g. Tesseract).
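In Tesseract, the segmentation behaviour is selected with the `--psm` (page segmentation mode) command-line option, which accepts values from 0 to 13. The sketch below only builds the command line and runs it if a `tesseract` executable is found. The helper names `tesseract_command` and `run_tesseract` and the file names are hypothetical.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, psm=3):
    """Build a Tesseract CLI call with an explicit page segmentation mode.

    psm=3 is Tesseract's default (fully automatic page segmentation).
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    return ["tesseract", image_path, output_base, "--psm", str(psm)]


def run_tesseract(image_path, output_base, psm=3):
    """Run Tesseract if it is installed; return True if the call was made."""
    if shutil.which("tesseract") is None:
        return False  # Tesseract is not available on this machine
    subprocess.run(tesseract_command(image_path, output_base, psm), check=True)
    return True


# Example: treat the page as a single column of text (mode 4).
command = tesseract_command("page.png", "page_out", psm=4)
```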

Other resources

Probabilistic homogeneity for document image segmentation (Tan Lu, Ann Dooms)
Historical Book Analysis Competition (HBA), ICDAR 2019. Dataset on api.bnf.fr.

Implementations

NLP

HTR
