
Page segmentation

CENL-AI-WG edited this page Mar 4, 2021 · 16 revisions


Detecting text and images on heritage documents

Keywords: page segmentation, document layout analysis, text line detection

Approaches: convolutional neural networks, synthetic data

Tools: docExtractor, dhSegment



Goals

Page segmentation is the process by which a digital image of a document page is divided into columns and blocks which are then classified as illustrations, text, tables, etc.

Automatic extraction of illustrations and text supports information retrieval scenarios, document analysis for digital humanities, and similar uses.
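As a rough illustration of dividing a page into blocks, the sketch below uses projection profiles, a classical rule-based technique. It is not the method used by the tools listed on this page. It assumes the page has already been binarized into a grid of 0 (background) and 1 (ink); the function names and the toy page are hypothetical, and classifying the resulting blocks as text, illustration or table is not covered.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def split_runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of consecutive entries >= min_ink."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i                      # a run of inked rows/columns begins
        elif value < min_ink and start is not None:
            runs.append((start, i))        # the run ends at the first blank entry
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_blocks(page: List[List[int]]) -> List[Box]:
    """Cut the page into horizontal bands, then cut each band into columns."""
    blocks = []
    row_profile = [sum(row) for row in page]            # ink count per row
    for top, bottom in split_runs(row_profile):
        band = page[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]  # ink count per column
        for left, right in split_runs(col_profile):
            blocks.append((top, left, bottom, right))
    return blocks


# Toy binarized page: two side-by-side blocks above one wide block.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(segment_blocks(page))  # [(1, 1, 3, 3), (1, 4, 3, 6), (4, 1, 5, 6)]
```

Rule-based cuts like these tend to break down on skewed, degraded or densely decorated pages, which is one reason learned approaches are common for heritage documents.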

Educational resources

docExtractor

docExtractor is a generic approach for extracting visual elements such as text lines or illustrations from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and on a fully convolutional network for extraction. See the project's GitHub repository.

EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France
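A fully convolutional network typically produces a per-pixel label map, which then has to be turned into discrete regions. The sketch below shows one generic way to do that with connected components. It is an assumption-laden illustration, not docExtractor's actual post-processing: the label values, the `boxes_from_mask` name and the toy mask are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive

# Hypothetical class labels for the per-pixel mask.
BACKGROUND, TEXT, ILLUSTRATION = 0, 1, 2


def boxes_from_mask(mask: List[List[int]]) -> Dict[int, List[Box]]:
    """Group same-label pixels into 4-connected components and return their bounding boxes."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: Dict[int, List[Box]] = {}
    for y in range(height):
        for x in range(width):
            label = mask[y][x]
            if label == BACKGROUND or seen[y][x]:
                continue
            # Breadth-first flood fill over neighbours carrying the same label.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y + 1, x + 1
            while queue:
                cy, cx = queue.popleft()
                top, left = min(top, cy), min(left, cx)
                bottom, right = max(bottom, cy + 1), max(right, cx + 1)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and mask[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.setdefault(label, []).append((top, left, bottom, right))
    return boxes


# Toy mask: two text regions and one illustration.
mask = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]
print(boxes_from_mask(mask))
# {1: [(0, 1, 1, 4), (2, 1, 3, 3)], 2: [(2, 4, 4, 6)]}
```

In practice a size threshold or morphological cleanup would usually be applied first, since raw network output tends to contain small spurious components.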

dhSegment

dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See the project's GitHub repository.

DHLAB-EPFL, Switzerland

OCR

OCR engines such as Tesseract incorporate pre-processing steps, one of which is page segmentation.
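With Tesseract, the segmentation step can be steered through the `--psm` (page segmentation mode) command-line option, and the detected layout can be exported as TSV. The sketch below is a minimal wrapper around the command-line tool. It assumes the `tesseract` executable is installed and on the PATH; `page.png` and the helper names are hypothetical, and only a few of the available modes are listed.

```python
import shutil
import subprocess
from typing import Dict, List

# A subset of Tesseract's page segmentation modes (values accepted by --psm).
PSM_AUTO = 3            # fully automatic page segmentation, no orientation/script detection
PSM_SINGLE_COLUMN = 4   # assume a single column of text of variable sizes
PSM_SINGLE_BLOCK = 6    # assume a single uniform block of text
PSM_SPARSE_TEXT = 11    # sparse text, in no particular order


def tesseract_command(image_path: str, psm: int = PSM_AUTO) -> List[str]:
    """Build a tesseract call that writes TSV layout data to standard output."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "tsv"]


def layout_rows(image_path: str, psm: int = PSM_AUTO) -> List[Dict[str, str]]:
    """Run tesseract and return one dict per TSV row (block, paragraph, line or word)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    if not lines:
        return []
    header = lines[0].split("\t")  # the first TSV line holds the column names
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


# Example usage (needs a real image file, so it is left commented out):
# for row in layout_rows("page.png", PSM_SINGLE_COLUMN):
#     print(row["block_num"], row["left"], row["top"], row["width"], row["height"])
```

Each TSV row carries block, paragraph, line and word indices together with a bounding box, so the output gives a view of the segmentation the engine computed before recognition.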

Other resources

Implementations