Page segmentation
Detecting text and images on heritage documents
Keywords: page segmentation, document layout analysis, text line detection
Approaches: convolutional neural networks, synthetic data
Tools: docExtractor, dhSegment
Page segmentation is the process by which a digital image of a document page is divided into columns and blocks, which are then classified as illustrations, text, tables, etc.
Automatically extracting illustrations and text supports information retrieval scenarios, document analysis for digital humanities research, and similar uses.
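To make the idea of dividing a page into regions concrete, here is a minimal sketch of text line detection using a horizontal projection profile. This is a classical heuristic chosen for illustration only; it is not the method used by docExtractor or dhSegment, which rely on neural networks. The function name `find_text_lines`, the `min_ink` parameter, and the toy page are all assumptions made for this example.

```python
from typing import List, Tuple


def find_text_lines(page: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return inclusive (top, bottom) row ranges that contain ink.

    `page` is a binarized image given as rows of 0 (background) and 1 (ink).
    A row counts as part of a text line when it holds at least `min_ink`
    ink pixels (hypothetical threshold, tune per collection).
    """
    # Horizontal projection profile: amount of ink in each row.
    profile = [sum(row) for row in page]

    lines: List[Tuple[int, int]] = []
    start = None  # top row of the line currently being tracked, if any
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a new line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy 6x6 "page": two text lines separated by a blank row.
toy_page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
text_lines = find_text_lines(toy_page)
print(text_lines)  # [(1, 2), (4, 4)]
```

A heuristic like this assumes clean, straight, well-separated lines. Skewed scans, marginalia, and illustrations embedded in text, all common in heritage documents, break that assumption, which is one reason learned approaches are used instead.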
docExtractor is a generic approach for extracting visual elements such as text lines or illustrations from historical documents. It can be used as an off-the-shelf system or fine-tuned on a specific dataset. It relies on a fast generator of rich synthetic documents for training and a fully convolutional network for extraction. See the project's GitHub repository.
EnHerit (Enhancing Heritage Image Databases) project, ANR (2018-2022), LIGM Laboratoire d'Informatique Gaspard-Monge, France
dhSegment is a tool for historical document processing. Its generic approach makes it possible to segment regions and extract content from different types of documents. See the project's GitHub repository.
DHLAB-EPFL, Switzerland
All OCR engines incorporate pre-processing steps, one of which is segmentation. Tesseract, for example, runs its own page layout analysis before recognizing text.
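As a rough illustration of the segmentation an OCR engine already performs, the sketch below reads Tesseract's layout output through the `pytesseract` wrapper and groups recognized words by the block they were assigned to. It assumes `pytesseract`, Pillow, and the Tesseract binary are installed. The helper names `group_words_by_block` and `segment_with_tesseract`, and the hand-written `sample` dictionary, are hypothetical and only mimic the shape of what `pytesseract.image_to_data` returns.

```python
from collections import defaultdict
from typing import Dict, List

WORD_LEVEL = 5  # Tesseract hierarchy: 1 page, 2 block, 3 paragraph, 4 line, 5 word


def group_words_by_block(data: Dict[str, list]) -> Dict[int, List[str]]:
    """Group non-empty word entries by their block number."""
    blocks: Dict[int, List[str]] = defaultdict(list)
    for level, block_num, text in zip(data["level"], data["block_num"], data["text"]):
        if int(level) == WORD_LEVEL and str(text).strip():
            blocks[int(block_num)].append(str(text))
    return dict(blocks)


def segment_with_tesseract(image_path: str) -> Dict[int, List[str]]:
    """Run Tesseract layout analysis on an image and group words by block."""
    # Imported lazily so the grouping helper works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        config="--psm 3",  # fully automatic page segmentation, no OSD
        output_type=pytesseract.Output.DICT,
    )
    return group_words_by_block(data)


# Hand-written sample in the same shape as the DICT output (not real OCR output).
sample = {
    "level": [1, 2, 5, 5, 2, 5],
    "block_num": [0, 1, 1, 1, 2, 2],
    "text": ["", "", "Page", "segmentation", "", "Tesseract"],
}
print(group_words_by_block(sample))
```

On a real scan, the call would be `segment_with_tesseract("page.png")`, where the file name is a placeholder. The same dictionary also carries `left`, `top`, `width`, and `height` for each entry, which give the bounding boxes of the detected regions.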