🛑🛑🛑 IMPORTANT NOTE FROM 2022/04/24: ORIGAMI IS NOW LEGACY 🛑🛑🛑
Origami's segmentation model was trained on an old version of TensorFlow. To run it, you need a working TensorFlow installation no newer than 2.1.x, which is now impractical on nearly all current OS / CPU / GPU configurations.
Origami's OCR uses Calamari v1, which is now similarly difficult to install.
Therefore, Origami has been retired. The repository status has been changed to "archive".
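To check whether an existing environment satisfies the TensorFlow constraint above, it is enough to compare version strings. The following is a minimal sketch: the helper name `tf_version_supported` is hypothetical, and only the upper bound of 2.1.x comes from this note (no lower bound is checked).

```python
import re


def tf_version_supported(version: str) -> bool:
    """Return True if `version` is TensorFlow 2.1.x or older."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison: (2, 1) passes, (2, 2) and anything newer does not.
    return (major, minor) <= (2, 1)
```

In an environment where TensorFlow is installed, you would pass `tensorflow.__version__` to this function.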
Origami is a self-contained suite of batches and tools for OCR processing of historical newspapers. It covers many essential steps in a digitization pipeline, including (1) building training data for training models, and (2) generating Page-XML OCR output from pages using trained models.
Apart from its specific features, Origami is
- easy to set up
- easy to use
- based on file-based intermediary results that allow customization
Origami's current default implementation features:
- DNN segmentation
- dewarping
- reading order detection
- simple table support
- Page-XML export
Origami also provides additional tools for:
- annotating ground truth
- debugging
- creating annotated images
- evaluation of OCR quality
We provide two options for installing Origami:
- Run in a Docker container.
- Install and run directly on your machine (in a conda environment).
Make sure you take a look at the scripts under quickstart.
- Download and install Docker.
- Install the NVIDIA container toolkit (necessary for GPU usage). See the NVIDIA documentation for installation instructions.
- Build the docker container (NOTE: this process can take ~20 minutes or more, as the container builds Scikit-Geometry from source):
  cd docker
  docker buildx build -t "origami:origami-gpu" .
  This creates a docker image origami:origami-gpu.
- Launch the container. You must specify the location of your local copy of the Origami repo, as shown below:
  docker run --gpus all -it --rm -v /the/local/path/to/origami/:/origami origami:origami-gpu bash
  This runs the container and presents you with an interactive shell, ready to run Origami (located in /origami/).
NOTE: Origami requires some additional set-up to run (e.g., downloading the segmentation models). See below for details.
If you have access to conda, it is easiest to use one of the following conda environment descriptions:
requirements/origami_cpu.yml
requirements/origami_gpu.yml
For example:
conda env create -f requirements/origami_cpu.yml
Note that the requirements have been split into a GPU part (necessary for the Origami
segment and ocr stages) and a CPU part (suitable for all other Origami stages).
This simplifies dependency management with TensorFlow. It is also the split you would
usually choose when running this system on a cluster that is separated into GPU and
CPU nodes.
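The GPU/CPU split described above can be expressed as a small lookup, which is handy when a scheduler script has to pick an environment per stage. This is a sketch only: the function name `environment_file` is hypothetical, while the stage names and file paths are the ones listed in this README.

```python
# Stages that need the GPU environment, per the note above.
GPU_STAGES = {"segment", "ocr"}


def environment_file(stage: str) -> str:
    """Return the conda environment description to use for a stage."""
    if stage in GPU_STAGES:
        return "requirements/origami_gpu.yml"
    return "requirements/origami_cpu.yml"
```

For example, `environment_file("segment")` selects the GPU file, while `environment_file("layout")` selects the CPU file.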
Make sure you take a look at the scripts under quickstart.
Take a look at requirements/legacy and try the following:
cd origami
conda create --name origami python=3.7 -c defaults -c conda-forge --file requirements/legacy/conda.txt
conda activate origami
pip install -r requirements/legacy/pip.txt
On some systems (e.g. macOS 10.15.7) the conda
installation of scikit-geometry is broken. In these cases,
you can build scikit-geometry from source:
conda activate origami
git clone https://github.com/scikit-geometry/scikit-geometry
cd scikit-geometry
python setup.py install
cd /path/to/origami
python -m origami.batch.detect.segment
All command line tools will give you help information on their arguments when called as above.
The given data path should contain the pages to be processed as images. Generated data is written to the same path. Images may be organized in any hierarchy of subfolders.
Make sure you take a look at the scripts under quickstart for an example of a complete pipeline.
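Because page images may sit anywhere below the data path, it can help to list them before starting a batch. The sketch below walks the folder tree recursively. The helper name `find_page_images` and the set of file suffixes are assumptions for illustration; this README does not state which image formats Origami accepts.

```python
from pathlib import Path
from typing import List, Sequence

# Assumption: typical scan formats. Adjust to what your corpus contains.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".tif", ".tiff")


def find_page_images(data_path, suffixes: Sequence[str] = IMAGE_SUFFIXES) -> List[Path]:
    """Recursively collect image files below `data_path`, sorted by path."""
    root = Path(data_path)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

Files with other suffixes (for example generated `.zip` or `.json` artifacts in the same folders) are skipped by the suffix filter.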
Origami's processing happens in separated stages, with batches that read and write
information from well-defined files (also called artifacts). Each batch creates
and depends upon various artifacts, as shown in the following
table. Rows depict artifacts, columns depict detection batches (i.e. the batches
found under origami.batch.detect). Blank circles indicate a read, filled
circles indicate a write. As illustrated here, later batches depend on information
provided by earlier batches.
segment | contours | flow | dewarp | layout | lines | order | ocr | compose | |
---|---|---|---|---|---|---|---|---|---|
page image | ◯ | ◯ | ◯ | ◯ | ◯ | ||||
segment.zip | ⬤ | ◯ | ◯ | ◯ | ◯ | ||||
contours.0.zip | ⬤ | ◯ | ◯ | ◯ | |||||
flow.zip | ⬤ | ◯ | |||||||
lines.0.zip | ⬤ | ◯ | |||||||
contours.1.zip | ⬤ | ◯ | ◯ | ||||||
dewarp.zip | ⬤ | ◯ | |||||||
contours.2.zip | ⬤ | ◯ | ◯ | ||||||
tables.json | ⬤ | ◯ | ◯ | ◯ | |||||
contours.3.zip | ⬤ | ◯ | ◯ | ||||||
lines.3.zip | ⬤ | ◯ | ◯ | ◯ | |||||
order.json | ⬤ | ◯ | |||||||
ocr.zip | ⬤ | ◯ | |||||||
compose.zip | ⬤ |
Given an OCR model, and as illustrated in the table in the previous section, the detection batches needed to perform OCR on a folder of documents must be run in this order:
1 | segment |
2 | contours |
3 | flow |
4 | dewarp |
5 | layout |
6 | lines |
7 | order |
8 | ocr |
9 | compose |
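The stage order above can be turned into a small driver that builds one command line per batch. This is a sketch, not part of Origami: the helper names `pipeline_commands` and `run_pipeline` are hypothetical, and stage-specific options are passed through untouched via `extra_args`. Only the `python -m origami.batch.detect.<stage> DATA_PATH` invocation pattern and the `--model` option of the segment batch are taken from this README.

```python
import subprocess
from typing import Dict, List, Optional

# Order as listed in the table above.
STAGES = [
    "segment", "contours", "flow", "dewarp", "layout",
    "lines", "order", "ocr", "compose",
]


def pipeline_commands(
    data_path: str,
    extra_args: Optional[Dict[str, List[str]]] = None,
) -> List[List[str]]:
    """Build one argv list per stage, in pipeline order."""
    extra_args = extra_args or {}
    return [
        ["python", "-m", f"origami.batch.detect.{stage}", data_path,
         *extra_args.get(stage, [])]
        for stage in STAGES
    ]


def run_pipeline(data_path: str, extra_args=None) -> None:
    """Run all stages in order, stopping at the first failing one."""
    for command in pipeline_commands(data_path, extra_args):
        subprocess.run(command, check=True)
```

A call such as `run_pipeline("/data/pages", {"segment": ["--model", "/path/to/model"]})` would pass the model path to the segment batch. The ocr batch also needs a model; check its command line help for the exact option and add it to `extra_args` in the same way.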
Batch processes can be run concurrently. To coordinate them, Origami supports either file-based locking or a lock database (see --lock-strategy). The latter strategy is more compatible and is the default.
Use --lock-database to specify the path to a lock database (if none is specified, Origami will create one in your data folder).
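As an illustration of concurrent runs, the sketch below starts several processes of the same batch on one data folder and waits for all of them. It assumes that this is the intended form of concurrency, i.e. that the locking described above lets parallel processes share the work. The helper names `worker_command` and `run_workers` are hypothetical; `--lock-database` is the option named above.

```python
import subprocess
from typing import List, Optional


def worker_command(
    stage: str, data_path: str, lock_database: Optional[str] = None
) -> List[str]:
    """Build the argv for one worker process of a detection batch."""
    command = ["python", "-m", f"origami.batch.detect.{stage}", data_path]
    if lock_database is not None:
        # Point every worker at the same lock database.
        command += ["--lock-database", lock_database]
    return command


def run_workers(
    stage: str, data_path: str, n_workers: int,
    lock_database: Optional[str] = None,
) -> List[int]:
    """Start `n_workers` identical processes and return their exit codes."""
    command = worker_command(stage, data_path, lock_database)
    processes = [subprocess.Popen(command) for _ in range(n_workers)]
    return [process.wait() for process in processes]
```

If `lock_database` is left as `None`, the option is omitted and Origami falls back to creating a lock database in the data folder, as described above.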
It is possible to replace Origami pipeline stages/batches by custom implementations by simply reading and writing Origami's artifacts using the documented file formats.
It is also possible to run Origami stages and then postprocess the generated artifacts before continuing with later stages.
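A first step towards postprocessing or replacing a stage is to look inside the artifacts it produces. The sketch below only opens the containers (a `.zip` archive or a `.json` file, matching the artifact names in the table above). It makes no assumption about the internal layout of an artifact, which is defined by Origami's file format documentation; both helper names are hypothetical.

```python
import json
import zipfile
from typing import Any, List


def list_zip_artifact(artifact_path: str) -> List[str]:
    """Return the sorted member names of a .zip artifact."""
    with zipfile.ZipFile(artifact_path) as archive:
        return sorted(archive.namelist())


def load_json_artifact(artifact_path: str) -> Any:
    """Parse a .json artifact (e.g. order.json) into Python objects."""
    with open(artifact_path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

Inspecting an artifact this way before and after a custom postprocessing step is a cheap check that later stages will still find the members they expect.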
- origami.batch.detect.segment
- Performs segmentation (e.g. separation into text and background) on all images
using a neural network model.
If you have not trained a custom model, you should download and use Origami's default model. Specify the path to the downloaded model via the `--model` argument when calling `origami.batch.detect.segment`.
The predicted classes and labels are embedded in the specified model.
- origami.batch.detect.contours
- From the pixelwise segmentation information, detects connected components to produce vectorized polygonal contours for blocks and separator lines.
- origami.batch.detect.flow
- Detects baselines and warping in separators to produce an overall description of page curvature.
- origami.batch.detect.dewarp
- Creates a dewarping transformation that is used in subsequent stages.
- origami.batch.detect.layout
- Refines regions by fixing over- and under-segmentation via heuristic rules.
- origami.batch.detect.lines
- Detects baselines and line boundaries for each text line.
- origami.batch.detect.order
- Finds a reading order using a variant of the XY Cut algorithm.
- origami.batch.detect.ocr
- Performs OCR on each detected line using the specified Calamari OCR model. For more details on OCR models, see the section on Origami OCR models.
- origami.batch.detect.compose
- Composes text into one file using the detected reading order. Can also produce PageXML output.
- origami.batch.detect.stats
- Prints out statistics on computed artifacts and errors. This is useful for understanding how many pages were processed, and for which stages this processing is finished.
- origami.batch.annotate.contours
- Produces debug images for understanding the result of the contours batch stage.
- origami.batch.annotate.layout
- Produces debug images for understanding the result of the layout and order batch stage.
- origami.tool.sample
- Create a new annotation database by randomly sampling lines from a corpus. The details of sampling (numbers of items for each segmentation label type per page) can be specified. Allows import of transcriptions stored in accompanying PageXML. See command line help for more details.
- origami.tool.schema
- Run an annotation normalization schema on the given ground truth text files.
- origami.tool.export
- From the given annotation database, export line images of the specified height and binarization together with accompanying ground truth text files. Annotation normalization through a schema is supported. Use this command to generate training data for Calamari. See command line help for details.
- origami.tool.xycut
- Debug internal X-Y cut implementation.
- origami.batch.export.lines (debugging only)
- Export images of lines detected during lines batch.
- origami.batch.export.pagexml (debugging only)
- Export polygons of lines detected during lines batch as PageXML.
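The order batch listed above is described as using a variant of the XY Cut algorithm. For orientation, here is a minimal sketch of the plain textbook recursive XY cut on axis-aligned bounding boxes. It is not Origami's implementation and does not reproduce its variant; the `Box` layout and function names are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downwards


def _split(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes separated by empty gaps along an axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda box: box[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] >= reach:
            groups.append([box])  # gap found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in reading order (top to bottom, then left to right)."""
    if len(boxes) <= 1:
        return list(boxes)
    for axis in (1, 0):  # try horizontal cuts first, then vertical cuts
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [box for group in groups for box in xy_cut(group)]
    # No cut separates the boxes: fall back to a simple positional sort.
    return sorted(boxes, key=lambda box: (box[1], box[0]))
```

For a page with a full-width headline above two columns, the first horizontal cut separates the headline, a vertical cut then separates the columns, and each column is read top to bottom before moving to the next.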
For generating ground truth for training an OCR engine from a corpus, we suggest this general process:
- Run batches up to lines on your page images.
- Sample random lines using origami.tool.sample.
- Fine tune your training corpus using origami.tool.pick (optional).
- Annotate using origami.tool.annotate.
- Export annotations using origami.tool.export.
- Train your OCR model.
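Before training, it is worth checking that every exported line image has a ground truth text file next to it. The sketch below assumes the common Calamari naming convention, in which `NAME.png` is paired with `NAME.gt.txt`, and that the export step follows it; both suffixes are parameters so they can be adapted. The helper name `unpaired_line_images` is hypothetical.

```python
from pathlib import Path
from typing import List


def unpaired_line_images(
    export_dir, image_suffix: str = ".png", text_suffix: str = ".gt.txt"
) -> List[Path]:
    """Return line images below `export_dir` that lack a ground truth file."""
    root = Path(export_dir)
    missing = []
    for image in sorted(root.rglob(f"*{image_suffix}")):
        # Swap only the final image suffix for the ground truth suffix.
        stem = image.name[: -len(image_suffix)]
        if not image.with_name(stem + text_suffix).is_file():
            missing.append(image)
    return missing
```

An empty result means every image has a transcription. Note that this sketch strips only the last suffix, so image names with several extensions (such as `NAME.bin.png`) would need an adjusted pairing rule.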
For line-based OCR, Origami uses Calamari internally and therefore can be used with any Calamari model.
However, Origami's way of segmenting lines is slightly different from other pipelines: lines are not binarized and they are not scaled horizontally (therefore they might be wider than what some models are trained on).
One model specifically trained for Origami is the model used to perform OCR on the Berliner Börsen-Zeitung. The model (and more context on its training) is available under https://github.com/poke1024/origami_models
Another suitable model is the GT4HistOCR model for Calamari. Note that you need to enable binarization in the OCR for the latter.
To evaluate performance using Dinglehopper, you probably want to use:
python -m origami.batch.utils.evaluate DATA_PATH
Alternatively, you can create PageXML files manually:
python -m origami.batch.detect.compose DATA_PATH \
--page-xml --only-page-xml-regions \
--regions regions/TEXT \
--ignore-letters "{}[]"
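To inspect the text in a generated PageXML file, the standard library XML parser is sufficient. The sketch below collects the `TextEquiv/Unicode` text attached directly to each element of a given type, ignoring the XML namespace so that it works across PAGE schema versions. It assumes that the export stores text at region level (as `--only-page-xml-regions` suggests); pass `element="TextLine"` if your files carry line-level text. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def _local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def page_xml_texts(xml_text: str, element: str = "TextRegion") -> List[str]:
    """Return the text of every `element`, in document order."""
    root = ET.fromstring(xml_text)
    texts = []
    for node in root.iter():
        if _local_name(node.tag) != element:
            continue
        # Only direct TextEquiv children, not those of nested lines or words.
        for equiv in node:
            if _local_name(equiv.tag) != "TextEquiv":
                continue
            for child in equiv:
                if _local_name(child.tag) == "Unicode":
                    texts.append(child.text or "")
    return texts
```

Document order is not necessarily reading order: PageXML records the latter in a separate `ReadingOrder` element, which this sketch does not evaluate.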