Skip to content

fmalato/HolySheet

Repository files navigation

HolySheet

Word crawler for ancient documents. This software provides methods for binarization and word segmentation of ancient document, like Genesis (from Holy Bible). In particular it can:

  • Elaborate .png scans of ancient document with python libraries like openCV

  • Get word segmentation, using histogram pixel techniques and specific heuristics

  • Prepare a dataset for Detectron neural network

To achieve that we process an image in several ways: first, we perform a rotation in order to straighten it up. Through OpenCV's threshold() method, our image becomes binarized.

Original Genesis image and his binarization.

Then, we make a histogram so that we can obtain a segmentation of the single line.

Line segmentation.

Finally, we repeat the histogram and we affine the results through another method call, calimero(), which helps us spotting the periods and the full stops. This way we obtain the single word segmentation.

Word segmentation.

Moreover the software can split the images into sub-images and generate the annotations for them, making them suitable for a COCO-based neural network.

About

Word crawler for ancient documents

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages