
Text preparation pipeline (digital witnesses) for training text recognition models. Retrieves texts from Sefaria.org, analyzes structure, cleans, concatenates and creates an index of text content. Texts are then ready for alignment search on OCR results with Passim.


Freymat/from_Sefaria_to_Passim


Creating Ground Truth

This repository contains pipelines for creating ground truth, which we'll be using with passim and ACDC.

Pipeline steps:

  1. Retrieve texts from sefaria.org
  2. Prepare texts for use with passim
    • Analyze the JSON file structure and process each file according to its particular structure. The code adapts to different structures: single books or corpora of books, with simple sequences of chapters and verses or with more complex structures (nodes).
    • Clean up the texts (HTML tags, stray Unicode characters, numbers...). Tools are provided to help you identify the elements to clean up.
    • Create an index from the structured JSON files. The index contains the start and end character positions of each text chunk in the book.
    • Concatenate the lines of each text. Thanks to the index, the reference of any fragment of the concatenated text can always be recovered.
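
The preparation steps above can be illustrated with a minimal sketch. It is not the repository's code: the function names (`clean`, `flatten`, `build_index`, `refs_for_span`), the `chapter.verse` reference format, and the assumption that a book is a nested list of strings are all hypothetical, chosen only to show how a character-offset index lets you map a span of the concatenated text back to its references.

```python
import html
import re
import unicodedata
from bisect import bisect_right

TAG_RE = re.compile(r"<[^>]+>")   # HTML tags
WS_RE = re.compile(r"\s+")        # runs of whitespace


def clean(text):
    """Strip HTML tags, entities and digits; normalize Unicode and spaces."""
    text = html.unescape(TAG_RE.sub(" ", text))
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"\d+", " ", text)
    return WS_RE.sub(" ", text).strip()


def flatten(node, path=()):
    """Yield (ref, text) for each leaf string of a nested list.

    Assumption: refs are 1-based positions joined by dots, e.g. "2.5".
    """
    if isinstance(node, str):
        yield ".".join(map(str, path)), node
    else:
        for i, child in enumerate(node, start=1):
            yield from flatten(child, path + (i,))


def build_index(chunks, sep=" "):
    """Clean and concatenate chunks, recording each chunk's offsets."""
    index, parts, pos = [], [], 0
    for ref, raw in chunks:
        text = clean(raw)
        if not text:
            continue  # skip chunks that are empty after cleaning
        index.append({"ref": ref, "start": pos, "end": pos + len(text)})
        parts.append(text)
        pos += len(text) + len(sep)
    return sep.join(parts), index


def refs_for_span(index, start, end):
    """Return the refs of all chunks overlapping text[start:end]."""
    starts = [entry["start"] for entry in index]
    i = max(bisect_right(starts, start) - 1, 0)
    refs = []
    while i < len(index) and index[i]["start"] < end:
        if index[i]["end"] > start:
            refs.append(index[i]["ref"])
        i += 1
    return refs
```

With such an index, an alignment reported on the concatenated text as a pair of character offsets can be translated back into chunk references by calling `refs_for_span(index, start, end)`. Storing half-open `[start, end)` offsets keeps `text[start:end]` equal to the cleaned chunk, so the separator never leaks into a chunk.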

Note: the Talmud pipelines are the most elaborate and are the best starting point if you want to extend the code.
