
A Python script for collating LombardPress Schema compliant transcriptions.


stenskjaer/collator


Collator is a simple, early-stage script that assists in collating an arbitrary number of TEI XML transcriptions of a text. It uses the collation features provided by CollateX and creates an HTML document with a representation of the witnesses that is inspired by the CollateX HTML output.

It is basically a wrapper for the CollateX CLI. It converts the witnesses into plain text with a very small XSLT script (and therefore also uses Saxon, a dependency I could maybe cut). It then writes those witnesses to a temporary JSON file that it feeds to CollateX, which returns a nested list that the script processes into an HTML representation.
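The two Python-side steps of that pipeline can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names `write_collatex_input` and `alignment_to_html` are hypothetical, the input follows the `{"witnesses": [{"id": ..., "content": ...}]}` shape that CollateX accepts as JSON, and the nested list is assumed to hold one list of token strings per witness for each aligned segment.

```python
import html
import json
import tempfile
from pathlib import Path


def write_collatex_input(witnesses):
    """Write {siglum: plain text} to a temporary JSON file and return its path."""
    payload = {
        "witnesses": [
            {"id": siglum, "content": text} for siglum, text in witnesses.items()
        ]
    }
    # delete=False keeps the file on disk so an external process can read it.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(payload, handle, ensure_ascii=False)
    return Path(handle.name)


def alignment_to_html(sigla, table):
    """Render an alignment table as HTML with one row per witness.

    Assumption: `table` is a list of segments, and each segment holds one
    list of token strings per witness (an empty list where a witness has
    no text in that segment).
    """
    rows = []
    for index, siglum in enumerate(sigla):
        cells = "".join(
            "<td>{}</td>".format(html.escape("".join(segment[index]).strip()))
            for segment in table
        )
        rows.append("<tr><th>{}</th>{}</tr>".format(html.escape(siglum), cells))
    return "<table>\n{}\n</table>".format("\n".join(rows))
```

The call to the CollateX jar is left out on purpose, since its exact command line depends on the vendored version.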

The script is developed to handle LombardPress Schema compliant material, but for now it might handle many other TEI documents well, as the encoding conventions of the document are not central to it.

Installation

Requirements

  • Python 3.6
  • Java Runtime Environment

Vendored binaries

The script uses Saxon for XML processing and CollateX for collation. The binaries for both are included in the vendor directory, so neither needs a separate installation.

But you do need to have a functional Java Runtime Environment.

Run without any installation

The only external dependency right now is the wonderful docopt module. I want to shed this dependency, but for now, the script needs it.

Before you install anything, you should probably create a virtual environment for the project. To do that, run:

$ mkvirtualenv -p python3 <name>

Where <name> is the name you want to give the venv.

After activating the venv (workon or source), install dependencies:

$ pip install -r requirements.txt

Now you can run the script from its directory with ./collator.py.

Install development version for testing

If you want to try it out, and maybe fiddle with the script yourself (PRs are very welcome!), I would recommend creating a virtual environment and installing the script in development mode:

$ pip3 install -e .

Notice the dot! Now you can run $ collator.py from anywhere.

With the -e flag, the script is installed with a symlink to the file, so changes are immediately available in the command line script without updating the install.

Once you leave the virtual environment, the script is no longer available. When you delete the environment, it's gone.

Install permanently

If you just want to be able to use the script at any time, from the directory of the script, run:

$ pip3 install .

Now collator.py should be globally available.

Usage

To see some quick results, run:

$ ./collator.py examples/bal311_da-49-l1prooemium.xml examples/oriel33_da-49-l1prooemium.xml

This produces the HTML file that is already in the examples directory.

The usage statement:

Usage: collator.py [options] <file> <file>...

A script for simplifying collation of several text witnesses encoded according
to the Lombard Press Schema.

Arguments:
  <file> <file>...        Two or more files that are to be collated.

Options:
  -o, --output <file>     Location of the output file. [default: ./output.html].
  -V, --verbosity <level> Set verbosity. Possibilities: silent, info, debug [default: info].
  -v, --version           Show version and exit.
  -h, --help              Show this help message and exit.

The input files must be XML files. They will be converted to plain text during processing. The following elements will be preserved in the plain text for later analysis:

  • unclear
  • pb
  • supplied
  • secl
  • del
  • add

When a word is normalized with <choice><orig>sicud</orig><reg>sicut</reg></choice>, the regularized form is used.
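To illustrate that rule, here is a small Python-only sketch. The script itself does this conversion with XSLT and Saxon, so this is not its implementation: the function name `regularized_text` is hypothetical, the TEI namespace is assumed, and the sketch flattens everything to text without preserving the elements listed above.

```python
import xml.etree.ElementTree as ET

# Assumption: the witnesses use the standard TEI namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"


def regularized_text(xml_string):
    """Return the plain text of a TEI fragment, preferring <reg> over <orig>."""
    root = ET.fromstring(xml_string)
    # Collect the <choice> elements first so the tree is not changed mid-iteration.
    for choice in list(root.iter(TEI + "choice")):
        for orig in choice.findall(TEI + "orig"):
            choice.remove(orig)  # drop the original spelling, keep <reg>
    # Join all remaining text nodes and normalize whitespace.
    return " ".join("".join(root.itertext()).split())


sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Et '
    "<choice><orig>sicud</orig><reg>sicut</reg></choice> dicit</p>"
)
print(regularized_text(sample))  # Et sicut dicit
```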

Warning

The script is volatile. Anything may be subject to change, and I provide no warranties for the safety of your texts, computer equipment or soul when using the script.
