pyigt: Handling interlinear glossed text with Python

This library provides easy access to Interlinear Glossed Text (IGT) according to the Leipzig Glossing Rules, stored as CLDF examples.

Installation

Installing pyigt via pip

pip install pyigt

will install the Python package along with a command line interface igt.

Note: The methods Corpus.get_wordlist and Corpus.get_profile, which extract a wordlist and an orthography profile from a corpus, require the lingpy package. To make sure it is installed, install pyigt with the lingpy extra:

pip install pyigt[lingpy]
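
If it is unclear whether the optional dependency is present in an environment, a quick check before calling those two methods can help. This is a minimal sketch using only the standard library; the helper name has_lingpy is hypothetical and not part of pyigt.

```python
import importlib.util


def has_lingpy():
    """Return True if the optional lingpy package can be imported."""
    # find_spec returns None when no top-level module of that name is installed.
    return importlib.util.find_spec("lingpy") is not None


if not has_lingpy():
    print("lingpy is missing; install it with: pip install pyigt[lingpy]")
```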

CLI

$ igt -h
usage: igt [-h] [--log-level LOG_LEVEL] COMMAND ...

optional arguments:
  -h, --help            show this help message and exit
  --log-level LOG_LEVEL
                        log level [ERROR|WARN|INFO|DEBUG] (default: 20)

available commands:
  Run "COMAMND -h" to get help for a specific command.

  COMMAND
    ls                  List IGTs in a CLDF dataset
    stats               Describe the IGTs in a CLDF dataset

The igt ls command lets you inspect IGTs from the command line. Each example is formatted using the four standard lines described in the Leipzig Glossing Rules, with analyzed text and glosses aligned, e.g.

$ igt ls tests/fixtures/examples.csv 
Example 1:
zəple: ȵike: peji qeʴlotʂuʁɑ,
zəp-le:       ȵi-ke:       pe-ji       qeʴlotʂu-ʁɑ,
earth-DEF:CL  WH-INDEF:CL  become-CSM  in.the.past-LOC

...

Example 5:
zuɑməɸu oʐgutɑ ipiχuɑȵi,
zuɑmə-ɸu      o-ʐgu-tɑ    i-pi-χuɑ-ȵi,
cypress-tree  one-CL-LOC  DIR-hide-because-ADV

IGT corpus at tests/fixtures/examples.csv
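
The column alignment in this output can be reproduced with a few lines of plain Python. This is only an illustrative sketch of the padding idea, not the code igt ls uses; the function name align is hypothetical.

```python
def align(words, glosses, sep="  "):
    """Pad each word/gloss pair to a common width so the columns line up."""
    if len(words) != len(glosses):
        raise ValueError("need exactly one gloss per word")
    # Each column is as wide as the longer of the word and its gloss.
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    word_line = sep.join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = sep.join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return word_line, gloss_line


words = "zəp-le: ȵi-ke: pe-ji qeʴlotʂu-ʁɑ,".split()
glosses = "earth-DEF:CL WH-INDEF:CL become-CSM in.the.past-LOC".split()
for line in align(words, glosses):
    print(line)
```

Note that len counts code points rather than display width, so text containing combining characters may need a more careful width calculation.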

igt ls can be chained with other command-line tools, such as the commands from the csvkit package, for filtering:

$ csvgrep -c Primary_Text -m"ȵi"  tests/fixtures/examples.csv | csvgrep -c Gloss -m"ADV" |  igt ls -
Example 5:
zuɑməɸu oʐgutɑ ipiχuɑȵi,
zuɑmə-ɸu      o-ʐgu-tɑ    i-pi-χuɑ-ȵi,
cypress-tree  one-CL-LOC  DIR-hide-because-ADV
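
The same two-step substring filter can be sketched in Python with the standard csv module. The sample rows below are assumptions made for illustration: they reuse the texts shown above, but the ID column and the space-separated Gloss values only mimic the layout and may differ from the actual examples.csv.

```python
import csv
import io

# Illustrative rows only; the real file may use other columns and separators.
SAMPLE = """ID,Primary_Text,Gloss
1,"zəple: ȵike: peji qeʴlotʂuʁɑ,","earth-DEF:CL WH-INDEF:CL become-CSM in.the.past-LOC"
5,"zuɑməɸu oʐgutɑ ipiχuɑȵi,","cypress-tree one-CL-LOC DIR-hide-because-ADV"
"""


def filter_rows(rows, text_substring, gloss_substring):
    """Keep rows whose Primary_Text and Gloss both contain the given substrings."""
    return [
        row for row in rows
        if text_substring in row["Primary_Text"] and gloss_substring in row["Gloss"]
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for row in filter_rows(rows, "ȵi", "ADV"):
    print(row["ID"], row["Primary_Text"])
```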

Python API

The Python API is documented in detail at readthedocs. Below is a quick overview.

You can read all IGT examples provided with a CLDF dataset:

>>> from pyigt import Corpus
>>> corpus = Corpus.from_path('tests/fixtures/cldf-metadata.json')
>>> len(corpus)
5
>>> for igt in corpus:
...     print(igt)
...     break
... 
zəple: ȵike: peji qeʴlotʂuʁɑ,
zəp-le:       ȵi-ke:       pe-ji       qeʴlotʂu-ʁɑ,
earth-DEF:CL  WH-INDEF:CL  become-CSM  in.the.past-LOC

You can also instantiate individual IGT examples, e.g. to check their validity:

>>> from pyigt import IGT
>>> ex = IGT(phrase="palasi=lu", gloss="priest-and")
>>> ex.check(strict=True, verbose=True)
palasi=lu
priest-and
...
ValueError: Rule 2 violated: Number of morphemes does not match number of morpheme glosses!
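
To see why this example fails, recall that Rule 2 of the Leipzig Glossing Rules asks for the same number of hyphens in the example and in the gloss, and for clitic boundaries to be marked with an equals sign in both lines. The sketch below checks exactly that in plain Python. It is one reading of the rule, not pyigt's implementation, and the function names are hypothetical.

```python
def separator_counts(token):
    """Count hyphens (affix boundaries) and equals signs (clitic boundaries)."""
    return token.count("-"), token.count("=")


def rule2_mismatches(phrase, gloss):
    """Return (word, gloss) pairs whose boundary markers do not correspond."""
    words, glosses = phrase.split(), gloss.split()
    if len(words) != len(glosses):
        raise ValueError("number of words does not match number of word glosses")
    return [
        (w, g) for w, g in zip(words, glosses)
        if separator_counts(w) != separator_counts(g)
    ]


# The clitic boundary "=" is glossed with "-", so the pair is reported.
print(rule2_mismatches("palasi=lu", "priest-and"))
# Using "=" in both lines yields no mismatches.
print(rule2_mismatches("palasi=lu", "priest=and"))
```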

An IGT can also be pretty-printed with its known gloss abbreviations expanded:

>>> ex = IGT(phrase="Gila abur-u-n ferma hamišaluǧ güǧüna amuq’-da-č.",
...          gloss="now they-OBL-GEN farm forever behind stay-FUT-NEG", 
...          translation="Now their farm will not stay behind forever.")
>>> ex.pprint()
Gila aburun ferma hamišaluǧ güǧüna amuqdač.
Gila    abur-u-n      ferma    hamišaluǧ    güǧüna    amuq-da-č.
now     they-OBL-GEN  farm     forever      behind    stay-FUT-NEG
‘Now their farm will not stay behind forever.’
  OBL = oblique
  GEN = genitive
  FUT = future
  NEG = negation, negative
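
The abbreviation key printed above can be approximated with a dictionary lookup over the uppercase labels in the gloss line. The sketch below is not how pyigt does it. The table holds only the four abbreviations from the output above, and the name used_abbreviations is hypothetical.

```python
import re

# Only the four abbreviations shown in the output above; the full list in the
# Leipzig Glossing Rules is much longer.
ABBREVIATIONS = {
    "OBL": "oblique",
    "GEN": "genitive",
    "FUT": "future",
    "NEG": "negation, negative",
}


def used_abbreviations(gloss, known=ABBREVIATIONS):
    """Return (abbreviation, meaning) pairs in order of first appearance."""
    seen = []
    # Category labels are written in capitals; lexical glosses are lowercase.
    for label in re.findall(r"[A-Z][A-Z0-9]*", gloss):
        if label in known and label not in seen:
            seen.append(label)
    return [(label, known[label]) for label in seen]


gloss = "now they-OBL-GEN farm forever behind stay-FUT-NEG"
for label, meaning in used_abbreviations(gloss):
    print(f"  {label} = {meaning}")
```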

And you can go deeper, parsing morphemes and glosses according to the LGR (see module pyigt.lgrmorphemes):

>>> igt = IGT(phrase="zəp-le: ȵi-ke: pe-ji qeʴlotʂu-ʁɑ,", gloss="earth-DEF:CL WH-INDEF:CL become-CSM in.the.past-LOC")
>>> igt.conformance
<LGRConformance.MORPHEME_ALIGNED: 2>
>>> igt[1, 1].gloss
<Morpheme "INDEF:CL">
>>> igt[1, 1].gloss.elements
[<GlossElement "INDEF">, <GlossElementAfterColon "CL">]
>>> igt[1, 1].morpheme
<Morpheme "ke:">
>>> print(igt[1, 1].morpheme)
ke:
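
The indexing above (second word, second morpheme) can be mirrored with plain string splitting, which may help when reasoning about what the parser returns. This is a simplified sketch, not the pyigt.lgrmorphemes implementation. The helper names are hypothetical, and the choice to split gloss elements at both "." and ":" is an assumption.

```python
import re


def split_word(word):
    """Split a word or word gloss into morphemes at '-' and '=' boundaries."""
    return re.split(r"[-=]", word)


def gloss_elements(morpheme_gloss):
    """Split a morpheme gloss into elements at '.' and ':' separators."""
    return re.split(r"[.:]", morpheme_gloss)


phrase = "zəp-le: ȵi-ke: pe-ji qeʴlotʂu-ʁɑ,".split()
gloss = "earth-DEF:CL WH-INDEF:CL become-CSM in.the.past-LOC".split()

# Second word, second morpheme (both indices are zero-based).
morpheme = split_word(phrase[1])[1]
morpheme_gloss = split_word(gloss[1])[1]
print(morpheme)
print(morpheme_gloss)
print(gloss_elements(morpheme_gloss))
```

Only the gloss is split at colons here. In the object-language line the colon in "ke:" belongs to the morpheme itself, so it is left untouched.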

See also

  • interlineaR - an R package with similar functionality, but with support for more input formats.