LIMA Python User Manual

Gaël de Chalendar edited this page Jun 24, 2024 · 3 revisions

This page documents how to use the LIMA Python package, including the new experimental libtorch-based modules.

Installation

The LIMA Python bindings are currently available under Linux only (x86_64).

Under Linux, with Python >= 3.7 and < 4 and an up-to-date pip:

# Upgrading pip is essential to obtain the correct LIMA version
$ pip install --upgrade pip
$ pip install aymara==0.5.0b6
$ lima_models -i eng
# Either simply use the lima command to produce an analysis of a file in CoNLLU format:
$ lima <path to the file to analyse>
# Or use the python API:
$ python
>>> import aymara.lima
>>> nlp = aymara.lima.Lima("ud-eng")
>>> doc = nlp('Hello, World!')
>>> print(doc[0].lemma)
hello
>>> print(repr(doc))
1       Hello   hello   INTJ    _       _               0       root    _       Pos=0|Len=5
2       ,       ,       PUNCT   _       _               1       punct   _       Pos=5|Len=1
3       World   World   PROPN   _       Number:Sing     1       vocative        _       Pos=7|Len=5
4       !       !       PUNCT   _       _               1       punct   _       Pos=12|Len=1

Running LIMA for the first time

First, make sure that you have installed the models for the language you want to analyze. Legacy TensorFlow-based models are managed with the lima_models command:

usage: lima_models [-h] [-a] [-i INSTALL] [-d DEST] [-r REMOVE] [-s SELECT] [-f] [-l]

options:
  -h, --help            show this help message and exit
  -a, --avail           print list of available languages and exit
  -i INSTALL, --install INSTALL
                        install model for the given language name or language code
                        (example: 'english' or 'eng')
  -d DEST, --dest DEST  destination directory
  -r REMOVE, --remove REMOVE
                        delete model for the given language name or language code
                        (example: 'english' or 'eng')
  -s SELECT, --select SELECT
                        select particular models to install: tokenizer, morphosyntax,
                        lemmatizer (comma-separated list)
  -f, --force           force reinstallation of existing files
  -l, --list            list installed models

So,

  • to check installed models: lima_models -l
  • to list available models: lima_models -a
  • to install models for e.g. Tamil: lima_models -i tam
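
The three invocations above can also be scripted. The following is a minimal sketch, assuming that the lima_models command is on the PATH after installing the aymara package; the helper names lima_models_command and ensure_model are hypothetical and not part of LIMA. It only uses the -l, -a and -i options documented in the usage text.

```python
import shutil
import subprocess


def lima_models_command(action, language=None):
    """Build the argument list for one lima_models invocation.

    Hypothetical helper: it only maps an action name to the
    -l, -a and -i options listed in the usage text above.
    """
    if action == "list":
        return ["lima_models", "-l"]
    if action == "avail":
        return ["lima_models", "-a"]
    if action == "install":
        if not language:
            raise ValueError("install needs a language name or code")
        return ["lima_models", "-i", language]
    raise ValueError(f"unknown action: {action}")


def ensure_model(language):
    """Install the legacy model for `language` by calling lima_models."""
    # Assumption: the command is available on the PATH.
    if shutil.which("lima_models") is None:
        raise RuntimeError("lima_models not found on PATH")
    subprocess.run(lima_models_command("install", language), check=True)
```

For instance, ensure_model("tam") would run lima_models -i tam and raise an exception if the command fails.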

Now choose UTF-8 encoded text files in one of the languages whose models are installed (English in this example) and run the following commands in a terminal or command prompt:

cd /path/to/your/text/files/folder

lima -l ud-eng -p deepud file.txt

This writes the result of the analysis to standard output in CoNLL-U Plus format. The table below is taken from the former web site and adapted for LIMA:

| Field number | Field name | Description |
|---|---|---|
| 1 | ID | Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0). |
| 2 | FORM | Word form or punctuation symbol. |
| 3 | LEMMA | Lemma or stem of word form, or an underscore if not available. |
| 4 | UPOS | Part-of-speech tag. |
| 5 | XPOS | Language-specific part-of-speech tag; underscore if not available. |
| 6 | FEATS | List of morphological features from the universal feature inventory. |
| 7 | HEAD | Head of the current word, which is either a value of ID or zero (0). |
| 8 | DEPREL | Dependency relation to the HEAD. |
| 9 | DEPS | Enhanced dependency graph in the form of a list of head-deprel pairs; always an underscore in LIMA, which does not produce it. |
| 10 | MISC | Any other annotation. Pipe-separated list of key=value pairs. In LIMA there is always Pos (position) and Len (length). |
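
The ten fields can be read into a small record type. The following is a minimal sketch, assuming the standard CoNLL-U layout of ten tab-separated columns per token line, in the order given in the table; actual LIMA output may contain additional columns, in which case the check below has to be adapted. The names Token and parse_token_line are hypothetical, and the sample line is illustrative.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One token line, with the fields in the order of the table above."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_token_line(line):
    """Split one tab-separated token line into a Token.

    Assumption: exactly ten columns; anything else is rejected.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(Token._fields):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return Token(*columns)


# Illustrative line, not captured from a real LIMA run.
sample = "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tPos=0|Len=5"
token = parse_token_line(sample)
print(token.form, token.upos, token.head)
```

All values are kept as strings because ID may be a range or a decimal number, as the table notes.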

The MISC field also carries the annotations for named entities: the key is "NE" and the value is the type of the entity. The other keys are Pos and Len, the absolute position and the length of the token in the text, and SpaceAfter=No, which is present when there is no space between this token and the next one.
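
The MISC field can be decoded into a dictionary and used to recover the token from the original text. The following is a minimal sketch, assuming that Pos is a 0-based character offset, as the sample output for 'Hello, World!' earlier on this page suggests; the names parse_misc and token_span are hypothetical.

```python
def parse_misc(misc):
    """Turn a pipe-separated list of key=value pairs into a dict."""
    result = {}
    if misc in ("", "_"):
        # An underscore means that the field is empty.
        return result
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        result[key] = value
    return result


def token_span(text, misc):
    """Return the slice of `text` designated by Pos and Len.

    Assumption: Pos is a 0-based character offset into `text`.
    """
    start = int(misc["Pos"])
    length = int(misc["Len"])
    return text[start:start + length]


text = "Hello, World!"
misc = parse_misc("Pos=7|Len=5")
print(token_span(text, misc))  # prints "World"
```

The same dictionary gives access to the other keys, for example misc.get("NE") for the entity type or misc.get("SpaceAfter") for the spacing flag.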

Using the LIMA Python module

The LIMA python API is documented on ReadTheDocs.

DeepLima: experimental libtorch-based models

DeepLima is the future version of LIMA. The available models are already more accurate and faster than the legacy TensorFlow-based models, but no dependency parser model is available yet. To use them, you therefore have to combine both kinds of models, which makes the installation larger and the analysis slower than you might expect.

Even so, the combined pipeline works well and gives better results than the legacy models alone.

To install these new models, you must use the deeplima_models command:

usage: deeplima_models [-h] [-a] [-i INSTALL] [-d DEST] [-r REMOVE] [-f] [-l]

options:
  -h, --help            show this help message and exit
  -a, --avail           print list of available languages and exit
  -i INSTALL, --install INSTALL
                        install model for the given corpus name or language code
                        (example: 'UD_English-EWT' or 'eng')
  -d DEST, --dest DEST  destination directory
  -r REMOVE, --remove REMOVE
                        delete models for the given corpus name or language code
                        (example: 'UD_English-EWT' or 'eng')
  -f, --force           destructive actions (overwriting, removing) without
                        confirmation
  -l, --list            list installed models

For example, first check which models are available:

$ deeplima_models -a
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/langlist.json
100%|█████████████████████████████████████████████| 72.9k/72.9k [00:00<00:00, 5.78MiB/s]
afr             UD_Afrikaans-AfriBooms
[…]
eme             UD_Teko-TuDeT
eng             UD_English-EWT, UD_English-Atis, UD_English-ESL, UD_English-GUM, UD_English-GUMReddit, UD_English-LinES, UD_English-ParTUT, UD_English-Pronouns, UD_English-PUD
spa             UD_Spanish-AnCora, UD_Spanish-GSD, UD_Spanish-PUD
[…]

Then install the models trained on the UD_English-EWT corpus:

$ deeplima_models -i UD_English-EWT
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/langlist.json
100%|██████████████████████████████████████████████| 72.9k/72.9k [00:00<00:00, 989kiB/s]
install_language code: eng, corpus: UD_English-EWT
Code: eng, corpus: UD_English-EWT
Installation dir: /home/gael/.local/share/lima/resources
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/eng-UD_English-EWT.zip
100%|███████████████████████████████████████████████| 579M/579M [01:03<00:00, 9.08MiB/s]

Don't forget that you currently also need the legacy models, which are installed with the lima_models command described above.

You can now use the new models. To do so, you must pass additional parameters:

  • -l ud: choose the ud language
  • --meta: pass the udlang metadata with the language trigram and the corpus name
  • -p deeplima: use the deeplima pipeline

lima -l ud --meta udlang:eng-UD_English-EWT -p deeplima </path/to/the/files/to/analyze>
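
The command line above can be assembled from the language trigram and the corpus name. The following is a minimal sketch, assuming that the lima command is on the PATH and that the udlang value is always the trigram and the corpus name joined by a hyphen, as in the example; the names deeplima_command and analyze are hypothetical.

```python
import subprocess


def deeplima_command(language, corpus, path):
    """Build the lima argument list for the deeplima pipeline."""
    # Assumption: udlang is "<trigram>-<corpus>", as in the example above.
    meta = f"udlang:{language}-{corpus}"
    return ["lima", "-l", "ud", "--meta", meta, "-p", "deeplima", path]


def analyze(language, corpus, path):
    """Run lima on `path` and return what it writes to standard output."""
    completed = subprocess.run(
        deeplima_command(language, corpus, path),
        check=True,           # raise if lima exits with an error
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output as text
    )
    return completed.stdout
```

Calling analyze("eng", "UD_English-EWT", "file.txt") would run the same command as shown above on file.txt.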