LIMA Python User Manual
Table of Contents generated with DocToc
- Installation
- Running LIMA for the first time
- Using the LIMA Python module
- DeepLima: experimental libtorch-based models
This document explains how to use the LIMA Python package, including the new experimental libtorch-based modules.
LIMA Python bindings are currently available under Linux only (x86_64).
Under Linux, with Python >= 3.7 and < 4 and an upgraded pip:
# Upgrading pip is required in order to obtain the correct LIMA version
$ pip install --upgrade pip
$ pip install aymara==0.5.0b6
$ lima_models -i eng
# Either simply use the lima command to produce an analysis of a file in CoNLLU format:
$ lima <path to the file to analyse>
# Or use the python API:
$ python
>>> import aymara.lima
>>> nlp = aymara.lima.Lima("ud-eng")
>>> doc = nlp('Hello, World!')
>>> print(doc[0].lemma)
hello
>>> print(repr(doc))
1 Hello hello INTJ _ _ 0 root _ Pos=0|Len=5
2 , , PUNCT _ _ 1 punct _ Pos=5|Len=1
3 World World PROPN _ Number:Sing 1 vocative _ Pos=7|Len=5
4 ! ! PUNCT _ _ 1 punct _ Pos=12|Len=1
First of all, make sure that you have installed the models for the language you want to analyze. Legacy TensorFlow-based models are managed with the lima_models command:
usage: lima_models [-h] [-a] [-i INSTALL] [-d DEST] [-r REMOVE] [-s SELECT] [-f] [-l]
options:
-h, --help show this help message and exit
-a, --avail print list of available languages and exit
-i INSTALL, --install INSTALL
install model for the given language name or language code
(example: 'english' or 'eng')
-d DEST, --dest DEST destination directory
-r REMOVE, --remove REMOVE
delete model for the given language name or language code
(example: 'english' or 'eng')
-s SELECT, --select SELECT
select particular models to install: tokenizer, morphosyntax,
lemmatizer (comma-separated list)
-f, --force force reinstallation of existing files
-l, --list list installed models
So,
- to check installed models:
lima_models -l
- to list available models:
lima_models -a
- to install models for e.g. Tamil:
lima_models -i tam
Now choose UTF-8 encoded text files in one of the installed model languages (English in this example) and run the following commands in a terminal or command prompt:
cd /path/to/your/text/files/folder
lima -l ud-eng -p deepud file.txt
[^1]
This will write the result of the analysis to standard output in CoNLL-U Plus format. The table below is adapted for LIMA from the CoNLL-U format documentation:
Field number | Field name | Description |
---|---|---|
1 | ID | Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0). |
2 | FORM | Word form or punctuation symbol. |
3 | LEMMA | Lemma or stem of word form, or an underscore if not available. |
4 | UPOS | Part-of-speech tag. |
5 | XPOS | Language-specific part-of-speech tag; underscore if not available. |
6 | FEATS | List of morphological features from the universal feature inventory. |
7 | HEAD | Head of the current word, which is either a value of ID or zero (0). |
8 | DEPREL | Dependency relation to the HEAD. |
9 | DEPS | Enhanced dependency graph in the form of a list of head-deprel pairs, which is an underscore as it is not available in LIMA. |
10 | MISC | Any other annotation. Pipe-separated list of key=value pairs. In LIMA there is always Pos (position) and Len (length). |
The MISC field includes annotations for named entities: in this case, the key is "NE" and the value is the type of the entity. Other keys are Pos and Len, giving the token's absolute position and length in the text, and SpaceAfter=No when there is no space between this token and the next one.
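As an illustration, here is a small Python helper (a sketch, not part of the aymara API) that turns such a MISC field into a dictionary:

```python
# Sketch: parse a CoNLL-U MISC field such as "Pos=7|Len=5" (possibly with
# NE=<entity type> or SpaceAfter=No) into a dict. Plain Python, no aymara call.
def parse_misc(misc: str) -> dict:
    if misc == "_":  # an underscore means there is no annotation
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))

print(parse_misc("Pos=7|Len=5"))  # {'Pos': '7', 'Len': '5'}
```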
The LIMA Python API is documented on ReadTheDocs.
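As a minimal sketch relying only on the calls shown above (the file names are placeholders), you can analyze a UTF-8 text file and save the CoNLL-U output like this:

```python
import aymara.lima

# Load the English pipeline installed earlier with lima_models
nlp = aymara.lima.Lima("ud-eng")

# Read a UTF-8 text file (placeholder name)
with open("file.txt", encoding="utf-8") as f:
    text = f.read()

# Analyze the whole text
doc = nlp(text)

# repr(doc) gives the analysis in CoNLL-U format, as shown earlier
with open("file.conllu", "w", encoding="utf-8") as out:
    out.write(repr(doc))
```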
DeepLima is the future version of LIMA. The available models are already faster and more accurate than the legacy TensorFlow-based ones, but no dependency parser model is available yet. Consequently, to use them you currently have to combine both kinds of models, which makes the process heavier and slower than you might expect. It nevertheless works well, and the results are better than with the legacy models.
To install these new models, you must use the deeplima_models command:
usage: deeplima_models [-h] [-a] [-i INSTALL] [-d DEST] [-r REMOVE] [-f] [-l]
options:
-h, --help show this help message and exit
-a, --avail print list of available languages and exit
-i INSTALL, --install INSTALL
install model for the given corpus name or language code
(example: 'UD_English-EWT' or 'eng')
-d DEST, --dest DEST destination directory
-r REMOVE, --remove REMOVE
delete models for the given corpus name or language code
(example: 'UD_English-EWT' or 'eng')
-f, --force destructive actions (overwriting, removing) without
confirmation
-l, --list list installed models
For example, first check which models are available:
❯ deeplima_models -a
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/langlist.json
100%|█████████████████████████████████████████████| 72.9k/72.9k [00:00<00:00, 5.78MiB/s]
afr UD_Afrikaans-AfriBooms
[…]
eme UD_Teko-TuDeT
eng UD_English-EWT, UD_English-Atis, UD_English-ESL, UD_English-GUM, UD_English-GUMReddit, UD_English-LinES, UD_English-ParTUT, UD_English-Pronouns, UD_English-PUD
spa UD_Spanish-AnCora, UD_Spanish-GSD, UD_Spanish-PUD
[…]
Then install the models learnt from the UD_English-EWT corpus:
$ deeplima_models -i UD_English-EWT
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/langlist.json
100%|██████████████████████████████████████████████| 72.9k/72.9k [00:00<00:00, 989kiB/s]
install_language code: eng, corpus: UD_English-EWT
Code: eng, corpus: UD_English-EWT
Installation dir: /home/gael/.local/share/lima/resources
Downloading https://huggingface.co/aymaralima/deeplima/resolve/main/eng-UD_English-EWT.zip
100%|███████████████████████████████████████████████| 579M/579M [01:03<00:00, 9.08MiB/s]
But don't forget that you currently also need the legacy models installed.
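For example, to install the legacy English models with the lima_models command described above:
lima_models -i eng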
You can now use the new models. To do so, you must pass additional parameters:
- -l ud: choose the ud language
- --meta: pass the udlang metadata with the language trigram and the corpus name (e.g. udlang:eng-UD_English-EWT)
- -p deeplima: use the deeplima pipeline
lima -l ud --meta udlang:eng-UD_English-EWT -p deeplima </path/to/the/files/to/analyze>
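For example, to analyze a single UTF-8 file and keep the result in a file (the file names are placeholders; the analysis is written to standard output in CoNLL-U format as described above):
lima -l ud --meta udlang:eng-UD_English-EWT -p deeplima file.txt > file.conllu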