
This document explains how to install, run, and configure LIMA, the CEA LIST Multilingual Analyzer, and how to integrate it into applications. It also explains how to use the LIMA python package.

Installation

Installers and binary packages

We provide packages for two different Ubuntu GNU/Linux versions and Microsoft Windows.

We also provide a python module available in PyPI:

$ pip install aymara==0.5.0b6

See below for its use.

Sources compilation

See the INSTALL file at the root of LIMA sources.

Running LIMA for the first time

The simplest way to start is to use the LIMA graphical user interface by running the command lima from the shell, or by clicking on lima.exe under Windows. Hit the 'Analyze some text' button in the top toolbar (note that the 'Analyze file' button currently does not work under Windows). Write or paste some text, select 'English', 'CONLL Format' and 'main' in the bottom drop-down menus, then hit the 'Analyze' button. You should get something like this:

LIMA GUI snapshot

As you can see, you obtain for each token (word) its part of speech, its type if it is a named entity, and its syntactic head and dependency relation (HEAD and DEPREL columns).

This GUI is still very simple; to exploit LIMA for real use, you will have to use command-line programs and manually edit its configuration files. Let's have a look at that...

First of all, you must ensure that you have installed the models for the language you want to analyze. All model handling is done with the lima_models.py script:

usage: lima_models.py [-h] [-i] [-l LANG] [-d DEST] [-s SELECT] [-f] [-L]

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            print list of available languages and exit
  -l LANG, --lang LANG  install model for the given language name or language code (example: 'english' or 'eng')
  -d DEST, --dest DEST  destination directory
  -s SELECT, --select SELECT
                        select particular models to install: tokenizer, morphosyntax, lemmatizer (comma-separated list)
  -f, --force           force reinstallation of existing files
  -L, --list            list installed models

So,

  • to check installed models: lima_models.py -L
  • to list available models: lima_models.py -i
  • to install models for e.g. Tamil: lima_models.py -l tam

Models for English and French should be installed by default (English being the most used language in NLP, and French the original LIMA authors' mother tongue).

Now choose UTF-8 encoded text files in one of the installed model languages (English in this example) and run the following commands in a terminal or command prompt:

cd /path/to/your/text/files/folder

analyzeText -l ud-eng -p deepud file.txt[^1]

This will write the result of the analysis to standard output in CoNLL-U Plus format. The table below is adapted for LIMA from the CoNLL-U format documentation:

| Field number | Field name | Description |
|---|---|---|
| 1 | ID | Word index, an integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0). |
| 2 | FORM | Word form or punctuation symbol. |
| 3 | LEMMA | Lemma or stem of the word form, or an underscore if not available. |
| 4 | UPOS | Part-of-speech tag. Will in the future be a Universal Part of Speech tag but is currently specific to LIMA. |
| 5 | XPOS | Language-specific part-of-speech tag; underscore if not available. |
| 6 | FEATS | List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available. Currently unavailable in LIMA, but the necessary data is internally available. |
| 7 | HEAD | Head of the current word, either a value of ID or zero (0). |
| 8 | DEPREL | Dependency relation to the HEAD. The set of dependency relations depends on the particular language. |
| 9 | DEPS | Enhanced dependency graph in the form of a list of head-deprel pairs; always an underscore, as it is not available in LIMA. |
| 10 | MISC | Any other annotation, as a pipe-separated list of key=value pairs. |

The MISC field includes annotations for named entities: the key is "NE" and the value is the type of the entity. Other keys are Pos and Len, the token's absolute position and length in the text, and SpaceAfter=No if there is no space between this token and the next one.
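
As an illustration, here is a minimal Python sketch (not part of LIMA; the script name and the exact value format of the NE key are assumptions) that reads this CoNLL-U output from standard input and lists the named-entity tokens found in the MISC field:

    import sys

    # Read CoNLL-U Plus lines produced by analyzeText from stdin.
    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue
        form, misc = fields[1], fields[9]
        # MISC is a pipe-separated list of key=value pairs,
        # e.g. "NE=Person.PERSON|Pos=20|Len=13" (value format assumed).
        pairs = dict(kv.split("=", 1) for kv in misc.split("|") if "=" in kv)
        if "NE" in pairs:
            print(form, pairs["NE"], pairs.get("Pos"), pairs.get("Len"))

You can pipe the analysis into it, for example: analyzeText -l ud-eng -p deepud file.txt | python list_entities.py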

Tokens can be simple, idioms, named entities, etc.

You can also produce a "bag of words" (BoW) binary representation of the file by changing the dumper executed at the end of the analysis pipeline: replace "conllDumper" by "bowDumper" in the file $LIMA_CONF/lima-lp-fre.xml and pass the parameter "-d bow" to analyzeText. You can view the content of the binary file thus produced using the readBowFile command:

readBowFile file.txt.bin

This kind of file contains the lemmas of simple terms (nouns, adjectives and verbs) and a representation of the named entities found in the text. Suppose that the content of the text file was "On 4th April 2011, Bill Williams looked at Paris from his home window.". Then the output of the readBowFile command will be:

    (4th_April_2011-8192-4)->[*(4th-8192-4)(April-16384-8)(2011-8192-14)]:DateTime.DATE:date=2011-04-04;value=4th April 2011
    (Bill_Williams-16384-20)->[*(Bill-16384-20)(Williams-16384-25)]:Person.PERSON:firstname=Bill;lastname=Williams;value=Bill Williams
    (look-49152-34)
    (Paris-16384-44)->[*(Paris-16384-44)]:Location.LOCATION:value=Paris
    (home-8192-59)
    (window-8192-64)

As you can see, dates, person names and locations are recognized as such. Each line describes a recognized term, either simple (the verb "to look") or complex. Complex terms are named entities ("Paris", "Bill Williams", "4th April 2011"). Each term is described by its normalized form, the numerical value of its category (more on this later) and its position in the text, followed by the details of its structure.

This text representation of BoW is not designed to be consumed by other programs; it is just there to give an idea of the content. A dedicated C++ API allows bags of words to be manipulated.

Using the LIMA Python module

# Upgrading pip is necessary in order to obtain the correct LIMA version
$ pip install --upgrade pip
$ pip install aymara==0.5.0b6
$ lima_models.py -l eng
# Either simply use the lima command to produce an analysis of a file in CoNLL-U format:
$ lima <path to the file to analyse>
# Or use the python API:
$ python
>>> import aymara.lima
>>> nlp = aymara.lima.Lima("ud-eng")
>>> doc = nlp('Hello, World!')
>>> print(doc[0].lemma)
hello
>>> print(repr(doc))
1       Hello   hello   INTJ    _       _               0       root    _       Pos=0|Len=5
2       ,       ,       PUNCT   _       _               1       punct   _       Pos=5|Len=1
3       World   World   PROPN   _       Number:Sing     1       vocative        _       Pos=7|Len=5
4       !       !       PUNCT   _       _               1       punct   _       Pos=12|Len=1

The LIMA python API is documented on ReadTheDocs.
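
A slightly larger usage sketch, assuming the document object is iterable over tokens (only indexing and the lemma attribute are shown above; check the ReadTheDocs API reference for the exact names):

    import aymara.lima

    nlp = aymara.lima.Lima("ud-eng")
    doc = nlp("On 4th April 2011, Bill Williams looked at Paris.")

    # Iterating over the document is an assumption to be checked
    # against the API documentation; `lemma` is shown above.
    for token in doc:
        print(token.lemma)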

Running the LIMA Docker container

If you have installed the LIMA Docker container (see the Install page), you can run LIMA either as a simple executable (analyzeText), a GUI (lima) or a server (limaserver).

To run the GUI, execute:

docker run -e DISPLAY=:0 -v /tmp/.X11-unix:/tmp/.X11-unix aymara/lima-python3.7 lima

For example, to make LIMA accessible from anywhere, start limaserver like this:

docker run -p <host-ip>:8080:8080/tcp aymara/lima-python3.7 limaserver

The LIMA server will be accessible on port 8080 of the host. You can then run, for example:

curl http://<host-ip>:8080/?lang=ud-eng\&pipeline=deepud --data-binary @file.txt
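
The same request can be sketched in Python with the third-party requests package (the endpoint and parameters are those of the curl command above):

    import requests

    # Replace <host-ip> with the actual address of your Docker host.
    with open("file.txt", "rb") as f:
        response = requests.post(
            "http://<host-ip>:8080/",
            params={"lang": "ud-eng", "pipeline": "deepud"},
            data=f,
        )
    print(response.text)  # the analysis result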

To change the configuration of LIMA inside the container, you can create your own configuration files in a dedicated folder on your host, mount this folder in your container and set the LIMA_CONF variable to point to it, as in other parts of this manual. To do that, please refer to the Docker manual.

Configuring LIMA

To configure LIMA for your own needs, you will have to copy the configuration and resources files you want to modify into dedicated folders and define the environment variables LIMA_CONF and LIMA_RESOURCES with these new folders listed before the system ones (/usr, or the LIMA install prefix if you built it from sources):

install -d ~/MyLima/conf
install -d ~/MyLima/resources
export LIMA_CONF=~/MyLima/conf:/usr/share/config/lima
export LIMA_RESOURCES=~/MyLima/resources:/usr/share/apps/lima/resources

The "LIMA Technical Documentation" page describes in details the various configuration possibilities, but suppose for now that you don't need the named entities extracted by LIMA in English. Then, you only copy the lima-lp-eng.xml file to your MyLima/conf folder and edit it by commenting out the following line in the main group of the Processors module:

<item value="SpecificEntitiesModex"/>
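
That is, wrap it in an XML comment:

    <!-- <item value="SpecificEntitiesModex"/> -->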

Troubleshooting the configuration

If your configuration files seem to be ignored, it may be because another file is read instead. To check in which order configuration files are searched, just set the LIMA_SHOW_CONFIG_PATH environment variable to a non-empty string and the list of searched folders will be displayed on the console.
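
For example:

    export LIMA_SHOW_CONFIG_PATH=yes
    analyzeText -l ud-eng -p deepud file.txt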

Linguistic processing steps

Linguistic processing steps in LIMA are called process units. They are executed one after the other in a pipeline. There can be several different pipelines (defined in the configuration files) for different uses. The process units' functions are described in a dedicated page; they also have a reference documentation page.

Note that there are dependencies between process units, but these dependencies are neither explicit nor automatically checked in the current version. Thus, deactivating one process unit can cause LIMA to stop with an error or even to crash. This will have to be corrected in future versions.

Creating a Modex: extracting new kinds of entities

Introduction

A Modex ("Module d'Extraction") is a set of compiled regular expression-like rules with their accompanying configuration file. It is the base tool in Lima for various things, including idiomatic expression recognizing and named entities extraction but also parsing in legacy languages. In UD-based pipelines, parsing uses deep learning models trained on Universal Dependencies corpora. You can create you own Modexes to extract entities specific for your application. For example, Twitter ids and Twitter hash tags are not natively supported by Lima. So, if your application is targeted at analyzing Tweets, then you will have to write your own Modex to extract them.

TwitterModex

The configuration file

There is only one configuration file for all the languages supported by the Modex (e.g. Twitter-modex.xml). It must be installed in a configuration directory listed in the LIMA_CONF environment variable (see above). It contains three modules defining:

  • groups and entity types;
  • processing units (processUnit);
  • resources to use for each language.

Groups and Types

This first module, named "entities", contains a group for each entity group and, in this group, the list (named entityList) of the entity types. For example:

<module name="entities">
    <group name="Twitter">
      <list name="entityList">
        <item value="TWITTERID"/>
        <item value="TWITTERHASH"/>
      </list>
    </group>
  </module>

Process units

This module, named Processors, defines the processing unit groups available for this Modex. These processing units can be pipelines (class ProcessUnitPipeline), which makes it possible to define a global process unit for the Modex that chains several rule applications.

  <module name="Processors">
    <group name="TwitterModex" class="ProcessUnitPipeline" >
      <list name="processUnitSequence">
        <item value="TwitterRecognition"/>
      </list>
    </group>
    <group name="TwitterRecognition" class="ApplyRecognizer">
      <param key="automaton" value="TwitterRules"/>
      <param key="applyOnGraph" value="AnalysisGraph"/>
      <param key="useSentenceBounds" value="no"/>
    </group>
  </module>

As in the analysis configuration file, each process unit is defined in its own group with its parameters. For an ApplyRecognizer process unit, these parameters are:

  • automaton: the name of a resource defined later in the resources specific to each language (cf. next section). Can be absent if "automatonList" is defined;
  • automatonList: a list of resource names defined later in the resources specific to each language. Ignored if the "automaton" parameter is defined;
  • applyOnGraph: the name of the analysis graph on which to apply the rules (AnalysisGraph if before part-of-speech tagging, PosGraph after). Defaults to "PosGraph";
  • useSentenceBounds: (yes or no) defines whether the automaton is applied between sentence boundaries or on the whole graph. Use "no" if this Modex is used before the "sentenceBoundariesFinder" process unit. Defaults to "no";
  • updateGraph: (yes or no, optional) defaults to "no";
  • resolveOverlappingEntities: (yes or no, optional) defaults to "no";
  • overlappingEntitiesStrategy: (IgnoreSmallest (default), IgnoreFirst or IgnoreSecond, optional);
  • testAllVertices: (yes or no, optional) defaults to "no";
  • stopAtFirstSuccess: (yes or no, optional) defaults to "yes";
  • onlyOneSuccessPerType: (yes or no, optional) defaults to "no";
  • storeInData: (optional) defaults to the empty string.

The details of each process unit's inputs, outputs, dependencies and configuration are described in its reference documentation page.

Resources

The resources definition modules for each language are called resources-xyz, with xyz the language trigram: <module name="resources-xyz">. They contain a group for each automaton defined above. This group, of class AutomatonRecognizer, defines the extractor parameters, particularly the path to the compiled rules file (relative to the global LIMA resources directories listed in LIMA_RESOURCES):

<group name="TwitterRules" class="AutomatonRecognizer">
      <param key="rules" value="Twitter/Twitter-eng.bin"/>
</group>

Next, the module contains groups defining the microcategories[^2] that will be assigned to the token replacing each recognized entity. The name of each of these groups must be the name of the corresponding group in the entities module concatenated with the string "Micros". It contains a list of microcategories for each entity; this list is named by the fully qualified name of the entity (<Group name>.<Entity name>):

<group name="TwitterMicros" class="SpecificEntitiesMicros">
      <list name="Twitter.TWITTERID">
        <item value="L_NOM_PROPRE"/>
      </list>
</group>

A complete configuration file

<?xml version='1.0' encoding='UTF-8'?>
<modulesConfig>
  <module name="entities">
    <group name="Twitter">
      <list name="entityList">
        <item value="TWITTERID"/>
        <item value="TWITTERHASH"/>
      </list>
    </group>
  </module>
  <module name="Processors">
    <group name="TwitterModex" class="ProcessUnitPipeline" >
      <list name="processUnitSequence">
        <item value="TwitterRecognition"/>
      </list>
    </group>
    <group name="TwitterRecognition" class="ApplyRecognizer">
      <param key="automaton" value="TwitterRules"/>
      <param key="applyOnGraph" value="AnalysisGraph"/>
      <param key="useSentenceBounds" value="no"/>
    </group>
  </module>
  <module name="resources-eng">
    <group name="TwitterRules" class="AutomatonRecognizer">
      <param key="rules" value="Twitter/Twitter-eng.bin"/>
    </group>
    <group name="TwitterMicros" class="SpecificEntitiesMicros">
      <list name="Twitter.TWITTERID">
        <item value="PROPN"/>
      </list>
      <list name="Twitter.TWITTERHASH">
        <item value="PROPN"/>
      </list>
    </group>
  </module>
</modulesConfig>

The rules files

The full syntax of rules files is described on the Modex Rules Format page. Here, we just describe the following example:

set encoding=utf8
using modex Twitter-modex.xml
using groups Twitter
set defaultAction=>CreateSpecificEntity()

#----------------------------------------------------------------------
# recognition of Twitter ids
#----------------------------------------------------------------------

@arobase=(\@)

@arobase::*:TWITTERID:

\#::*:TWITTERHASH:

The first four lines are metadata stating that the file is encoded in UTF-8, that it is a rules file for the Twitter Modex, that the entities created by its rules will belong to the Twitter group, and finally that by default the action associated with the rules will be to create a specific entity.

Next comes a line defining a class of tokens, here just the tokens composed of the at sign character (@, "arobase"). After this line, there are two rules. The first is triggered by encountering an at sign and matches any token after it; if it matches, a TWITTERID entity is created. The second one has the same format but is triggered by a hash character token and creates a TWITTERHASH entity.

Please refer to the full syntax description for details; let's just say here that rules are defined by a triggering token, followed by a regular expression describing the left context of the triggering token, a second one describing its right context, the type of the expression and possibly constraint functions.
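
Reading the first rule of the example against this description gives, using the rules-file comment syntax:

    # trigger  : left context : right context : type      :
    # @arobase :   (empty)    :       *       : TWITTERID :
    @arobase::*:TWITTERID: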

When the rules file is ready, you have to compile it with the following command (don't forget to install the configuration file beforehand):

compile-rules --language=eng --modex=Twitter-modex.xml -oTwitter-eng.bin Twitter-eng.rules

Then copy the binary file to a Twitter folder in your resources directory listed in the LIMA_RESOURCES environment variable.
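
For example, with the folders defined in the configuration section above:

    install -d ~/MyLima/resources/Twitter
    cp Twitter-eng.bin ~/MyLima/resources/Twitter/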

Using your new Modex

In the analysis configuration file (lima-lp-xyz.xml; copy it from the system folder to your configuration folder as described above), a Modex is included by explicitly including its entities, its processors and its resources:

<module name="entities">
  <group name="include">
    <list name="includeList">
      <item value="Twitter-modex.xml/entities"/>
    </list>
  </group>
</module>
<module name="Processors">
    <group name="include">
      <list name="includeList">
        <item value="Twitter-modex.xml/Processors"/>
      </list>
    </group>
  ...
  </module>
   <module name="Resources">
    <group name="include">
      <list name="includeList">
        <item value="Twitter-modex.xml/resources-xyz"/>
      </list>
    </group>
    ...
 </module>

The Modex process unit(s) can then be called in the various pipelines.
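
For example, assuming your analysis configuration defines a pipeline named main (actual pipeline names depend on your configuration files), the Modex pipeline defined above would be added to its processUnitSequence:

    <group name="main" class="ProcessUnitPipeline">
      <list name="processUnitSequence">
        ...
        <item value="TwitterModex"/>
      </list>
    </group>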

Integrating LIMA

There are several ways to integrate LIMA into your application. In this early version of this user manual, we will suppose that you just need to get the following information about the tokens present in the analyzed text: lemma, morphosyntactic category, position and, for specific entities, entity type. We also suppose that you want to invoke LIMA directly from your C++ code and to support multithreaded calls to the analyzer.

The code accompanying this user manual implements this. The dowork function initializes the analyzer, prepares the list of files to be analyzed (this list being protected by a mutex) and creates the specified number of threads, binding them to the analyze_thread function. The latter repeatedly picks a file to analyze, prepares the handler that will give access to the analysis result (here a BoWText), calls the analysis client and then dumps the output using the BoW API.
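
The accompanying code is in C++, but the same pattern can be sketched in Python with the aymara API (an illustration only: the BoW handler has no Python equivalent here, and each worker creates its own analyzer since thread-safety of a shared Lima object is not assumed):

    import queue
    import sys
    import threading

    import aymara.lima

    def analyze_thread(files: queue.Queue) -> None:
        # One analyzer per thread; queue.Queue plays the role of the
        # mutex-protected file list of the C++ example.
        nlp = aymara.lima.Lima("ud-eng")
        while True:
            try:
                path = files.get_nowait()
            except queue.Empty:
                return
            with open(path, encoding="utf-8") as f:
                # repr() of the document is its CoNLL-U dump, as shown above.
                print(repr(nlp(f.read())))

    def dowork(paths, nb_threads=4):
        files = queue.Queue()
        for p in paths:
            files.put(p)
        threads = [threading.Thread(target=analyze_thread, args=(files,))
                   for _ in range(nb_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    if __name__ == "__main__":
        dowork(sys.argv[1:])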

[^1]: The language codes for English and French are ud-eng and ud-fre for historical reasons. For other languages, the language code is just the ISO trigram as listed by lima_models.py -i.

[^2]: This will be defined in a later version of this document. Currently, you can use the values NNP for English and NPP for French.