
LIMA includes predefined pipelines implementing Named Entity Recognition (NER) for English and French. These pipelines are available out of the box.

| Pipeline | Input | Output* | Rules | RNN |
|---|---|---|---|---|
| ner-rules | plain text | CoNLL-03 | + | |
| ner-deep | plain text | CoNLL-03 | | + |
| ner-fusion | plain text | CoNLL-03 | + | + |
| ner-rules-pretok | CoNLL-U | CoNLL-03 | + | |
| ner-deep-pretok | CoNLL-U | CoNLL-03 | | + |
| ner-fusion-pretok | CoNLL-U | CoNLL-03 | + | + |

* Pipelines can be configured for CoNLL-U output (see conllDumperNer processing unit configuration in lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml).
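
For reference, CoNLL-03 style NER output associates each token with an entity tag over the LOC, MISC, ORG and PER categories used in the evaluation below. A minimal sketch of such a tagging (an illustration only: the exact columns and tagging scheme emitted by conllDumperNer may differ):

```
U.N.     B-ORG
official O
Ekeus    B-PER
heads    O
for      O
Baghdad  B-LOC
.        O
```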

Rule-based NER is implemented with Modex rules. The rule sources are located in the lima_linguisticdata/SpecificEntities/$LANG-CODE directory.

The RNN-based model for English is trained on the CoNLL-03 dataset; the French model is trained on WikiNER.

Installation

A standard LIMA installation is sufficient for the pipelines with CoNLL-U input (i.e. those processing pre-tokenized text).

The tokenization models are required for the pipelines processing plain text. They can be installed with the lima_models.py script, following the instructions in the lima-models repository.

For the RNN-based processing units to be available, LIMA must be compiled with TensorFlow enabled, which is the default mode. If you are building LIMA from sources, please check the 'Installation' section of the UD pipelines page.
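
As a quick post-installation sanity check, one can run both a plain-text and a pre-tokenized pipeline on small inputs. A hedged sketch (file names are placeholders; for the actual model installation command, consult the lima-models README rather than the help invocation shown here):

```
# Model installer options (see the lima-models repository for the exact
# installation command for your language).
lima_models.py -h

# Plain-text input: requires the tokenization models and a TensorFlow-enabled build.
echo "John Smith works for Acme Corp in Paris." > sample.txt
analyzeText -l eng -p ner-deep sample.txt

# Pre-tokenized (CoNLL-U) input: works with an ordinary LIMA installation.
analyzeText -l eng -p ner-rules-pretok sample.conllu
```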

Examples

analyzeText -l LANG-CODE -p PIPELINE input_file.txt

LANG-CODE: eng or fre

PIPELINE: one of the pipelines listed above.

For example, for rule-based processing of English text, type:

analyzeText -l eng -p ner-rules input_file.txt
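
The *-pretok pipelines expect CoNLL-U input instead of plain text. A minimal sketch of such an input file, using the standard 10-column CoNLL-U layout with underscores for unfilled fields (whether the conllureader unit requires any of the morphological columns to be filled is not covered here, so treat the underscores as an assumption):

```
# text = John Smith lives in Paris.
1	John	_	_	_	_	_	_	_	_
2	Smith	_	_	_	_	_	_	_	_
3	lives	_	_	_	_	_	_	_	_
4	in	_	_	_	_	_	_	_	_
5	Paris	_	_	_	_	_	_	_	_
6	.	_	_	_	_	_	_	_	_
```

It can then be analyzed with, for example:

analyzeText -l eng -p ner-rules-pretok input_file.conllu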

Processing units used

| Processing unit | ner-rules | ner-deep | ner-fusion | ner-rules-pretok | ner-deep-pretok | ner-fusion-pretok |
|---|---|---|---|---|---|---|
| Input: | | | | | | |
| cpptftokenizer | + | + | + | | | |
| conllureader | | | | + | + | + |
| Pre-processing: | | | | | | |
| simpleWord | + | + | + | + | + | + |
| hyphenWordAlternatives | + | + | + | + | + | + |
| defaultProperties | + | + | + | + | + | + |
| RNN-based NER: | | | | | | |
| tensorflowSpecificEntitiesFusion | | + | + | | + | + |
| sentenceBoundariesUpdater | | + | + | | + | + |
| Rule-based NER: | | | | | | |
| SpecificEntitiesModex | + | | + | + | | + |
| sentenceBoundariesUpdater | + | | + | + | | + |
| Output: | | | | | | |
| conllDumperNer | + | + | + | + | + | + |

For the up-to-date definitions of these pipelines, please check the corresponding configuration files: lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml.
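
To inspect a pipeline definition directly in the source tree, a plain text search is enough; for example, for the English fusion pipeline:

grep -n 'ner-fusion' lima_linguisticprocessing/conf/lima-lp-eng.xml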

Evaluation

The NER pipelines are evaluated in their pre-tokenized versions (i.e. with CoNLL-U input).

English

CoNLL-03 dataset (eng.testb)

Rules only (ner-rules-pretok)

processed 46435 tokens with 5616 phrases; found: 4440 phrases; correct: 2984.
accuracy:  92.45%; precision:  67.21%; recall:  53.13%; FB1:  59.35
              LOC: precision:  64.31%; recall:  84.99%; FB1:  73.22  2202
             MISC: precision:  90.32%; recall:   3.99%; FB1:   7.65  31
              ORG: precision:  57.07%; recall:  20.10%; FB1:  29.73  580
              PER: precision:  74.31%; recall:  75.47%; FB1:  74.88  1627
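
These reports follow the usual conlleval conventions: precision is correct/found, recall is correct/(gold phrases), and FB1 is their harmonic mean. For the run above, precision = 2984 / 4440 ≈ 67.21%, recall = 2984 / 5616 ≈ 53.13% and FB1 = 2 × 67.21 × 53.13 / (67.21 + 53.13) ≈ 59.35; the trailing number on each entity line is the number of phrases of that type found by the pipeline (2202 + 31 + 580 + 1627 = 4440).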

RNN only (ner-deep-pretok)

processed 46435 tokens with 5616 phrases; found: 5651 phrases; correct: 5021.
accuracy:  97.81%; precision:  88.85%; recall:  89.41%; FB1:  89.13
              LOC: precision:  89.64%; recall:  93.46%; FB1:  91.51  1737
             MISC: precision:  75.76%; recall:  78.03%; FB1:  76.88  722
              ORG: precision:  86.68%; recall:  85.37%; FB1:  86.02  1622
              PER: precision:  96.24%; recall:  94.32%; FB1:  95.27  1570

RNN + Rules (ner-fusion-pretok)

processed 46435 tokens with 5616 phrases; found: 5717 phrases; correct: 5016.
accuracy:  97.69%; precision:  87.74%; recall:  89.32%; FB1:  88.52
              LOC: precision:  89.19%; recall:  93.64%; FB1:  91.36  1749
             MISC: precision:  75.76%; recall:  78.03%; FB1:  76.88  722
              ORG: precision:  85.09%; recall:  85.91%; FB1:  85.50  1663
              PER: precision:  94.38%; recall:  93.26%; FB1:  93.81  1583

WikiNER (aij-wikiner-en-wp2)

Rules only (ner-rules-pretok)

processed 3499655 tokens with 296413 phrases; found: 211853 phrases; correct: 128203.
accuracy:  92.12%; precision:  60.52%; recall:  43.25%; FB1:  50.45
              LOC: precision:  64.89%; recall:  55.78%; FB1:  59.99  72577
             MISC: precision:  57.84%; recall:   1.84%; FB1:   3.56  2144
              ORG: precision:  41.70%; recall:  36.84%; FB1:  39.12  41009
              PER: precision:  65.30%; recall:  64.00%; FB1:  64.64  96123

RNN only (ner-deep-pretok)

processed 3499655 tokens with 296413 phrases; found: 295295 phrases; correct: 186492.
accuracy:  94.80%; precision:  63.15%; recall:  62.92%; FB1:  63.04
              LOC: precision:  68.52%; recall:  76.16%; FB1:  72.14  93855
             MISC: precision:  50.49%; recall:  36.02%; FB1:  42.05  48138
              ORG: precision:  44.16%; recall:  60.41%; FB1:  51.02  63500
              PER: precision:  77.77%; recall:  71.21%; FB1:  74.34  89802

RNN + Rules (ner-fusion-pretok)

processed 3499655 tokens with 296413 phrases; found: 302747 phrases; correct: 190013.
accuracy:  94.88%; precision:  62.76%; recall:  64.10%; FB1:  63.43
              LOC: precision:  67.66%; recall:  76.30%; FB1:  71.72  95222
             MISC: precision:  50.38%; recall:  36.03%; FB1:  42.01  48264
              ORG: precision:  43.62%; recall:  61.13%; FB1:  50.91  65060
              PER: precision:  77.38%; recall:  74.33%; FB1:  75.82  94201

French

WikiNER (aij-wikiner-fr-wp2)

Rules only (ner-rules-pretok)

processed 3499679 tokens with 251726 phrases; found: 141976 phrases; correct: 102255.
accuracy:  92.67%; precision:  72.02%; recall:  40.62%; FB1:  51.95
              LOC: precision:  68.93%; recall:  45.60%; FB1:  54.89  74057
             MISC: precision:  45.09%; recall:   4.74%; FB1:   8.57  4096
              ORG: precision:  53.15%; recall:  33.15%; FB1:  40.84  15262
              PER: precision:  84.94%; recall:  54.04%; FB1:  66.06  48561

RNN only (ner-deep-pretok)

processed 3499679 tokens with 251726 phrases; found: 251529 phrases; correct: 234393.
accuracy:  99.18%; precision:  93.19%; recall:  93.11%; FB1:  93.15
              LOC: precision:  92.84%; recall:  92.85%; FB1:  92.85  111958
             MISC: precision:  89.94%; recall:  86.71%; FB1:  88.29  37592
              ORG: precision:  91.13%; recall:  90.52%; FB1:  90.82  24305
              PER: precision:  95.90%; recall:  97.60%; FB1:  96.75  77674

RNN + Rules (ner-fusion-pretok)

processed 3499679 tokens with 251726 phrases; found: 256512 phrases; correct: 231141.
accuracy:  98.66%; precision:  90.11%; recall:  91.82%; FB1:  90.96
              LOC: precision:  86.82%; recall:  90.16%; FB1:  88.46  116249
             MISC: precision:  88.85%; recall:  86.71%; FB1:  87.77  38051
              ORG: precision:  89.69%; recall:  89.71%; FB1:  89.70  24474
              PER: precision:  95.77%; recall:  97.55%; FB1:  96.65  77738

Computation speed

| | ner-rules | ner-deep | ner-fusion |
|---|---|---|---|
| tokens / user time | 6400 tok/sec | 295 tok/sec | 285 tok/sec |
| tokens / real time | 6400 tok/sec | 1900 tok/sec | 1500 tok/sec |

Rule-based NER runs single-threaded.

RNN-based NER uses TensorFlow for its computations and takes advantage of all available CPU cores.
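
A hedged sketch of how such throughput figures can be reproduced with standard shell tools (the file name is a placeholder, and wc -w only gives a rough token count that will not exactly match LIMA's tokenization):

```
wc -w input_file.txt                                   # rough token count
time analyzeText -l eng -p ner-deep input_file.txt > /dev/null
# tokens / real time: token count divided by the "real" figure
# tokens / user time: token count divided by the "user" figure
```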