NER pipelines
LIMA includes predefined pipelines that implement named entity recognition (NER) for English and French. These pipelines are available out of the box.
Pipeline | Input | Output* | Rules | RNN |
---|---|---|---|---|
ner-rules | plain text | CoNLL-03 | + | |
ner-deep | plain text | CoNLL-03 | | + |
ner-fusion | plain text | CoNLL-03 | + | + |
ner-rules-pretok | CoNLL-U | CoNLL-03 | + | |
ner-deep-pretok | CoNLL-U | CoNLL-03 | | + |
ner-fusion-pretok | CoNLL-U | CoNLL-03 | + | + |
\* Pipelines can be configured for CoNLL-U output (see the `conllDumperNer` processing unit configuration in `lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml`).
Rule-based NER is implemented using Modex rules. The corresponding source code (the rules) is in the `lima_linguisticdata/SpecificEntities/$LANG-CODE` directory.
The RNN-based model for English is trained on the CoNLL-03 dataset; the RNN-based model for French is trained on WikiNER.
A standard LIMA installation is sufficient for the pipelines with CoNLL-U input (pre-tokenized text).
The pipelines that process plain text additionally require the tokenization models. These models can be installed with the `lima_models.py` script by following the instructions in the lima-models repository.
For the RNN-based processing units to be available, LIMA must be compiled with TensorFlow enabled. This is the default mode. If you are building LIMA from source, please check the 'Installation' section on the UD pipelines page.
analyzeText -l LANG-CODE -p PIPELINE input_file.txt
- `LANG-CODE`: `eng` or `fre`
- `PIPELINE`: one of the pipelines listed above
For example, for rule-based processing of English text, type:
analyzeText -l eng -p ner-rules input_file.txt
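When the analyzer is called from a script, the same command line can be assembled programmatically. The following is a minimal Python sketch, assuming `analyzeText` is on the `PATH`. The helper names `build_command` and `run_ner` are hypothetical and not part of LIMA, and only the `-l` and `-p` options shown above are used.

```python
import subprocess

# Language codes and pipeline names as listed on this page.
LANGUAGES = {"eng", "fre"}
PIPELINES = {
    "ner-rules", "ner-deep", "ner-fusion",
    "ner-rules-pretok", "ner-deep-pretok", "ner-fusion-pretok",
}


def build_command(lang, pipeline, input_file):
    """Return the analyzeText argument list for one input file."""
    if lang not in LANGUAGES:
        raise ValueError(f"unsupported language code: {lang!r}")
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown NER pipeline: {pipeline!r}")
    return ["analyzeText", "-l", lang, "-p", pipeline, input_file]


def run_ner(lang, pipeline, input_file):
    """Run analyzeText (assumed to be on PATH); raise on a non-zero exit code."""
    subprocess.run(build_command(lang, pipeline, input_file), check=True)


if __name__ == "__main__":
    # Prints the command only; call run_ner() to actually execute it.
    print(" ".join(build_command("eng", "ner-rules", "input_file.txt")))
```

Validating the language code and pipeline name before launching the process catches typos early, since a wrong value would otherwise only show up as an analyzer error.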
Processing unit | ner-rules | ner-deep | ner-fusion | ner-rules-pretok | ner-deep-pretok | ner-fusion-pretok |
---|---|---|---|---|---|---|
Input: | | | | | | |
cpptftokenizer | + | + | + | | | |
conllureader | | | | + | + | + |
Pre-processing: | | | | | | |
simpleWord | + | + | + | + | + | + |
hyphenWordAlternatives | + | + | + | + | + | + |
defaultProperties | + | + | + | + | + | + |
RNN-based NER: | | | | | | |
tensorflowSpecificEntitiesFusion | | + | + | | + | + |
sentenceBoundariesUpdater | | + | + | | + | + |
Rule-based NER: | | | | | | |
SpecificEntitiesModex | + | | + | + | | + |
sentenceBoundariesUpdater | + | | + | + | | + |
Output: | | | | | | |
conllDumperNer | + | + | + | + | + | + |
For the up-to-date definitions of these pipelines, please check the corresponding configuration files: `lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml`.
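The composition of each pipeline can also be expressed in a few lines of code. The following Python sketch is one reading of the table above, not the actual configuration format. It assumes that plain-text pipelines start with `cpptftokenizer`, that `-pretok` pipelines start with `conllureader`, and that units run in the table's row order. The function name `processing_units` is hypothetical.

```python
# Processing units per stage, transcribed from the table above.
INPUT_UNITS = {"plain": "cpptftokenizer", "pretok": "conllureader"}
PREPROCESSING = ["simpleWord", "hyphenWordAlternatives", "defaultProperties"]
RNN_NER = ["tensorflowSpecificEntitiesFusion", "sentenceBoundariesUpdater"]
RULE_NER = ["SpecificEntitiesModex", "sentenceBoundariesUpdater"]
OUTPUT = ["conllDumperNer"]

# NER stage per method; "fusion" chains both (row order of the table is assumed).
METHODS = {
    "rules": RULE_NER,
    "deep": RNN_NER,
    "fusion": RNN_NER + RULE_NER,
}


def processing_units(pipeline):
    """List the processing units of a pipeline such as 'ner-fusion-pretok'."""
    pretok = pipeline.endswith("-pretok")
    base = pipeline[: -len("-pretok")] if pretok else pipeline
    prefix, _, method = base.partition("-")
    if prefix != "ner" or method not in METHODS:
        raise ValueError(f"unknown NER pipeline: {pipeline!r}")
    reader = INPUT_UNITS["pretok" if pretok else "plain"]
    return [reader] + PREPROCESSING + METHODS[method] + OUTPUT


if __name__ == "__main__":
    for unit in processing_units("ner-fusion-pretok"):
        print(unit)
```

The XML configuration files remain the authoritative definition; the sketch only makes the naming pattern (`ner-<method>` with an optional `-pretok` suffix) explicit.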
The NER pipelines are evaluated in their pre-tokenized versions (i.e. with CoNLL-U input).
processed 46435 tokens with 5616 phrases; found: 4440 phrases; correct: 2984.
accuracy: 92.45%; precision: 67.21%; recall: 53.13%; FB1: 59.35
LOC: precision: 64.31%; recall: 84.99%; FB1: 73.22 2202
MISC: precision: 90.32%; recall: 3.99%; FB1: 7.65 31
ORG: precision: 57.07%; recall: 20.10%; FB1: 29.73 580
PER: precision: 74.31%; recall: 75.47%; FB1: 74.88 1627
processed 46435 tokens with 5616 phrases; found: 5651 phrases; correct: 5021.
accuracy: 97.81%; precision: 88.85%; recall: 89.41%; FB1: 89.13
LOC: precision: 89.64%; recall: 93.46%; FB1: 91.51 1737
MISC: precision: 75.76%; recall: 78.03%; FB1: 76.88 722
ORG: precision: 86.68%; recall: 85.37%; FB1: 86.02 1622
PER: precision: 96.24%; recall: 94.32%; FB1: 95.27 1570
processed 46435 tokens with 5616 phrases; found: 5717 phrases; correct: 5016.
accuracy: 97.69%; precision: 87.74%; recall: 89.32%; FB1: 88.52
LOC: precision: 89.19%; recall: 93.64%; FB1: 91.36 1749
MISC: precision: 75.76%; recall: 78.03%; FB1: 76.88 722
ORG: precision: 85.09%; recall: 85.91%; FB1: 85.50 1663
PER: precision: 94.38%; recall: 93.26%; FB1: 93.81 1583
processed 3499655 tokens with 296413 phrases; found: 211853 phrases; correct: 128203.
accuracy: 92.12%; precision: 60.52%; recall: 43.25%; FB1: 50.45
LOC: precision: 64.89%; recall: 55.78%; FB1: 59.99 72577
MISC: precision: 57.84%; recall: 1.84%; FB1: 3.56 2144
ORG: precision: 41.70%; recall: 36.84%; FB1: 39.12 41009
PER: precision: 65.30%; recall: 64.00%; FB1: 64.64 96123
processed 3499655 tokens with 296413 phrases; found: 295295 phrases; correct: 186492.
accuracy: 94.80%; precision: 63.15%; recall: 62.92%; FB1: 63.04
LOC: precision: 68.52%; recall: 76.16%; FB1: 72.14 93855
MISC: precision: 50.49%; recall: 36.02%; FB1: 42.05 48138
ORG: precision: 44.16%; recall: 60.41%; FB1: 51.02 63500
PER: precision: 77.77%; recall: 71.21%; FB1: 74.34 89802
processed 3499655 tokens with 296413 phrases; found: 302747 phrases; correct: 190013.
accuracy: 94.88%; precision: 62.76%; recall: 64.10%; FB1: 63.43
LOC: precision: 67.66%; recall: 76.30%; FB1: 71.72 95222
MISC: precision: 50.38%; recall: 36.03%; FB1: 42.01 48264
ORG: precision: 43.62%; recall: 61.13%; FB1: 50.91 65060
PER: precision: 77.38%; recall: 74.33%; FB1: 75.82 94201
processed 3499679 tokens with 251726 phrases; found: 141976 phrases; correct: 102255.
accuracy: 92.67%; precision: 72.02%; recall: 40.62%; FB1: 51.95
LOC: precision: 68.93%; recall: 45.60%; FB1: 54.89 74057
MISC: precision: 45.09%; recall: 4.74%; FB1: 8.57 4096
ORG: precision: 53.15%; recall: 33.15%; FB1: 40.84 15262
PER: precision: 84.94%; recall: 54.04%; FB1: 66.06 48561
processed 3499679 tokens with 251726 phrases; found: 251529 phrases; correct: 234393.
accuracy: 99.18%; precision: 93.19%; recall: 93.11%; FB1: 93.15
LOC: precision: 92.84%; recall: 92.85%; FB1: 92.85 111958
MISC: precision: 89.94%; recall: 86.71%; FB1: 88.29 37592
ORG: precision: 91.13%; recall: 90.52%; FB1: 90.82 24305
PER: precision: 95.90%; recall: 97.60%; FB1: 96.75 77674
processed 3499679 tokens with 251726 phrases; found: 256512 phrases; correct: 231141.
accuracy: 98.66%; precision: 90.11%; recall: 91.82%; FB1: 90.96
LOC: precision: 86.82%; recall: 90.16%; FB1: 88.46 116249
MISC: precision: 88.85%; recall: 86.71%; FB1: 87.77 38051
ORG: precision: 89.69%; recall: 89.71%; FB1: 89.70 24474
PER: precision: 95.77%; recall: 97.55%; FB1: 96.65 77738
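The overall precision, recall and FB1 values in each report follow directly from the phrase counts on its first line. As a minimal sketch (the function name `prf` is hypothetical), the figures of the first report above can be reproduced like this:

```python
def prf(gold, found, correct):
    """Return precision, recall and F1 (as percentages) from phrase counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Counts from the first report: 5616 phrases; found: 4440; correct: 2984.
    p, r, f = prf(gold=5616, found=4440, correct=2984)
    print(f"precision: {p:.2f}%; recall: {r:.2f}%; FB1: {f:.2f}")
```

Precision is the share of found phrases that are correct, recall is the share of reference phrases that were found, and FB1 is their harmonic mean.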
Measure | ner-rules | ner-deep | ner-fusion |
---|---|---|---|
tokens / user time | 6400 tok/sec | 295 tok/sec | 285 tok/sec |
tokens / real time | 6400 tok/sec | 1900 tok/sec | 1500 tok/sec |

Rule-based NER runs in a single thread.
RNN-based NER uses TensorFlow for computation and uses all available CPU cores.
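The gap between the two rows of the table reflects this parallelism. For a fixed number of tokens, dividing the real-time rate by the user-time rate gives the ratio of CPU time to wall-clock time, which roughly indicates how many cores were busy on average. The sketch below is an interpretation of the table, not a measurement: the hardware and core count are not stated on this page, and system time is ignored.

```python
# Throughput figures (tokens per second) from the table above.
THROUGHPUT = {
    "ner-rules": {"user": 6400, "real": 6400},
    "ner-deep": {"user": 295, "real": 1900},
    "ner-fusion": {"user": 285, "real": 1500},
}


def cpu_to_wall_ratio(user_rate, real_rate):
    """user_time / real_time for the same token count equals real_rate / user_rate."""
    return real_rate / user_rate


if __name__ == "__main__":
    for name, rate in THROUGHPUT.items():
        ratio = cpu_to_wall_ratio(rate["user"], rate["real"])
        print(f"{name}: about {ratio:.1f} CPU seconds per wall-clock second")
```

A ratio of 1.0 for `ner-rules` is consistent with single-threaded processing, while the ratios of about 6.4 for `ner-deep` and 5.3 for `ner-fusion` are consistent with multi-core execution.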