
LIMA includes predefined pipelines implementing Named Entity Recognition (NER) for English and French. These pipelines are available out of the box.

| Pipeline | Input | Output* | Rules | RNN |
|---|---|---|---|---|
| ner-rules | plain text | CoNLL-03 | + | |
| ner-deep | plain text | CoNLL-03 | | + |
| ner-fusion | plain text | CoNLL-03 | + | + |
| ner-rules-pretok | CoNLL-U | CoNLL-03 | + | |
| ner-deep-pretok | CoNLL-U | CoNLL-03 | | + |
| ner-fusion-pretok | CoNLL-U | CoNLL-03 | + | + |

* Pipelines can be configured for CoNLL-U output (see conllDumperNer processing unit configuration in lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml).
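
For reference, CoNLL-03 style NER output associates each token with an entity tag over the LOC, MISC, ORG and PER categories used in the evaluation below. A minimal sketch of such a tagging (an illustration only: the exact columns and tagging scheme emitted by conllDumperNer may differ):

```
U.N.     B-ORG
official O
Ekeus    B-PER
heads    O
for      O
Baghdad  B-LOC
.        O
```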

Rule-based NER is implemented with Modex rules. The rule sources are located in the lima_linguisticdata/SpecificEntities/$LANG-CODE directory.

The RNN-based model for English is trained on the CoNLL-03 dataset; the French model is trained on WikiNER.

Installation

A standard LIMA installation is sufficient for the pipelines with CoNLL-U input (i.e. those processing pre-tokenized text).

The tokenization models are required for the pipelines processing plain text. They can be installed with the lima_models.py script, following the instructions in the lima-models repository.

For the RNN-based processing units to be available, LIMA must be compiled with TensorFlow enabled, which is the default mode. If you are building LIMA from sources, please check the 'Installation' section of the UD pipelines page.
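
As a quick post-installation sanity check, one can run both a plain-text and a pre-tokenized pipeline on small inputs. A hedged sketch (file names are placeholders; for the actual model installation command, consult the lima-models README rather than the help invocation shown here):

```
# Model installer options (see the lima-models repository for the exact
# installation command for your language).
lima_models.py -h

# Plain-text input: requires the tokenization models and a TensorFlow-enabled build.
echo "John Smith works for Acme Corp in Paris." > sample.txt
analyzeText -l eng -p ner-deep sample.txt

# Pre-tokenized (CoNLL-U) input: works with an ordinary LIMA installation.
analyzeText -l eng -p ner-rules-pretok sample.conllu
```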

Examples

analyzeText -l LANG-CODE -p PIPELINE input_file.txt

LANG-CODE: eng or fre

PIPELINE: one of the pipelines listed above.

For example, for rule-based processing of English text, type:

analyzeText -l eng -p ner-rules input_file.txt
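
The *-pretok pipelines expect CoNLL-U input instead of plain text. A minimal sketch of such an input file, using the standard 10-column CoNLL-U layout with underscores for unfilled fields (whether the conllureader unit requires any of the morphological columns to be filled is not covered here, so treat the underscores as an assumption):

```
# text = John Smith lives in Paris.
1	John	_	_	_	_	_	_	_	_
2	Smith	_	_	_	_	_	_	_	_
3	lives	_	_	_	_	_	_	_	_
4	in	_	_	_	_	_	_	_	_
5	Paris	_	_	_	_	_	_	_	_
6	.	_	_	_	_	_	_	_	_
```

It can then be analyzed with, for example:

analyzeText -l eng -p ner-rules-pretok input_file.conllu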

Processing units used

| Processing unit | ner-rules | ner-deep | ner-fusion | ner-rules-pretok | ner-deep-pretok | ner-fusion-pretok |
|---|---|---|---|---|---|---|
| Input: | | | | | | |
| cpptftokenizer | + | + | + | | | |
| conllureader | | | | + | + | + |
| Pre-processing: | | | | | | |
| simpleWord | + | + | + | + | + | + |
| hyphenWordAlternatives | + | + | + | + | + | + |
| defaultProperties | + | + | + | + | + | + |
| RNN-based NER: | | | | | | |
| tensorflowSpecificEntitiesFusion | | + | + | | + | + |
| sentenceBoundariesUpdater | | + | + | | + | + |
| Rule-based NER: | | | | | | |
| SpecificEntitiesModex | + | | + | + | | + |
| sentenceBoundariesUpdater | + | | + | + | | + |
| Output: | | | | | | |
| conllDumperNer | + | + | + | + | + | + |

For the up-to-date definitions of these pipelines, please check the corresponding configuration files: lima_linguisticprocessing/conf/lima-lp-$LANG-CODE.xml.
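
To inspect a pipeline definition directly in the source tree, a plain text search is enough; for example, for the English fusion pipeline:

grep -n 'ner-fusion' lima_linguisticprocessing/conf/lima-lp-eng.xml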

Evaluation

The NER pipelines are evaluated in their pre-tokenized versions (i.e. with CoNLL-U input).

English

CoNLL-03 dataset (eng.testb)

Rules only (ner-rules-pretok)

processed 46435 tokens with 5616 phrases; found: 4440 phrases; correct: 2984.
accuracy:  92.45%; precision:  67.21%; recall:  53.13%; FB1:  59.35
              LOC: precision:  64.31%; recall:  84.99%; FB1:  73.22  2202
             MISC: precision:  90.32%; recall:   3.99%; FB1:   7.65  31
              ORG: precision:  57.07%; recall:  20.10%; FB1:  29.73  580
              PER: precision:  74.31%; recall:  75.47%; FB1:  74.88  1627
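
These reports follow the usual conlleval conventions: precision is correct/found, recall is correct/(gold phrases), and FB1 is their harmonic mean. For the run above, precision = 2984 / 4440 ≈ 67.21%, recall = 2984 / 5616 ≈ 53.13% and FB1 = 2 × 67.21 × 53.13 / (67.21 + 53.13) ≈ 59.35; the trailing number on each entity line is the number of phrases of that type found by the pipeline (2202 + 31 + 580 + 1627 = 4440).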

RNN only (ner-deep-pretok)

processed 46435 tokens with 5616 phrases; found: 5651 phrases; correct: 5021.
accuracy:  97.81%; precision:  88.85%; recall:  89.41%; FB1:  89.13
              LOC: precision:  89.64%; recall:  93.46%; FB1:  91.51  1737
             MISC: precision:  75.76%; recall:  78.03%; FB1:  76.88  722
              ORG: precision:  86.68%; recall:  85.37%; FB1:  86.02  1622
              PER: precision:  96.24%; recall:  94.32%; FB1:  95.27  1570

RNN + Rules (ner-fusion-pretok)

processed 46435 tokens with 5616 phrases; found: 5717 phrases; correct: 5016.
accuracy:  97.69%; precision:  87.74%; recall:  89.32%; FB1:  88.52
              LOC: precision:  89.19%; recall:  93.64%; FB1:  91.36  1749
             MISC: precision:  75.76%; recall:  78.03%; FB1:  76.88  722
              ORG: precision:  85.09%; recall:  85.91%; FB1:  85.50  1663
              PER: precision:  94.38%; recall:  93.26%; FB1:  93.81  1583

WikiNER (aij-wikiner-en-wp2)

Rules only (ner-rules-pretok)

processed 3499655 tokens with 296413 phrases; found: 211853 phrases; correct: 128203.
accuracy:  92.12%; precision:  60.52%; recall:  43.25%; FB1:  50.45
              LOC: precision:  64.89%; recall:  55.78%; FB1:  59.99  72577
             MISC: precision:  57.84%; recall:   1.84%; FB1:   3.56  2144
              ORG: precision:  41.70%; recall:  36.84%; FB1:  39.12  41009
              PER: precision:  65.30%; recall:  64.00%; FB1:  64.64  96123

RNN only (ner-deep-pretok)

processed 3499655 tokens with 296413 phrases; found: 295295 phrases; correct: 186492.
accuracy:  94.80%; precision:  63.15%; recall:  62.92%; FB1:  63.04
              LOC: precision:  68.52%; recall:  76.16%; FB1:  72.14  93855
             MISC: precision:  50.49%; recall:  36.02%; FB1:  42.05  48138
              ORG: precision:  44.16%; recall:  60.41%; FB1:  51.02  63500
              PER: precision:  77.77%; recall:  71.21%; FB1:  74.34  89802

RNN + Rules (ner-fusion-pretok)

processed 3499655 tokens with 296413 phrases; found: 302747 phrases; correct: 190013.
accuracy:  94.88%; precision:  62.76%; recall:  64.10%; FB1:  63.43
              LOC: precision:  67.66%; recall:  76.30%; FB1:  71.72  95222
             MISC: precision:  50.38%; recall:  36.03%; FB1:  42.01  48264
              ORG: precision:  43.62%; recall:  61.13%; FB1:  50.91  65060
              PER: precision:  77.38%; recall:  74.33%; FB1:  75.82  94201

French

WikiNER (aij-wikiner-fr-wp2)

Rules only (ner-rules-pretok)

processed 3499679 tokens with 251726 phrases; found: 141976 phrases; correct: 102255.
accuracy:  92.67%; precision:  72.02%; recall:  40.62%; FB1:  51.95
              LOC: precision:  68.93%; recall:  45.60%; FB1:  54.89  74057
             MISC: precision:  45.09%; recall:   4.74%; FB1:   8.57  4096
              ORG: precision:  53.15%; recall:  33.15%; FB1:  40.84  15262
              PER: precision:  84.94%; recall:  54.04%; FB1:  66.06  48561

RNN only (ner-deep-pretok)

processed 3499679 tokens with 251726 phrases; found: 251529 phrases; correct: 234393.
accuracy:  99.18%; precision:  93.19%; recall:  93.11%; FB1:  93.15
              LOC: precision:  92.84%; recall:  92.85%; FB1:  92.85  111958
             MISC: precision:  89.94%; recall:  86.71%; FB1:  88.29  37592
              ORG: precision:  91.13%; recall:  90.52%; FB1:  90.82  24305
              PER: precision:  95.90%; recall:  97.60%; FB1:  96.75  77674

RNN + Rules (ner-fusion-pretok)

processed 3499679 tokens with 251726 phrases; found: 256512 phrases; correct: 231141.
accuracy:  98.66%; precision:  90.11%; recall:  91.82%; FB1:  90.96
              LOC: precision:  86.82%; recall:  90.16%; FB1:  88.46  116249
             MISC: precision:  88.85%; recall:  86.71%; FB1:  87.77  38051
              ORG: precision:  89.69%; recall:  89.71%; FB1:  89.70  24474
              PER: precision:  95.77%; recall:  97.55%; FB1:  96.65  77738

Computation speed

| | ner-rules | ner-deep | ner-fusion |
|---|---|---|---|
| tokens / user time | 6400 tok/sec | 295 tok/sec | 285 tok/sec |
| tokens / real time | 6400 tok/sec | 1900 tok/sec | 1500 tok/sec |

Rule-based NER runs single-threaded.

RNN-based NER uses TensorFlow for its computations and takes advantage of all available CPU cores.
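
A hedged sketch of how such throughput figures can be reproduced with standard shell tools (the file name is a placeholder, and wc -w only gives a rough token count that will not exactly match LIMA's tokenization):

```
wc -w input_file.txt                                   # rough token count
time analyzeText -l eng -p ner-deep input_file.txt > /dev/null
# tokens / real time: token count divided by the "real" figure
# tokens / user time: token count divided by the "user" figure
```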