CENL-AI-WG edited this page Oct 15, 2021 · 125 revisions

Named Entity Recognition

Locate and classify named entities mentioned in unstructured text into pre-defined categories (person names, organizations, locations, time expressions, quantities, etc.)

Keywords: NER

Approaches: rule-based, machine learning

Tools: Stanford NER, Corleone, AllenNLP, Spacy, flair, BERT


NER on newspapers

NER applied to a newspaper front page (The New York Herald, 1888). Source: retronews.fr

Goals

Performing named entity recognition (NER) in natural language unstructured text or in the text within metadata is useful in a variety of use cases, from document management and knowledge organisation to information extraction and information retrieval.

The main tasks are named entity recognition (identifying a portion of text as an entity), categorisation (determining the nature of the entity, e.g. a person's name), disambiguation/linking (establishing unambiguously which real-world entity a name refers to) and relation extraction (discovering relations between named entities).

Real world applications of NER in libraries are:

  • information retrieval: named entities used as a resource for information retrieval use cases in digital libraries
  • population of knowledge bases: enrichment of catalog records with named entities information; linking records between knowledge bases
  • cross-lingual document clustering: documents mentioning the same entities are likely to be linked
  • summarization: named entities are informational ’anchors’ helping to identify key elements of a text
  • anonymization: removing named entities (particularly person's names) from documents
  • text analytics: digital humanities, quantitative analysis, etc.

Tutorial

Recognition and categorisation of named entities make use of morphological, lexical or contextual features through rules, gazetteers and other linguistic resources. In real-life systems, these kinds of clues are never totally reliable (in particular for historical materials) and statistical models are needed.

The following sections present a variety of approaches and techniques for NER. This survey is a recent resource and highly recommended reading.

Introduction

Online demos of commercial AI or NLP platforms can give a broad sense of what NER does when applied to heritage textual documents:

Hands-on Text samples (e.g. taken from the New York Herald) can be copied and pasted into the demos. The following illustration shows an AllenNLP NER model applied to a paragraph of text.

AllenNLP demo

AllenNLP NER model applied to newspaper content

These out-of-the-box systems have the advantage of being immediately operational, especially for the English language. On the other hand, since NER is known to be domain sensitive, they are unlikely to give the best results on domain-specific material such as historical newspapers.

Rule-based systems (linguistic grammar-based techniques)

Rule-based systems were the first approach used for NER. Manually crafted rules are generally expressed as regular expressions which combine morphological clues (like uppercase), lexical clues (names, titles) and contextual (local grammar) clues.

The input text is then run through a finite-state automaton (execution is fully automatic, but the rules themselves are written by hand). When the automaton matches a rule, the action part of that rule constructs the named entity annotation.
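The combination of clues described above can be sketched in a few lines of Python. This is only a minimal illustration, not the way GATE, OpenNLP or spaCy implement their rule engines: the title list, the place gazetteer, the label names and the sample sentence are all invented for the example.

```python
import re

# Hypothetical gazetteer of titles that usually precede a person's name.
TITLES = ["Mr", "Mrs", "Dr", "Captain"]
# Hypothetical gazetteer of known place names.
PLACES = {"New York", "Halifax", "Montreal"}

# Rule 1: a title (lexical clue) followed by capitalised words (morphological clue).
PERSON_RULE = re.compile(
    r"\b(?:%s)\.?\s+((?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*)" % "|".join(TITLES)
)
# Rule 2: any gazetteer entry, longest first so that multi-word names win.
PLACE_RULE = re.compile(
    r"\b(?:%s)\b" % "|".join(sorted(map(re.escape, PLACES), key=len, reverse=True))
)

def annotate(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    found = []
    for m in PERSON_RULE.finditer(text):
        found.append((m.start(1), m.end(1), "PER", m.group(1)))
    for m in PLACE_RULE.finditer(text):
        found.append((m.start(), m.end(), "LOC", m.group()))
    return sorted(found)

# Invented sample sentence.
sample = "Captain John Smith cabled from Halifax to Mr. Brown in New York."
for start, end, label, surface in annotate(sample):
    print(label, surface)
```

Each compiled regular expression plays the role of a rule, and the body of the loop is its action part. Real rule sets are much larger and usually need a strategy for resolving overlapping matches, which this sketch ignores.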

Some NER platforms, such as GATE, OpenNLP and spaCy, support rule-based NER:

Hands-on The impresso NER tutorial includes a hands-on session on a rule-based system and its gazetteers component.

Pro/Cons:

  • Developers and linguists create language-specific and domain-specific linguistic resources (gazetteers, sets of rules, etc.), which can be very time consuming. These resources are used for development and evaluation.
  • Developer is in control of the overall annotation pipeline and

Machine learning (statistical models)

Annotators create annotated text corpora according to the target typology of entities. A statistical model can then learn from this annotated data.

The data is used for training, development and evaluation, and the developer only specifies the features, the statistical model and the learning algorithm.

Pro/Cons:

  • Tools are language-independent and the annotation task does not require skilled linguists.
  • But statistical NER typically requires a large amount of manually annotated training data (the annotated data, rather than the developer, is in control).

The next sections introduce various machine learning-based approaches to NER.

CRF (Conditional Random Fields)

Conditional random fields (CRFs) are a class of statistical modeling methods used for structured prediction (introductions to CRFs are listed in the resources section). A CRF model can take context into account and consider "neighboring" samples, an essential feature for text processing and particularly for NER, which can be cast as a sequence labeling problem. For natural language processing, linear chain CRFs, which implement sequential dependencies in the predictions, are popular. Skip-chain CRFs, another variant, can handle long-distance dependencies in the text.
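To make the idea of "neighboring" samples concrete, here is a minimal sketch of a per-token feature function that looks one token to the left and one to the right. The feature names and the window size are illustrative assumptions. The output is a list of feature dictionaries, a format that several CRF toolkits accept, but no particular toolkit is called here.

```python
def token_features(tokens, i):
    """Features for tokens[i], including clues from its left and right neighbours."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # morphological clue: capitalised
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        prev = tokens[i - 1]
        feats["-1:word.lower"] = prev.lower()
        feats["-1:word.istitle"] = prev.istitle()
    else:
        feats["BOS"] = True               # beginning of sentence
    if i < len(tokens) - 1:
        nxt = tokens[i + 1]
        feats["+1:word.lower"] = nxt.lower()
        feats["+1:word.istitle"] = nxt.istitle()
    else:
        feats["EOS"] = True               # end of sentence
    return feats

def sentence_features(tokens):
    """One feature dictionary per token of the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]

tokens = ["président", "de", "la", "White", "Star", "Line"]
for feats in sentence_features(tokens):
    print(feats)
```

A linear chain CRF additionally learns transition weights between adjacent labels, so the context enters both through features like these and through the label sequence itself.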

Hands-on Stanford NLP Group's named entity recognizer is an implementation of linear chain CRF sequence models. Stanford NER is available for download and the package includes components for command-line invocation, running as a server and a Java API. It can also be tested online.

Stanford NER

Stanford NER was used to apply NER to heritage newspapers during the Europeana Newspapers project. Annotated datasets for a variety of languages can be downloaded from the project's GitHub. They use the classical beginning-inside-outside (BIO) tagging scheme, which distinguishes multiple adjacent instances of the same type of named entity and marks named entities spanning multiple words. A model for French, trained on 200 chunks of 1,000 words each extracted from newspapers (1870-1945), is available on api.bnf.fr. After downloading the EN-Stanford.zip archive, open a terminal and launch the Java annotator:

> java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier BnF.ner-model.ser.gz -outputFormat inlineXML -textFile sample.txt 

It outputs annotated text (inline IOB or inline XML format) with three named entity categories (Person, Location, Organization).

Le naufrage du Titanic constitue bien la
plus effroyable catastrophe maritime que
l'on ait eu à enregistrer jusqu'à présent.
Après les angoisses suscitées par les 
premières dépêches annonçant l'accident, on
s'était remis à espérer. Les dépêches de
<I-LIEU>New-York</I-LIEU>, d'Halifax et 
de <I-LIEU>Montréa</I-LIEU>
avaient en partie dissipé Jes terribles 
appréhensions qui avaient d'abord étreint
tous les coeurs.
...
On y trouve cependant les noms de MM.
<I-PERS>Bruce Ismay</I-PERS>, président de la <I-ORG>White Star
Line</I-ORG>, J.-B. <I-PERS>Thayer</I-PERS>, président du 
<I-ORG>Pensylvania Raiiroad</I-ORG>.

Hands-on This blog post demonstrates how to use the CRF implementation provided by the sklearn Python package.
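Training a CRF usually starts from token-level BIO tags rather than inline XML. As a rough sketch, the inline XML produced by the Stanford NER command shown earlier can be converted into (token, tag) pairs as follows. The whitespace tokenisation is a simplifying assumption, and the sample string is a shortened excerpt of the output above.

```python
import re

# Matches tags such as <I-ORG>...</I-ORG>, possibly spanning a line break.
TAG = re.compile(r"<I-(\w+)>(.*?)</I-\1>", re.DOTALL)

def inline_xml_to_bio(text):
    """Turn inline XML output into (token, tag) pairs using B-/I-/O tags."""
    pairs = []
    cursor = 0
    for m in TAG.finditer(text):
        # Text before the entity lies outside any entity.
        pairs += [(tok, "O") for tok in text[cursor:m.start()].split()]
        label = m.group(1)
        for k, tok in enumerate(m.group(2).split()):
            # First token of the entity gets B-, the following ones I-.
            pairs.append((tok, ("B-" if k == 0 else "I-") + label))
        cursor = m.end()
    pairs += [(tok, "O") for tok in text[cursor:].split()]
    return pairs

sample = ("président de la <I-ORG>White Star\n"
          "Line</I-ORG>, J.-B. <I-PERS>Thayer</I-PERS>")
for token, tag in inline_xml_to_bio(sample):
    print(token, tag)
```

The B- prefix is what allows two adjacent entities of the same type to be told apart, which the inline I- tags alone cannot express at token level.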

Pro/Cons:

  • CRF has higher accuracy than other classical methods (Hidden Markov model, MaxEnt).
  • Training and inference can be slow.
  • CRF, like all supervised training methods, requires data that has been annotated for a specific task. Enhancing supervised methods with unsupervised text representations can alleviate this issue. These representations can be trained on large unannotated corpora and can learn implicit semantic and syntactic information.

Distributional semantics

Rule-based and statistical NLP methods consider words as atomic symbols. In contrast, distributional semantics tries to quantify and categorize semantic similarities between linguistic items based on their distributional properties in very large corpora of language data. The underlying idea of distributional semantics can be summed up in this hypothesis: linguistic items with similar distributions have similar meanings.

Hands-on The WebVectors demo lets you submit a word and get its nearest semantic associates. Semantic similarity between words is calculated as a distance (cosine similarity) between their corresponding vectors. We can see that this similarity reflects both syntactic and semantic similarity.

WebVectors
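The cosine similarity used to rank semantic associates is simple to compute. The sketch below uses invented three-dimensional vectors purely for illustration. Real embeddings have hundreds of dimensions and are learned from corpora.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from any trained model.
vectors = {
    "paris":  [0.9, 0.1, 0.3],
    "london": [0.8, 0.2, 0.35],
    "walk":   [0.1, 0.9, 0.2],
}

def nearest(word):
    """Return the other word whose vector is most similar to that of `word`."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest("paris"))
```

Because the measure depends only on the angle between vectors, two words can be close even if their vectors differ in length.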

These distributional characteristics can be captured and used in practical applications (word prediction, survey analysis, recommendation, etc.) to measure similarity between words, phrases or documents. A variety of techniques exist. In the vector space model, each word is represented by a vector whose dimension equals the size of the vocabulary, which results in a very sparse, high-dimensional vector space. Word embedding techniques produce dense, low-dimensional vectors that are well suited as input for numeric machine learning methods.

Word embedding techniques rely on a variety of approaches: neural network inspired, probabilistic or algebraic. word2vec (2013) is the best-known example of word embeddings.

Hands-on This resource visually explains the underlying concepts and the way word embeddings are trained from text data.

Hands-on This word2vec demo trains word embeddings in the browser, given an input text.

For a NER task, the hypothesis is that the vectors of words belonging to the same NE category lie close together in the embedding vector space. A classifier applied to the word vectors can then learn a decision boundary between the NER classes. The next figure illustrates this hypothesis with the WebVectors demo. Proper noun vectors are closely clustered.

word vectors and NER
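The clustering hypothesis can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule on invented two-dimensional vectors. This is a stand-in chosen for brevity, not the classifier used by any of the tools mentioned in this page, and the class labels are assumptions.

```python
def centroid(points):
    """Component-wise mean of a list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Made-up 2-dimensional "embeddings" grouped by NE category.
training = {
    "PER": [[0.9, 0.8], [0.8, 0.9]],
    "LOC": [[0.1, 0.9], [0.2, 0.8]],
    "O":   [[0.5, 0.1], [0.4, 0.2]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

def classify(vector):
    """Assign the category whose centroid is closest to the vector."""
    return min(centroids, key=lambda label: squared_distance(vector, centroids[label]))

print(classify([0.15, 0.85]))
```

The decision boundary here is implied by the centroids. A trained classifier would learn it from many labelled vectors and would also use the context of each word.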

Pro/Cons:

  • Pre-trained word embeddings can be used in NLP tasks that use small amounts of labeled data.
  • word2vec (and other similar approaches like GloVe or FastText) builds a single representation for each word (a synthesis of its different possible contexts), which means polysemy and homonymy are not handled properly. Modern approaches (such as ELMo, BERT, GPT) produce a representation of each word in its particular context in the sentence.
  • word2vec is vocabulary based. Most recent methods work with frequent sequences of characters, which allows them to also represent "out of vocabulary" words (for which no representation has been previously learned) by combining representations of parts of these words.
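A minimal sketch of the subword idea from the last point: a word is broken into character n-grams, and an unseen word gets a vector by combining the vectors of the n-grams it shares with known material. The n-gram sizes, the boundary markers and the averaging step are assumptions made for this example, and the n-gram table is hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word, with < and > marking its boundaries."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known n-grams (zeros if none is known)."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

print(char_ngrams("york", 3, 3))

# Hypothetical n-gram vectors, as if learned from other words.
ngram_vectors = {"yor": [1.0, 0.0], "ork": [0.0, 1.0]}
print(oov_vector("york", ngram_vectors, dim=2))
```

This is also why such models tolerate OCR noise better than purely vocabulary-based ones: a misspelled word still shares most of its n-grams with the correct form.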

Deep learning approaches

Recurrent neural network (RNN) based models have been proposed to tackle sequence tagging problems like named entity recognition. Neural networks enable effective representation learning and have full access to the contextual cues needed for NER. A variety of neural models for sequence tagging have succeeded each other over the years and are considered the state of the art: LSTM, bidirectional LSTM (BI-LSTM), LSTM networks with a CRF layer (LSTM-CRF), bidirectional LSTM networks with a CRF layer (BI-LSTM-CRF). Check out this article for an introduction to these different architectures or this one for the theoretical background. On the LSTM architecture, read this blog post. Chapter 9 of the SLP "bible" is another essential read.

In an NLP context, to understand a sentence we need to process the data in a given sequence, interpreting each word in the context of the words that have come before it. RNNs support the processing of sequential data through the addition of a loop. This loop allows the network to step through sequential input data while persisting the state of the nodes in the hidden layer between steps.

RNN

RNNs maintain a memory based on history information using a hidden layer or specially designed cells, which enables the model to predict the current output conditioned on long-distance features.

RNN memory
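The loop and its persisted hidden state can be written out in plain Python. This sketch implements one step of a basic (Elman-style) RNN, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). The weights and inputs are made up, and LSTM cells, which add gating on top of this, are not covered.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """Compute the new hidden state from the current input and the previous state."""
    h = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_xh, W_hh, b):
    """Step through the sequence, carrying the hidden state from one step to the next."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Made-up weights: 2 input features, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

states = run_rnn(inputs, W_xh, W_hh, b)
for t, h in enumerate(states):
    print(t, h)
```

The first and third inputs are identical, yet their hidden states differ, because the third state also depends on what came before it.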

For a NER task, the network learns to output the most probable sequence of NER tags. Its input layer represents features (one-hot encoding of word features, or dense vector features) and has the same dimensionality as the feature set. Its output layer represents a probability distribution over named entity category labels and has as many dimensions as there are NE categories.
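As a small illustration of the output layer, the sketch below turns per-token scores into a probability distribution with a softmax and picks the most probable label for each token. The label set and the scores are invented, and in a real network the scores would come from the hidden states.

```python
import math

# Assumed label inventory, one output dimension per label.
LABELS = ["O", "PER", "LOC", "ORG"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tags(score_rows):
    """Pick the most probable label independently for each token."""
    tags = []
    for scores in score_rows:
        probs = softmax(scores)
        best = max(range(len(LABELS)), key=lambda k: probs[k])
        tags.append(LABELS[best])
    return tags

# Made-up output scores for the tokens "Bruce", "Ismay", "sailed".
scores = [
    [0.1, 2.3, 0.2, 0.4],
    [0.3, 1.9, 0.1, 0.5],
    [2.5, 0.1, 0.2, 0.1],
]
print(predict_tags(scores))
```

Choosing each label independently ignores constraints between neighbouring tags. Adding a CRF layer on top, as in LSTM-CRF models, selects the best tag sequence as a whole.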

Hands-on Sequence tagging with a LSTM using Python

Hands-on impresso tutorial: neural NER with Spacy and Flair

Pro/Cons:

  • Character and word embeddings trained on large text corpora contain a lot of morphological and lexical information
  • Explicit feature engineering can be reduced to a minimum
  • Relatively small amounts of task-specific annotation data give good performance

Language models

Big transformer-based language models like BERT (Devlin et al., 2019) have become increasingly popular in NLP due to their high performance. They are based on the principle of transfer learning: in the self-supervised pre-training phase they learn general language properties from large amounts of text, which can then be applied to specific downstream tasks through fine-tuning.

Pre-training a big transformer model is rather resource intensive: it requires many GB of text data and GPU-accelerated machines. Fortunately, a lot of pre-trained models of different flavors can be freely downloaded from the Hugging Face repository and fine-tuned for your particular needs. Fine-tuning does not require as many computational resources as pre-training but it does require an annotated dataset for the task at hand.

In order to fine-tune a language model for NER, an annotated NER dataset is required. Open source datasets are available for most major languages, but the quality does vary. Annotating a new dataset or supplementing an existing one might be necessary to have full control of the type of entities that are recognized. As an example, in a library context it might be of particular interest to have a model that can accurately recognize works of art and publishers. In order to do this, additional material might have to be annotated and added to the model fine-tuning step.
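One practical detail when preparing an annotated dataset for fine-tuning is that transformer tokenizers split words into subword pieces, while NER annotations are per word. A common convention, sketched here under stated assumptions, is to keep the label on the first piece of each word and mask the rest. The `word_ids` list imitates what Hugging Face fast tokenizers return from `word_ids()`, but here it is written by hand for a hypothetical tokenisation, and the label ids are made up.

```python
IGNORE = -100  # label value that PyTorch's cross-entropy loss skips by default

def align_labels(word_labels, word_ids):
    """Spread word-level labels over subword tokens.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE)            # special token: no label
        elif wid != previous:
            aligned.append(word_labels[wid])  # first piece of a word keeps its label
        else:
            aligned.append(IGNORE)            # later pieces of the same word are masked
        previous = wid
    return aligned

# Words: ["White", "Star", "Line", "offices"], made-up ids: 0=O, 1=B-ORG, 2=I-ORG
word_labels = [1, 2, 2, 0]
# Hypothetical tokenisation: [CLS] White Star Li ##ne offices [SEP]
word_ids = [None, 0, 1, 2, 2, 3, None]

print(align_labels(word_labels, word_ids))
```

An alternative is to copy the word label to every piece instead of masking. Either choice works as long as training and evaluation use the same one.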

Hands-on The AllenNLP platform makes pretrained models available, like the Fine Grained Named Entity Recognition tagger: "This model identifies a broad range of 16 semantic types in the input text. It is a reimplementation of Lample (2016) and uses a bi-LSTM with a CRF layer, character embeddings and ELMo embeddings". The Model Usage tab from the AllenNLP NER demo page shows how to proceed. See also this tutorial.

A basic Python script (AllenNLP folder) applies the model to a sample of the New York Herald:

> pip3 install allennlp==2.1.0 allennlp-models==2.1.0
> python3 AllenNLP-NER.py

Hands-on This resource shows how one can fine-tune the BERT model to perform named entity recognition.

See also how a language model for Swedish is first built and then used for classical NLP tasks like POS tagging and named entity recognition.

Hands-on A Hugging Face NER demo for Swedish is available.

Pro/Cons:

  • Transformer-based models give the best performance of the approaches presented here
  • Training a transformer model is computationally intensive

Other resources

Implementations