Terminology Extraction - Paper Summary

ATE: Automatic term extraction (TermEval 2020)

TermEval 2020: a platform for researchers to work on ATE.
ATE: the automated process of identifying terminology from a corpus of specialised texts.
Terms: lexical items that represent concepts of a domain.

Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset.

Dataset: Annotated Corpora for Term Extraction Research (ACTER) - v1.2

Descriptions:
- Domains: Corruption, dressage, wind energy (train), heart failure (test).
- Languages: English, French and Dutch.
- ~50k tokens/language/domain manually annotated
  - Unstructured lists of all unique annotated terms.

Labels: term or not (binary task)

Named Entities (optional)
True terms

	Specific Terms	Common Terms	Out-Of-Domain Terms
Domain-specific	x	x	o
Lexical-specific	x	o	x

2 datasets: with and without Named Entities

Evaluation Metrics:
- Precision: how many of the extracted terms are correct.
- Recall: how many of the terms in the text have correctly been extracted.
- F1-score: harmonic mean of precision and recall (computed against two gold standards: terms only, and terms plus Named Entities).
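
As a minimal sketch of the three metrics above, assuming both the system output and the gold standard are flat lists of unique terms (as in the ACTER annotations), the scores reduce to set arithmetic. The function name `evaluate_terms` and the example terms are hypothetical, not taken from the shared task scripts.

```python
def evaluate_terms(extracted, gold):
    """Return (precision, recall, f1) for two collections of unique terms."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted terms that appear in the gold standard.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard terms that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data, for illustration only.
gold_terms = {"heart failure", "ejection fraction", "diuretic", "ventricle"}
system_terms = {"heart failure", "diuretic", "patient"}
precision, recall, f1 = evaluate_terms(system_terms, gold_terms)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Running the same function once against the terms-only list and once against the list that includes Named Entities would give the two score sets mentioned above.
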
Methodology:
- NYU: Termolator on English version.
  - Select candidate terms based on chunking and abbreviations.
  - Calculate distribution metrics, well-formedness, relevance score.
- RACAI: Combine several statistical approaches and vote to generate results on English version only.
  - TextRank, TFIDF, clustering, termhood features.
- e-Terminology:
  - TSR (Token Slot Recognition) technique in TBXTools.
    - Dutch: statistical version
    - English, French: linguistic version
  - Filter out stopwords and candidates with frequency f(term) <= 2.
  - Terminological reference: IATE database for 12-Law.
- NLPLab_UQAM: Bidirectional LSTM with GloVe embeddings on 3 languages.
- TALN-LS2N: English and French only (described in the next paper).
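
The e-Terminology filtering step listed above (drop stopwords and candidates with f(terms) <= 2) might look roughly like the sketch below. It is not the TBXTools implementation: the stopword list is a tiny placeholder, and treating a candidate as a "stopword candidate" when it starts or ends with a stopword is an assumption made here for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption, not the list used by the team).
STOPWORDS = frozenset({"the", "of", "a", "in", "and", "is"})


def filter_candidates(candidates, max_dropped_frequency=2, stopwords=STOPWORDS):
    """Count candidate occurrences and keep only the frequent, non-stopword ones.

    A candidate is dropped when its frequency is <= max_dropped_frequency,
    or when its first or last word is a stopword (assumed interpretation).
    """
    counts = Counter(candidate.lower() for candidate in candidates)
    kept = {}
    for term, frequency in counts.items():
        words = term.split()
        if not words or frequency <= max_dropped_frequency:
            continue
        if words[0] in stopwords or words[-1] in stopwords:
            continue
        kept[term] = frequency
    return kept


# Hypothetical candidate occurrences collected from a corpus.
candidates = (["heart failure"] * 4 + ["of the heart"] * 5
              + ["ventricle"] * 2 + ["the"] * 9)
kept = filter_candidates(candidates)
print(kept)
```
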
Notes:
- TALN-LS2N’s system outperforms all others in the English and French tracks.
- NLPLab_UQAM’s system outperforms e-Terminology for the Dutch track.
- Unpredictability of DL models (BERT)
  - Large gap between precision and recall for English model, much smaller for French model.
- ACTER v1.3
  - Data description in README.html.

TALN-LS2N System for Automatic Term Extraction.

Dataset: Annotated Corpora for Term Extraction Research (ACTER)
- Training phase: Corruption, dressage, wind energy.
- Test phase: Heart failure.
- Languages: English, French and Dutch.
Proposed systems:
- Feature-based approaches
- Context-based approaches

Feature Extraction

Linguistic filtering

spaCy’s rule-matching engine
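
To illustrate what linguistic filtering does, here is a dependency-free sketch that selects candidate spans by part-of-speech pattern. It is a stand-in for, not a reproduction of, the spaCy rule-matching step named above: it assumes tokens are already POS-tagged, and the pattern (optional adjectives followed by one or more nouns) is a hypothetical example rather than the pattern set used by the authors.

```python
import re

# Map coarse POS tags to single letters so a tag sequence becomes a string.
TAG_TO_LETTER = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}

# Hypothetical pattern: zero or more adjectives, then one or more nouns.
TERM_PATTERN = re.compile(r"A*N+")


def candidate_terms(tagged_tokens, max_len=4):
    """Return every span of at most max_len tokens whose tags match TERM_PATTERN."""
    # Tags outside the mapping become "x", which the pattern never matches.
    letters = "".join(TAG_TO_LETTER.get(pos, "x") for _, pos in tagged_tokens)
    candidates = []
    for start in range(len(tagged_tokens)):
        last_end = min(start + max_len, len(tagged_tokens))
        for end in range(start + 1, last_end + 1):
            if TERM_PATTERN.fullmatch(letters[start:end]):
                words = [token for token, _ in tagged_tokens[start:end]]
                candidates.append(" ".join(words))
    return candidates


# Hypothetical pre-tagged sentence (token, coarse POS tag).
sentence = [("Chronic", "ADJ"), ("heart", "NOUN"), ("failure", "NOUN"),
            ("requires", "VERB"), ("treatment", "NOUN")]
found = candidate_terms(sentence)
print(found)
```

The filter deliberately over-generates (nested spans such as "heart" and "heart failure" are both kept); ranking and selection are left to the later steps.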

Candidate description

- Linguistic, stylistic, statistical, and distributional descriptors.
- Termhood: the degree to which a linguistic unit is related to domain-specific context.
- Measures: specificity, the term’s relation to context, C-value, termhood.
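
Of the measures listed, C-value has a compact closed form. The sketch below follows the commonly cited simplified formulation: log2 of the term length times its frequency, with the frequency reduced by the average frequency of the longer candidates that contain the term. Whether the TALN-LS2N system uses exactly this variant is not stated here, so treat it as an assumption; the frequencies are made up for illustration.

```python
import math


def contains(longer, shorter):
    """True if the word list `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(frequencies):
    """Compute C-value for each candidate in a {term: frequency} dict.

    Terms are assumed to be non-empty, whitespace-separated strings.
    Note that single-word terms score 0 here because log2(1) == 0.
    """
    tokenized = {term: term.split() for term in frequencies}
    scores = {}
    for term, words in tokenized.items():
        # Longer candidates that contain this candidate as a nested sequence.
        containers = [other for other, other_words in tokenized.items()
                      if len(other_words) > len(words)
                      and contains(other_words, words)]
        frequency = frequencies[term]
        if containers:
            nested = sum(frequencies[c] for c in containers) / len(containers)
            frequency -= nested
        scores[term] = math.log2(len(words)) * frequency
    return scores


# Hypothetical corpus frequencies.
freqs = {"heart failure": 10, "chronic heart failure": 4, "acute heart failure": 2}
scores = c_value(freqs)
print(scores)
```

Here "heart failure" is nested in two longer candidates with average frequency 3, so its score is log2(2) * (10 - 3) = 7.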

Selection phase

Classification - Boosting

- scikit-learn StandardScaler
- eXtreme Gradient Boosting (XGBoost)
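
For readers unfamiliar with the scaling step, the sketch below shows in plain Python what standard scaling computes: each feature column is shifted to zero mean and divided by its population standard deviation. It is an illustration of the transformation only, not the scikit-learn implementation, and the feature values are hypothetical; the boosted classifier would then be trained on the scaled rows.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Scale each feature column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # A constant column has zero deviation; dividing by 1.0 keeps it at 0.
    scales = [pstdev(column) or 1.0 for column in columns]
    return [[(value - mean) / scale
             for value, mean, scale in zip(row, means, scales)]
            for row in rows]


# Hypothetical candidate descriptors: two features per candidate term.
features = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
scaled = standardize(features)
print(scaled)
```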

Formats:
- Input: The sentence that contains the term
- Output: The term
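
One way to read the format above is as a list of sentence/term pairs built from the annotated corpus. The sketch below pairs each sentence with every gold term it contains; the dictionary keys `input` and `output`, the case-insensitive whole-word matching, and the example data are all assumptions for illustration, not the authors' preprocessing code.

```python
import re


def make_examples(sentences, gold_terms):
    """Pair each sentence (input) with every gold term found in it (output)."""
    examples = []
    for sentence in sentences:
        for term in sorted(gold_terms):
            # Word boundaries stop a short term from matching inside a longer word.
            pattern = rf"\b{re.escape(term)}\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                examples.append({"input": sentence, "output": term})
    return examples


# Hypothetical sentences and gold terms.
sentences = ["Heart failure is treated with diuretics.",
             "The patient was discharged."]
gold = {"heart failure", "diuretics"}
examples = make_examples(sentences, gold)
for example in examples:
    print(example)
```
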
Models

English: RoBERTa

- Modifies key hyperparameters of the original BERT.
- Removes the next-sentence prediction pretraining objective.
- Trains with much larger mini-batches and learning rates.

French: CamemBERT

Pre-trained models are used and fine-tuned during classification.

Notes:
- BERT outperforms classical methods
- New, simple and strong baseline for terminology extraction

Contributors:

🐮 @honghanhh
