Terminology Extraction - Paper Summary

ATE: Automatic term extraction (TermEval 2020)


  • TermEval 2020: a platform for researchers to work on ATE.
  • ATE: the automated process of identifying terminology from a corpus of specialised texts.
  • Terms: lexical items that represent concepts of a domain.
  1. Shared Task on Automatic Term Extraction Using the Annotated Corpora for Term Extraction Research (ACTER) Dataset.
  • Dataset: Annotated Corpora for Term Extraction Research (ACTER) - v1.2

    • Descriptions:

      • Domains: Corruption, dressage, wind energy (train), heart failure (test).
      • Languages: English, French and Dutch.
      • ~50k tokens/language/domain manually annotated
        • Unstructured lists of all unique annotated terms.
    • Labels: term or not (binary task)

      • Named Entities (optional)
      • True terms
      | | Specific Terms | Common Terms | Out-of-Domain Terms |
      |---|---|---|---|
      | Domain-specific | x | x | o |
      | Lexical-specific | x | o | x |
    • 2 datasets: with and without Named Entities

  • Evaluation Metrics:

    • Precision: how many of the extracted terms are correct.
    • Recall: how many of the terms in the text have correctly been extracted.
    • F1-score: harmonic mean of precision and recall (reported against two gold standards: terms only, and terms plus Named Entities).
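The three metrics above can be sketched over sets of unique terms. This is a minimal illustration, not the official TermEval scorer: the function name `evaluate_terms`, the lowercasing step, and the toy term lists are all assumptions made for the example.

```python
def evaluate_terms(extracted, gold):
    """Precision, recall and F1 of an extracted term list against a gold list."""
    # Compare unique, lowercased terms (lowercasing is an assumption of this sketch).
    extracted_set = {t.lower() for t in extracted}
    gold_set = {t.lower() for t in gold}
    true_positives = len(extracted_set & gold_set)

    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold_terms = ["heart failure", "ejection fraction", "diuretic", "beta blocker"]
system_terms = ["heart failure", "diuretic", "patient"]
precision, recall, f1 = evaluate_terms(system_terms, gold_terms)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

To score against the gold standard that includes Named Entities, the same function would simply be called with the larger gold list.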
  • Methodology:

    • NYU: Termolator on English version.

      • Select candidate terms based on chunking and abbreviations.
      • Calculate distribution metrics, well-formedness, relevance score.
    • RACAI: Combine several statistical approaches and vote to generate results, on the English version only.

      • TextRank, TFIDF, clustering, termhood features.
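A voting step of this kind could look roughly like the sketch below. The summary does not describe RACAI's actual voting rule, so the majority rule, the `top_k` and `min_votes` parameters, and the three toy rankings are assumptions for illustration.

```python
from collections import Counter


def vote(ranked_lists, top_k=3, min_votes=2):
    """Keep candidates that appear in the top_k of at least min_votes rankers."""
    votes = Counter()
    for ranking in ranked_lists:
        # set() so that each ranker contributes at most one vote per candidate.
        for candidate in set(ranking[:top_k]):
            votes[candidate] += 1
    return sorted(c for c, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three independent rankers (best candidate first).
textrank_ranking = ["wind turbine", "rotor blade", "energy", "year"]
tfidf_ranking = ["rotor blade", "wind turbine", "gearbox", "energy"]
cluster_ranking = ["gearbox", "wind turbine", "year", "rotor blade"]

selected = vote([textrank_ranking, tfidf_ranking, cluster_ranking])
print(selected)
```

Raising `min_votes` trades recall for precision, since fewer candidates are agreed on by every ranker.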
    • e-Terminology:

      • TSR (Token Slot Recognition) technique in TBXTools.
        • Dutch: statistical version
        • English, French: linguistic version
      • Filter out stopwords and terms with frequency f(term) <= 2.
      • Terminological reference: IATE database for 12-Law.
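The stopword and frequency filter mentioned above can be sketched as follows. This is not TBXTools code: the stopword list, the rule of dropping candidates that start or end with a stopword, and the helper name `filter_candidates` are assumptions chosen for the example.

```python
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"the", "of", "a", "in", "and"}


def filter_candidates(candidates, min_freq=3):
    """Count candidates, then drop stopword-bounded and low-frequency ones."""
    counts = Counter(c.lower() for c in candidates)
    kept = {}
    for term, freq in counts.items():
        tokens = term.split()
        if not tokens:
            continue
        # Assumption: a candidate is rejected if it starts or ends with a stopword.
        if tokens[0] in STOPWORDS or tokens[-1] in STOPWORDS:
            continue
        # min_freq=3 removes every candidate with f(term) <= 2.
        if freq < min_freq:
            continue
        kept[term] = freq
    return kept


# Toy candidate occurrences, invented for illustration.
candidates = (
    ["abuse of power"] * 3
    + ["the bribe"] * 4
    + ["bribery"] * 2
    + ["money laundering"] * 5
)
kept = filter_candidates(candidates)
print(kept)
```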
    • NLPLab_UQAM: Bidirectional LSTM with GloVe embeddings, applied to all three languages.

    • TALN-LS2N: only English, French (described in next paper).

  • Notes:

    • TALN-LS2N’s system outperforms all others in the English and French tracks.
    • NLPLab UQAM’s system outperforms e-Terminology for the Dutch track.
    • Unpredictability of DL models (BERT)
      • Large gap between precision and recall for English model, much smaller for French model.
    • ACTER v1.3
  2. TALN-LS2N System for Automatic Term Extraction.
  • Dataset: Annotated Corpora for Term Extraction Research (ACTER)

    • Training phase: Corruption, dressage, wind energy.
    • Test phase: Heart failure.
    • Languages: English, French and Dutch.
  • Proposed systems:

    • Two families of systems: feature-based and context-based.
    • Feature-based approaches:
      1. Feature extraction
        • Linguistic filtering
          • spaCy’s rule-matching engine
        • Candidate description
          • Linguistic, stylistic, statistical, and distributional descriptors
          • Termhood: degree to which a linguistic unit is related to a domain-specific context
          • Measures: Specificity, Term’s relation to Context, C-value, Termhood
        • Selection phase
      2. Classification (boosting)
        • sklearn StandardScaler
        • eXtreme Gradient Boosting (XGBoost)
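Of the measures listed for the feature-based system, C-value is the easiest to show in a few lines. The sketch below follows the commonly cited definition (log2 of the term length times its frequency, reduced by the average frequency of the longer candidates that contain it). It is not the TALN-LS2N implementation, and the helper names and toy frequencies are assumptions.

```python
import math


def is_nested(short, long):
    """True if token tuple `short` occurs contiguously inside the longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_value(freqs):
    """Map each candidate term to its C-value, given term -> corpus frequency."""
    tokens = {term: tuple(term.split()) for term in freqs}
    scores = {}
    for term, freq in freqs.items():
        # Longer candidates that contain this term as a contiguous sub-sequence.
        containers = [o for o in freqs if is_nested(tokens[term], tokens[o])]
        adjusted = freq
        if containers:
            adjusted -= sum(freqs[o] for o in containers) / len(containers)
        # Note: with log2(length), single-word candidates always score 0.
        scores[term] = math.log2(len(tokens[term])) * adjusted
    return scores


# Toy frequencies, invented for illustration.
freqs = {"heart failure": 10, "chronic heart failure": 4, "acute heart failure": 2}
scores = c_value(freqs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.2f}")
```

In a feature-based pipeline, a score like this would be one column of the feature matrix that is scaled and then passed to the boosting classifier.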
    • Context-based approaches:
      1. Formats:
        • Input: a sentence containing the term
        • Output: the term
      2. Models
        • English: RoBERTa
          • Modifies key hyperparameters of the original BERT
          • Removes BERT’s next-sentence pretraining objective
          • Trained with much larger mini-batches and larger learning rates
        • French: CamemBERT
        • Pre-trained models are used and fine-tuned on the classification task
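One way to read the sentence-in, term-out format is as binary classification over (sentence, candidate) pairs, which a fine-tuned model would then score. The sketch below only builds such pairs and does not load any model. The pair framing, whitespace tokenisation, the `max_len` limit, and the names `ngrams` and `build_pairs` are assumptions, not details confirmed by the summary.

```python
def ngrams(tokens, max_len):
    """Yield every contiguous n-gram of up to max_len tokens as a string."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_pairs(sentence, gold_terms, max_len=2):
    """Return (sentence, candidate, label) triples; label is 1 for gold terms."""
    # Whitespace tokenisation is a simplification; punctuation is not handled.
    tokens = sentence.lower().split()
    gold = {t.lower() for t in gold_terms}
    # dict.fromkeys removes duplicate n-grams while keeping first-seen order.
    candidates = dict.fromkeys(ngrams(tokens, max_len))
    return [(sentence, cand, int(cand in gold)) for cand in candidates]


# Toy sentence and gold terms, invented for illustration.
sentence = "Heart failure reduces ejection fraction"
pairs = build_pairs(sentence, ["heart failure", "ejection fraction"])
positives = [cand for _, cand, label in pairs if label == 1]
print(len(pairs), positives)
```

At prediction time the same pair construction would be applied to unseen sentences, and the candidates the classifier labels positive would form the extracted term list.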
  • Notes:

    • BERT outperforms classical methods
    • New, simple and strong baseline for terminology extraction

Contributors: