andrazrepar/terminology-hahn

Terminology Extraction

The references are summarized in Summary.md.

ACTER data structure

ACTER
│   README.md
│   sources.txt
│
└───en
│   └───corp
│   │   └───annotations
│   │   │   │   corp_en_terms.ann
│   │   │   │   corp_en_terms_nes.ann
│   │   │
│   │   └───texts
│   │       └───annotated
│   │       │   corp_en_01.txt
│   │       │   corp_en_02.txt
│   │       │   ...
│   │       │
│   │       └───unannotated
│   │           │   corp_en_03.txt
│   │           │   ...
│   │
│   └───equi (equivalent to "corp")
│   │
│   └───htfl (equivalent to "corp")
│   │
│   └───wind (equivalent to "corp")
│
└───fr (equivalent to "en")
└───nl (equivalent to "en")

The distribution of terms per domain and language is shown in data.exploration.ipynb.

Architecture

1. Preprocess data

We frame the problem as sequence labeling, which means the model returns a label for each token. To do so, the annotations are converted into BIO labels:

 B - the token is the first word of a term,
 I - the token is inside a term,
 O - the token is not part of any term.

For example:

... greco is the most inclusive existing anti - corruption monitoring mechanism ...
...   O    O  O   O       O       O       O   O     B          B          I     ...
  • Input
    ./ACTER/en/*/*_en_terms.ann
    ./ACTER/en/texts/annotated/*
  • Command
    cd models
    python prepocess.py
  • Output
    ./preprocessed_data/train.pkl
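Conceptually, the conversion can be sketched as below. `bio_labels` is a hypothetical helper (the actual logic lives in the repository's preprocessing script) that greedily matches gold terms against the token stream, preferring longer terms; the term list passed here is an assumption chosen to reproduce the example above.

```python
def bio_labels(tokens, terms):
    """Label each token B/I/O by greedily matching gold terms, longest first."""
    term_tokens = [t.lower().split() for t in terms]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match = None
        for tt in sorted(term_tokens, key=len, reverse=True):  # longest match first
            if [tok.lower() for tok in tokens[i:i + len(tt)]] == tt:
                match = tt
                break
        if match:
            labels[i] = "B"                        # first word of the term
            for j in range(i + 1, i + len(match)):
                labels[j] = "I"                    # remaining words of the term
            i += len(match)
        else:
            i += 1                                 # token outside any term
    return labels

tokens = "greco is the most inclusive existing anti - corruption monitoring mechanism".split()
print(bio_labels(tokens, ["corruption", "monitoring mechanism"]))
```

With that (assumed) term list, the output matches the O/B/B/I labeling shown in the example sentence.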

2. Reformat training data

The training set contains the texts from 3 domains (corp, equi, wind) and is formatted as:

| sentence_id |   words   | labels | 
|   :----:    |   :---:   | :----: | 
|      3      |   greco   |    O   | 
|      3      |     is    |    O   | 
|      3      |    the    |    O   | 
|      3      |    most   |    O   | 
|      3      | inclusive |    O   | 
|      3      |  existing |    O   | 
|      3      |    anti   |    O   | 
|      3      |     -     |    O   | 
|      3      | corruption|    B   | 
|      3      | monitoring|    B   | 
|      3      | mechanism |    I   | 
  • Input
    ./preprocessed_data/train.pkl
  • Command
    cd models
    python format_data.py
  • Output
    ./preprocessed_data/train.csv
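This step presumably flattens each sentence's (tokens, labels) pair into one row per token. A minimal standard-library sketch (function and variable names are illustrative, not taken from format_data.py):

```python
import csv
import io

def to_rows(sentences):
    """Flatten (tokens, labels) pairs into (sentence_id, word, label) rows."""
    rows = []
    for sent_id, (tokens, labels) in enumerate(sentences):
        for word, label in zip(tokens, labels):
            rows.append((sent_id, word, label))
    return rows

# One toy sentence; in the pipeline these would come from train.pkl.
sentences = [(["corruption", "monitoring", "mechanism"], ["B", "B", "I"])]
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sentence_id", "words", "labels"])
writer.writerows(to_rows(sentences))
print(buf.getvalue())
```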

3. Train model & evaluation

The workflow of our implementation is shown in the Workflow diagram.

  • Models were trained with SimpleTransformers on the English dataset on Google Colab.
    • BERT variants: BERT, RoBERTa, DistilBERT
    • XLNet

We tested both monolingual and bilingual pre-trained models and apply pattern filtering as a postprocessing step during inference.
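The exact filtering patterns are not documented here; as an illustration, a filter of this kind might drop predicted "terms" that contain no letters, which would catch stray outputs such as "[", "-" or "%" (the function name and rule are assumptions, not the repository's actual filter):

```python
import re

def pattern_filter(candidates):
    """Keep only candidate terms containing at least one letter
    (illustrative postprocessing rule; the real patterns may differ)."""
    keep = re.compile(r"[a-z]", re.IGNORECASE)
    return [c for c in candidates if keep.search(c)]

print(pattern_filter(["candesartan", " ", "[", "-", "%", "bet inhibition"]))
```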

  • Train/val/test:

    • Train/val: texts from 3 domains (corp, equi, wind) with an 80/20 split
    • Test: heart failure texts
  • Default settings:

    adam_epsilon: float = 1e-8
    early_stopping_metric: str = "eval_loss"
    early_stopping_patience: int = 3
    eval_batch_size: int = 16
    learning_rate: float = 4e-5
    manual_seed: int = 2203
    max_seq_length: int = 512
    num_train_epochs: int = 4
    optimizer: str = "AdamW"
  • Evaluation metrics: the final output is a list of lowercased terms, which is compared against the gold term list to calculate Precision, Recall and F1-score.
    • Precision = TP / (TP + FP)
    • Recall = TP / (TP + FN)
    • F1-score = 2 * (Precision * Recall) / (Precision + Recall)
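These metrics reduce to simple set arithmetic over the lowercased predicted and gold term lists. A sketch (the repository's evaluation code may differ in details such as duplicate handling):

```python
def prf(predicted, gold):
    """Precision, recall and F1 over lowercased term sets."""
    pred = {t.lower() for t in predicted}
    true = {t.lower() for t in gold}
    tp = len(pred & true)                            # terms both predicted and correct
    precision = tp / len(pred) if pred else 0.0      # TP / (TP + FP)
    recall = tp / len(true) if true else 0.0         # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf(["Candesartan", "hazard ratio"],
          ["candesartan", "hazard ratio", "ejection fraction"]))
```

Here both predictions are correct but one gold term is missed, so precision is 1.0, recall is 2/3 and F1 is 0.8.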

Discussion

  • Why do our models outperform TALN-LS2N on lower-resourced languages such as French and Dutch?

    • ACTER was updated from v1.2 to v1.4:
    **Changes version 1.2 > version 1.3**
    
    * corrected wrong sources in htfl_nl
    * changed heart failure abbreviation to "htfl" to be consistent with four-letter domain abbreviations
    * created Github repository for data + submitted it to CLARIN
    
    
    **Changes version 1.3 > version 1.4**
    
    * applied limited normalisation on both texts and annotations:
        * unicodedata.normalize("NFC", text)
        * normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'
    
    • Sequence labeling outperforms N-gram classification.
    • Tricky training dataset.
  • How good are our model's results? What can we do to improve them?

    ...
    
    'candesartan',
    'bet inhibition',
    'intraclass correlation coefficient',
    'cardiopulmonary exercise testing',
    'myocardial extracellular matrix accumulation',
    'implantable cardioverter defibrillator',
    
    ...
    
    ' ',
    '[',
    '-',
    '%',
    
    ...
    
  • Is the data good enough?

    ...
    'hazard ratio',
    'hazard ratios',
    
    ...
    
    'health care',
    'health-care',
    'heart  failure',
    
    ...    
    
    'heart transplant',
    'heart transplantation',
    'heart transplantations'
    
    ...
    
    'implantable cardioverter defibrillator',
    'implantable cardioverter defibrillators',
    'implantable cardioverter-defibrillator',
    'implantable cardioverter-defibrillators',
    
    ...
    
    'rv'
    "s'"
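One quick way to surface such near-duplicates is to apply the same limited normalisation the v1.4 release notes describe (NFC, unified dashes and quotes) plus whitespace collapsing, then compare the resulting set. A sketch (not the official ACTER normalisation script):

```python
import re
import unicodedata

def normalize_term(term):
    """ACTER v1.4-style normalisation plus whitespace collapsing (a sketch)."""
    term = unicodedata.normalize("NFC", term.lower())
    term = re.sub(r"[\u2010-\u2015]", "-", term)  # all dash variants -> "-"
    term = re.sub(r"[\u2018\u2019]", "'", term)   # curly single quotes -> "'"
    term = re.sub(r"[\u201c\u201d]", '"', term)   # curly double quotes -> '"'
    term = re.sub(r"\s+", " ", term).strip()      # collapse doubled spaces
    return term

terms = ["heart  failure", "heart failure", "health-care", "health care"]
print(sorted({normalize_term(t) for t in terms}))
```

Note that this pass merges "heart  failure" with "heart failure" but leaves "health care" vs "health-care" and singular/plural pairs distinct; resolving those would need hyphen-aware matching or lemmatization.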
    

Future work

  • Tune hyperparameters.
  • Improve preprocessing and postprocessing.
  • Experiment with other languages.

References

Contributors:
