The references are summarized in Summary.md.
ACTER
│   README.md
│   sources.txt
│
└───en
│   └───corp
│   │   └───annotations
│   │   │       corp_en_terms.ann
│   │   │       corp_en_terms_nes.ann
│   │   │
│   │   └───texts
│   │       └───annotated
│   │       │       corp_en_01.txt
│   │       │       corp_en_02.txt
│   │       │       ...
│   │       │
│   │       └───unannotated
│   │               corp_en_03.txt
│   │               ...
│   │
│   └───equi (equivalent to "corp")
│   │
│   └───htfl (equivalent to "corp")
│   │
│   └───wind (equivalent to "corp")
│
└───fr (equivalent to "en")
└───nl (equivalent to "en")
The distribution of terms per domain and per language is shown in data.exploration.ipynb.
We frame the problem as sequence labeling, which means the model returns one label per token. To do that, the term annotations are converted into BIO labels:
B - the token is the first token of a term,
I - the token is inside a term,
O - the token is not part of any term.
For example:
... greco is the most inclusive existing anti - corruption monitoring mechanism ...
... O O O O O O O O B B I ...
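A minimal sketch of this BIO conversion, assuming a greedy longest-match lookup against the annotated term list (the function name `bio_tag`, the whitespace tokenization, and the example terms "corruption" and "monitoring mechanism" are illustrative assumptions, not the actual preprocessing code):

```python
# Hedged sketch of the term-list -> BIO conversion; the real preprocessing script may differ.
def bio_tag(tokens, terms):
    """Greedy longest-match tagging of known terms in a tokenized sentence."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i that matches an annotated term.
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]).lower() in terms:
                labels[i] = "B"
                labels[i + 1:j] = ["I"] * (j - i - 1)
                i = j
                break
        else:
            i += 1
    return labels

tokens = "greco is the most inclusive existing anti - corruption monitoring mechanism".split()
print(bio_tag(tokens, {"corruption", "monitoring mechanism"}))
# -> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'I']
```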
- Input:
  - ./ACTER/en/*/*_en_terms.ann
  - ./ACTER/en/texts/annotated/*
- Command:
  - cd models
  - python prepocess.py
- Output:
  - ./preprocessed_data/train.pkl
The training set contains texts from three domains (corp, equi, wind) and is formatted as follows (a quick inspection snippet follows the table):
| sentence_id | words | labels |
| :----: | :---: | :----: |
| 3 | greco | O |
| 3 | is | O |
| 3 | the | O |
| 3 | most | O |
| 3 | inclusive | O |
| 3 | existing | O |
| 3 | anti | O |
| 3 | - | O |
| 3 | corruption | B |
| 3 | monitoring | B |
| 3 | mechanism | I |
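For a quick sanity check of the produced file, assuming the pickle holds a pandas DataFrame with exactly these three columns (an assumption, not something the script guarantees):

```python
import pandas as pd

# Inspect the preprocessed training data and print the example sentence above.
df = pd.read_pickle("./preprocessed_data/train.pkl")
print(df.columns.tolist())            # expected: ['sentence_id', 'words', 'labels']
print(df[df["sentence_id"] == 3])     # tokens and labels of the sentence shown in the table
```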
- Input:
  - ./preprocessed_data/train.pkl
- Command:
  - cd models
  - python format_data.py
- Output:
  - ./preprocessed_data/train.csv
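If format_data.py only reshapes the pickle into the CSV layout that SimpleTransformers expects, the step could be as small as the sketch below (an assumption about the script, shown for orientation only):

```python
import pandas as pd

# Illustrative pkl -> csv conversion; the actual format_data.py may do additional cleaning.
df = pd.read_pickle("./preprocessed_data/train.pkl")
df.to_csv("./preprocessed_data/train.csv", index=False)
```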
The workflow of our implementation:
- Models are trained with SimpleTransformers on the English dataset on Google Colab.
- BERT variants: BERT, RoBERTa, DistilBERT
- XLNet
We tested both monolingual and bilingual pre-trained models and applied pattern filtering as a postprocessing step during inference.
Train/val/test split:
- Train/val: texts from three domains (corp, equi, wind), split with an 80/20 ratio
- Test: heart failure (htfl) texts
Default settings (a training sketch using these values follows the list):
- adam_epsilon: float = 1e-8
- early_stopping_metric: str = "eval_loss"
- early_stopping_patience: int = 3
- eval_batch_size: int = 16
- learning_rate: float = 4e-5
- manual_seed: int = 2203
- max_seq_length: int = 512
- num_train_epochs: int = 4
- optimizer: str = "AdamW"
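A hedged sketch of how these defaults could be wired into a SimpleTransformers training run; the checkpoint (`bert-base-cased`) and the sentence-level 80/20 split logic are illustrative assumptions, not necessarily the exact setup used:

```python
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# Load the formatted training data (sentence_id / words / labels).
df = pd.read_csv("./preprocessed_data/train.csv")

# Sentence-level 80/20 train/validation split, as described above (assumed split logic).
sentence_ids = df["sentence_id"].unique()
cut = int(0.8 * len(sentence_ids))
train_df = df[df["sentence_id"].isin(sentence_ids[:cut])]
eval_df = df[df["sentence_id"].isin(sentence_ids[cut:])]

# Mirror the default settings listed above.
args = NERArgs(
    adam_epsilon=1e-8,
    early_stopping_metric="eval_loss",
    early_stopping_patience=3,
    eval_batch_size=16,
    learning_rate=4e-5,
    manual_seed=2203,
    max_seq_length=512,
    num_train_epochs=4,
    optimizer="AdamW",
)

# "bert-base-cased" is an illustrative checkpoint; RoBERTa, DistilBERT or XLNet variants can be swapped in.
model = NERModel("bert", "bert-base-cased", labels=["B", "I", "O"], args=args)
model.train_model(train_df, eval_data=eval_df)
```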
- Evaluation metrics: the final output is a list of lowercased terms, which is compared with the gold term list to compute Precision, Recall, and F1-score (see the sketch after the formulas).
  - Precision = TP / (TP + FP)
  - Recall = TP / (TP + FN)
  - F1-score = 2 * (Precision * Recall) / (Precision + Recall)
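Because the comparison is between two lowercased term lists, the metrics reduce to set operations; a small sketch (the function name is illustrative):

```python
def term_prf(predicted_terms, gold_terms):
    """Set-based precision / recall / F1 over lowercased term lists."""
    pred, gold = set(predicted_terms), set(gold_terms)
    tp = len(pred & gold)                              # terms both predicted and annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(term_prf(["heart failure", "hazard ratio"], ["heart failure", "ejection fraction"]))
# -> (0.5, 0.5, 0.5)
```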
Why do our models outperform TALN-LS2N on lower-resourced languages such as French and Dutch?
- ACTER was updated from v1.2 to v1.4:
  **Changes version 1.2 > version 1.3**
  * corrected wrong sources in htfl_nl
  * changed heart failure abbreviation to "htfl" to be consistent with four-letter domain abbreviations
  * created GitHub repository for data + submitted it to CLARIN

  **Changes version 1.3 > version 1.4**
  * applied limited normalisation on both texts and annotations:
    * unicodedata.normalize("NFC", text)
    * normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'
- Sequence labeling outperforms n-gram classification.
- Tricky training dataset.
How good are our model results? What can we do to make them better?
... 'candesartan', 'bet inhibition', 'intraclass correlation coefficient', 'cardiopulmonary exercise testing', 'myocardial extracellular matrix accumulation', 'implantable cardioverter defibrillator', ... ' ', '[', '-', '%', ...
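The junk entries at the end of that list are what the pattern-filtering postprocessing step mentioned above is meant to catch; one possible filter, with an assumed rule (keep only candidates containing at least one letter), is sketched below:

```python
import re

# Illustrative postprocessing filter: drop candidates that are bare punctuation,
# whitespace, or symbols such as '%'; the actual filtering pattern may differ.
def filter_terms(candidates):
    return [t for t in candidates if re.search(r"[a-z]", t.lower())]

print(filter_terms(["candesartan", "bet inhibition", " ", "[", "-", "%"]))
# -> ['candesartan', 'bet inhibition']
```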
Is the data good enough?
... 'hazard ratio', 'hazard ratios', ... 'health care', 'health-care', 'heart failure', ... 'heart transplant', 'heart transplantation', 'heart transplantations' ... 'implantable cardioverter defibrillator', 'implantable cardioverter defibrillators', 'implantable cardioverter-defibrillator', 'implantable cardioverter-defibrillators', ... 'rv' "s'"
- Tuning hyperparameters.
- Adding preprocessing and postprocessing steps.
- Experimenting with other languages.