The references are summarized in Summary.md.
ACTER
│   README.md
│   sources.txt
│
└───en
│   └───corp
│   │   └───annotations
│   │   │       corp_en_terms.ann
│   │   │       corp_en_terms_nes.ann
│   │   │
│   │   └───texts
│   │       └───annotated
│   │       │       corp_en_01.txt
│   │       │       corp_en_02.txt
│   │       │       ...
│   │       │
│   │       └───unannotated
│   │               corp_en_03.txt
│   │               ...
│   │
│   └───equi (equivalent to "corp")
│   │
│   └───htfl (equivalent to "corp")
│   │
│   └───wind (equivalent to "corp")
│
└───fr (equivalent to "en")
└───nl (equivalent to "en")
The distribution of terms per domain and per language is shown in data.exploration.ipynb.
We frame the problem as sequence labeling, which means the model returns one label per token. To do that, the term annotations are converted into BIO labels:
B - the token is the first token of a term,
I - the token is inside a term,
O - the token is not part of any term.
For example:
... greco is the most inclusive existing anti - corruption monitoring mechanism ...
... O O O O O O O O B B I ...
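A minimal sketch of this BIO conversion, assuming a greedy longest-match lookup against the annotated term list (the function name `bio_tag`, the whitespace tokenization, and the example terms "corruption" and "monitoring mechanism" are illustrative assumptions, not the actual preprocessing code):

```python
# Hedged sketch of the term-list -> BIO conversion; the real preprocessing script may differ.
def bio_tag(tokens, terms):
    """Greedy longest-match tagging of known terms in a tokenized sentence."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest span starting at i that matches an annotated term.
        for j in range(len(tokens), i, -1):
            if " ".join(tokens[i:j]).lower() in terms:
                labels[i] = "B"
                labels[i + 1:j] = ["I"] * (j - i - 1)
                i = j
                break
        else:
            i += 1
    return labels

tokens = "greco is the most inclusive existing anti - corruption monitoring mechanism".split()
print(bio_tag(tokens, {"corruption", "monitoring mechanism"}))
# -> ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B', 'B', 'I']
```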
- Input:
  - ./ACTER/en/*/*_en_terms.ann
  - ./ACTER/en/texts/annotated/*
- Command:
  - cd models
  - python prepocess.py
- Output:
  - ./preprocessed_data/train.pkl
The training set contains texts from three domains (corp, equi, wind) and is formatted as follows (a quick inspection snippet follows the table):
| sentence_id | words | labels |
| :----: | :---: | :----: |
| 3 | greco | O |
| 3 | is | O |
| 3 | the | O |
| 3 | most | O |
| 3 | inclusive | O |
| 3 | existing | O |
| 3 | anti | O |
| 3 | - | O |
| 3 | corruption | B |
| 3 | monitoring | B |
| 3 | mechanism | I |
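For a quick sanity check of the produced file, assuming the pickle holds a pandas DataFrame with exactly these three columns (an assumption, not something the script guarantees):

```python
import pandas as pd

# Inspect the preprocessed training data and print the example sentence above.
df = pd.read_pickle("./preprocessed_data/train.pkl")
print(df.columns.tolist())            # expected: ['sentence_id', 'words', 'labels']
print(df[df["sentence_id"] == 3])     # tokens and labels of the sentence shown in the table
```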
- Input:
  - ./preprocessed_data/train.pkl
- Command:
  - cd models
  - python format_data.py
- Output:
  - ./preprocessed_data/train.csv
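If format_data.py only reshapes the pickle into the CSV layout that SimpleTransformers expects, the step could be as small as the sketch below (an assumption about the script, shown for orientation only):

```python
import pandas as pd

# Illustrative pkl -> csv conversion; the actual format_data.py may do additional cleaning.
df = pd.read_pickle("./preprocessed_data/train.pkl")
df.to_csv("./preprocessed_data/train.csv", index=False)
```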
The workflow of our implementation:
- Models are trained with SimpleTransformers on the English dataset on Google Colab.
- BERT variants: BERT, RoBERTa, DistilBERT
- XLNet
We tested both monolingual and bilingual pre-trained models and applied pattern filtering as a postprocessing step during inference.
Train/val/test split:
- Train/val: texts from three domains (corp, equi, wind), split with an 80/20 ratio
- Test: heart failure (htfl) texts
Default settings (a training sketch using these values follows the list):
- adam_epsilon: float = 1e-8
- early_stopping_metric: str = "eval_loss"
- early_stopping_patience: int = 3
- eval_batch_size: int = 16
- learning_rate: float = 4e-5
- manual_seed: int = 2203
- max_seq_length: int = 512
- num_train_epochs: int = 4
- optimizer: str = "AdamW"
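A hedged sketch of how these defaults could be wired into a SimpleTransformers training run; the checkpoint (`bert-base-cased`) and the sentence-level 80/20 split logic are illustrative assumptions, not necessarily the exact setup used:

```python
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# Load the formatted training data (sentence_id / words / labels).
df = pd.read_csv("./preprocessed_data/train.csv")

# Sentence-level 80/20 train/validation split, as described above (assumed split logic).
sentence_ids = df["sentence_id"].unique()
cut = int(0.8 * len(sentence_ids))
train_df = df[df["sentence_id"].isin(sentence_ids[:cut])]
eval_df = df[df["sentence_id"].isin(sentence_ids[cut:])]

# Mirror the default settings listed above.
args = NERArgs(
    adam_epsilon=1e-8,
    early_stopping_metric="eval_loss",
    early_stopping_patience=3,
    eval_batch_size=16,
    learning_rate=4e-5,
    manual_seed=2203,
    max_seq_length=512,
    num_train_epochs=4,
    optimizer="AdamW",
)

# "bert-base-cased" is an illustrative checkpoint; RoBERTa, DistilBERT or XLNet variants can be swapped in.
model = NERModel("bert", "bert-base-cased", labels=["B", "I", "O"], args=args)
model.train_model(train_df, eval_data=eval_df)
```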
- Evaluation metrics: the final output is a list of lowercased terms, which is compared with the gold term list to compute Precision, Recall, and F1-score (see the sketch after the formulas).
  - Precision = TP / (TP + FP)
  - Recall = TP / (TP + FN)
  - F1-score = 2 * (Precision * Recall) / (Precision + Recall)
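Because the comparison is between two lowercased term lists, the metrics reduce to set operations; a small sketch (the function name is illustrative):

```python
def term_prf(predicted_terms, gold_terms):
    """Set-based precision / recall / F1 over lowercased term lists."""
    pred, gold = set(predicted_terms), set(gold_terms)
    tp = len(pred & gold)                              # terms both predicted and annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(term_prf(["heart failure", "hazard ratio"], ["heart failure", "ejection fraction"]))
# -> (0.5, 0.5, 0.5)
```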
Why do our models outperform TALN-LS2N on lower-resourced languages such as French and Dutch?
- ACTER was updated from v1.2 to v1.4:
  **Changes version 1.2 > version 1.3**
  * corrected wrong sources in htfl_nl
  * changed heart failure abbreviation to "htfl" to be consistent with four-letter domain abbreviations
  * created GitHub repository for data + submitted it to CLARIN

  **Changes version 1.3 > version 1.4**
  * applied limited normalisation on both texts and annotations:
    * unicodedata.normalize("NFC", text)
    * normalising all dashes to "-", all single quotes to "'" and all double quotes to '"'
- Sequence labeling outperforms n-gram classification.
- Tricky training dataset.
How good are our model results? What can we do to make them better?
... 'candesartan', 'bet inhibition', 'intraclass correlation coefficient', 'cardiopulmonary exercise testing', 'myocardial extracellular matrix accumulation', 'implantable cardioverter defibrillator', ... ' ', '[', '-', '%', ...
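The junk entries at the end of that list are what the pattern-filtering postprocessing step mentioned above is meant to catch; one possible filter, with an assumed rule (keep only candidates containing at least one letter), is sketched below:

```python
import re

# Illustrative postprocessing filter: drop candidates that are bare punctuation,
# whitespace, or symbols such as '%'; the actual filtering pattern may differ.
def filter_terms(candidates):
    return [t for t in candidates if re.search(r"[a-z]", t.lower())]

print(filter_terms(["candesartan", "bet inhibition", " ", "[", "-", "%"]))
# -> ['candesartan', 'bet inhibition']
```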
Is the data good enough?
... 'hazard ratio', 'hazard ratios', ... 'health care', 'health-care', 'heart failure', ... 'heart transplant', 'heart transplantation', 'heart transplantations' ... 'implantable cardioverter defibrillator', 'implantable cardioverter defibrillators', 'implantable cardioverter-defibrillator', 'implantable cardioverter-defibrillators', ... 'rv' "s'"
- Tuning hyperparameters.
- Adding preprocessing and postprocessing steps.
- Experimenting with other languages.