GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages

Code for GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages, published at SIGTYP 2024.

Our presentation

(Click to open the YouTube video) (11:26 min)

Overview

Existing (black) semantic domain dictionary entries in FLEx and correct (green) and incorrect (red) new predictions: The upper image shows three entries that GUIDE added to the English dictionary and the lower image shows seven entries for the same semantic domain question in the newly created Mina-Gen dictionary.

Requirements

You need the Conda package manager to install the requirements. Furthermore, you need Git LFS to setup the repository. To install the requirements:

chmod +x setup.sh
./setup.sh

This setup has been tested on an ASUS machine ESC8000 G4 with Ubuntu 22.04.

Preprocessing Pipeline

You can skip the preprocessing and directly start to train the model by using the prepared file final_mag.cpickle. (MAG stands for "Multilingual Alignment Graph".) If you want to reproduce the preprocessing, run:

conda activate guide_env
python -m src.preprocess --output-directory data/0_state/ && python -m src.gnn.refine_mag --input-mag-directory data/0_state/ --output-mag-file final_mag.cpickle

Note that the preprocessing does not include the Igbo and Gen-Mina languages because the source Bible translations are copyrighted.

Training

To train GUIDE, run this command:

conda activate guide_env
CUDA_VISIBLE_DEVICES=0 python -m src.gnn.train --input-mag-file final_mag.cpickle --output-model-file my_trained_model.bin --output-data-split-file my_data_split.bin

Evaluation

To evaluate GUIDE, run:

conda activate guide_env
CUDA_VISIBLE_DEVICES=0 python -m src.gnn.eval --input-mag-file final_mag.cpickle --input-model-file my_trained_model.bin --input-data-split-file my_data_split.bin --output-results-file my_results.json

The data split file and model file will be created during training.

Pre-trained Model

model.bin contains the pretrained model, trained with a batch size of 6,000, a learning rate of 0.05, and early stopping patience after 5 epochs with a warmup of 30 epochs.

Results

Our work establishes a new benchmark for linking words to their semantic domain questions. To the best of our knowledge, we propose the first automated approach to address this task.

Evaluation/Language	Precision	Recall	F1	Manual Precision	# Predicted links
	Evaluation with dataset			Manual
Random baseline	0.00	0.500	0.000	n/a	741,033,563

DEVELOPMENT
Bengali	0.22 ± 0.11	0.002 ± 0.001	0.004 ± 0.003	0.56	2,809 (2,770)
Chinese (simplified)	0.17 ± 0.02	0.014 ± 0.002	0.026 ± 0.004	0.34	5,752 (5,036)
English	0.63 ± 0.02	0.125 ± 0.006	0.208 ± 0.009	0.86	7,119 (2,314)
French	0.59 ± 0.03	0.097 ± 0.005	0.167 ± 0.008	0.78	6,993 (2,527)
Hindi	0.25 ± 0.02	0.029 ± 0.003	0.051 ± 0.006	0.78	3,914 (2,835)
Indonesian	0.34 ± 0.05	0.035 ± 0.005	0.064 ± 0.009	0.77	1,799 (1,068)
Kupang Malay	0.14 ± 0.05	0.013 ± 0.005	0.024 ± 0.009	0.79	1,440 (1,351)
Malayalam	0.10 ± 0.03	0.015 ± 0.004	0.026 ± 0.007	0.43	2,768 (2,480)
Nepali	0.20 ± 0.01	0.022 ± 0.002	0.039 ± 0.004	0.38	2,641 (2,156)
Portuguese	0.43 ± 0.02	0.088 ± 0.006	0.146 ± 0.009	0.86	6,759 (3,737)
Spanish	0.59 ± 0.02	0.090 ± 0.005	0.155 ± 0.008	0.84	7,614 (3,579)
Swahili	0.33 ± 0.04	0.018 ± 0.003	0.033 ± 0.005	0.75	2,320 (2,020)

TEST
German	n/a	n/a	n/a	0.67	5,022
Hiri Motu	n/a	n/a	n/a	0.62	1,190
Igbo	n/a	n/a	n/a	0.45	1,405
Mina-Gen	n/a	n/a	n/a	0.80	3,063
Motu	n/a	n/a	n/a	0.32	2,731
South Azerbaijani	n/a	n/a	n/a	0.58	2,238
Tok Pisin	n/a	n/a	n/a	0.69	880
Yoruba	n/a	n/a	n/a	0.63	2,637

AVERAGES
Development set	0.33 ± 0.04	0.046 ± 0.004	0.079 ± 0.007	0.68 ± 0.19	4,327 ± 2,338
Test set	n/a	n/a	n/a	0.60 ± 0.15	2,396 ± 1,324
Stanza	0.43 ± 0.02	0.068 ± 0.005	0.117 ± 0.008	0.74 ± 0.17	5,622 ± 1,975
SentencePiece	0.21 ± 0.05	0.014 ± 0.003	0.026 ± 0.005	0.53 ± 0.13	2,364 ± 524
Punctuation mark split	0.14 ± 0.05	0.013 ± 0.005	0.024 ± 0.009	0.64 ± 0.18	1,990 ± 927
Total	0.33 ± 0.04	0.046 ± 0.004	0.079 ± 0.007	0.65 ± 0.18	3,555 ± 2,180

Contributing

If you would like to contribute, find bugs, or have any suggestions for this project, you can contact me at jonathan.janetzki@student.hpi.de or open an issue on this GitHub repository.

All contributions are welcome. All content in this repository is licensed under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 76 Commits
data		data
results		results
src		src
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
align_bibles.sh		align_bibles.sh
environment.yml		environment.yml
final_mag.cpickle		final_mag.cpickle
model.bin		model.bin
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages

Our presentation

Overview

Requirements

Preprocessing Pipeline

Training

Evaluation

Pre-trained Model

Results

Contributing

About

Releases

Packages

Languages

License

janetzki/GUIDE

Folders and files

Latest commit

History

Repository files navigation

GUIDE: Creating Semantic Domain Dictionaries for Low-Resource Languages

Our presentation

Overview

Requirements

Preprocessing Pipeline

Training

Evaluation

Pre-trained Model

Results

Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages