Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

This repo contains notebooks and other materials used in our NoDaLiDa 2023 paper.

Authors: Ekaterina Artemova and Barbara Plank

Notebooks

Use the scripts from the wikipedia2corpus repository to preprocess German and dialect Wikipedias.
The notebook 02_align_categories aligns page titles from dialect Wikipedias to German Wikipedia using the Wikipedia API (via Wikipedia-API) and segments the Wikipedia pages into sentences.
The notebook 03_compute_sentence_similarity computes pairwise similarity and applies a number of filters to post-process aligned sentences.
The notebook 04_output_sentences computes pairwise similarity and applies a number of filters to post-process aligned sentences.
Run awesome-align to perform word alignment. See 05_run_aligner for the parameters used to run the awesome-align script.
The notebook 06_lexicon_induction outputs the final bilingual lexicons.
The notebook sampling exemplifies stratified sampling from a frequency dictionary.
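
As a rough illustration of the last step, stratified sampling from a frequency dictionary can be sketched as follows. This is a minimal sketch and not the code from the sampling notebook: the function name stratified_sample, the band boundaries, the toy word counts and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(freq, boundaries, per_stratum, seed=0):
    """Draw up to `per_stratum` words from each frequency band.

    `boundaries` is a sorted list of lower bounds (hypothetical values):
    [1, 10, 100] defines the bands [1, 10), [10, 100) and [100, inf).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for word, count in freq.items():
        # Assign the word to the highest band whose lower bound it reaches.
        band = None
        for lower in boundaries:
            if count >= lower:
                band = lower
        if band is not None:
            strata[band].append(word)

    sample = {}
    for band in boundaries:
        words = sorted(strata[band])  # sorted so the seed gives a stable result
        k = min(per_stratum, len(words))
        sample[band] = rng.sample(words, k)
    return sample


# Toy frequency dictionary (made-up counts, for illustration only).
freq = {
    "und": 950, "des": 720,
    "Haus": 85, "Bua": 40, "Madl": 12,
    "Gwand": 7, "Kraxn": 2, "Zwutschkerl": 1,
}
sample = stratified_sample(freq, boundaries=[1, 10, 100], per_stratum=2)
for band, words in sample.items():
    print(band, words)
```

Sampling per band rather than uniformly over the whole vocabulary keeps rare words from dominating the sample, since most entries in a frequency dictionary have very low counts.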

Data

The folder labelled_data contains the manually labelled sentence pairs and word pairs.
The folder bli_data contains the extracted bitext and word pairs.
The folder freq contains frequency dictionaries built from Bavarian and Alemannic Wikipedias.
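
For orientation, a frequency dictionary of this kind can be built by counting tokens over a sentence-segmented corpus. The sketch below is an assumption-laden illustration, not the procedure used to produce the files in freq: the function build_freq_dict, the regex tokeniser, the lowercasing step and the example sentences are all hypothetical, and the on-disk format of the files in freq is not described here.

```python
import re
from collections import Counter

# Simple word tokeniser; \w matches Unicode letters, so umlauts are kept.
TOKEN_RE = re.compile(r"\w+")


def build_freq_dict(sentences, lowercase=True):
    """Count token occurrences over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = TOKEN_RE.findall(sentence)
        if lowercase:
            tokens = [t.lower() for t in tokens]
        counts.update(tokens)
    return counts


# Made-up sentences, for illustration only.
sentences = ["I mog di.", "I mog des Haus."]
freq = build_freq_dict(sentences)
for word, count in freq.most_common():
    print(word, count)
```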

Cite

@inproceedings{artemova-plank-2023-low,
    title = "Low-resource Bilingual Dialect Lexicon Induction with Large Language Models",
    author = "Artemova, Ekaterina  and Plank, Barbara",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023) (NoDaLiDa)",
    year = "2023",
}
