Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

This repo contains notebooks and other materials used in our NoDaLiDa 2023 paper.

Authors: Ekaterina Artemova and Barbara Plank

Notebooks

Use the scripts from the wikipedia2corpus repository to preprocess German and dialect Wikipedias.
The notebook 02_align_categories aligns page titles from dialect Wikipedias to German Wikipedia using the Wikipedia API (via Wikipedia-API) and segments the Wikipedia pages into sentences.
The notebook 03_compute_sentence_similarity computes pairwise similarity and applies a number of filters to post-process aligned sentences.
The notebook 04_output_sentences computes pairwise similarity and applies a number of filters to post-process aligned sentences.
Run awesome-align to perform word alignment. See 05_run_aligner for the parameters used to run the awesome-align script.
The notebook 06_lexicon_induction outputs the final bilingual lexicons.
The notebook sampling illustrates stratified sampling from a frequency dictionary.
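
As a rough illustration of stratified sampling from a frequency dictionary, the sketch below ranks words by count, cuts the ranking into equally sized bands, and draws the same number of words from each band. It is not the code from the sampling notebook: the function name `stratified_sample`, the parameters `n_bins` and `per_bin`, and the toy dictionary are all assumptions made for this example.

```python
import random


def stratified_sample(freq, n_bins=3, per_bin=2, seed=0):
    """Draw up to `per_bin` words from each frequency band of a {word: count} dict.

    Hypothetical helper for illustration; band boundaries are rank-based.
    """
    rng = random.Random(seed)
    # Rank words from most to least frequent; ties are broken alphabetically
    # so that the banding is deterministic.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    sample = {}
    for b in range(n_bins):
        stratum = ranked[b * bin_size:(b + 1) * bin_size]
        k = min(per_bin, len(stratum))
        sample[b] = rng.sample(stratum, k)
    return sample


# Toy frequency dictionary (made-up placeholder words and counts).
toy_freq = {f"w{i}": count for i, count in enumerate([90, 80, 70, 12, 11, 10, 3, 2, 1])}
picked = stratified_sample(toy_freq, n_bins=3, per_bin=2)
```

Here band 0 holds the most frequent words and band 2 the rarest, so each frequency range is represented equally in the sample regardless of how skewed the counts are.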

Data

The folder labelled_data contains the manually labelled sentence pairs and word pairs.
The folder bli_data contains the extracted bitext and word pairs.
The folder freq contains frequency dictionaries built from Bavarian and Alemannic Wikipedias.
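
To show how bitext and word alignments can be turned into candidate word pairs, here is a minimal sketch. It assumes the bitext uses the awesome-align input format (one tokenized sentence pair per line, `source ||| target`) and that alignments are in the `i-j` index format that awesome-align writes. The function `count_word_pairs` and the example sentence are hypothetical, and the files in bli_data may be laid out differently; this is not necessarily how 06_lexicon_induction builds the lexicons.

```python
from collections import Counter


def count_word_pairs(bitext_lines, alignment_lines):
    """Count (source word, target word) pairs linked by the word alignments."""
    pairs = Counter()
    for sent, align in zip(bitext_lines, alignment_lines):
        # Each bitext line is "source tokens ||| target tokens".
        src_text, tgt_text = sent.split(" ||| ")
        src, tgt = src_text.split(), tgt_text.split()
        # Each alignment line is a space-separated list of "i-j" links,
        # where i indexes the source tokens and j the target tokens.
        for link in align.split():
            i, j = map(int, link.split("-"))
            pairs[(src[i], tgt[j])] += 1
    return pairs


# Made-up example pair (dialect ||| German), for illustration only.
bitext = ["des is guad ||| das ist gut"]
alignments = ["0-0 1-1 2-2"]
word_pairs = count_word_pairs(bitext, alignments)
```

Counting pairs over the whole bitext gives a simple frequency signal that could be thresholded or manually checked before a pair is accepted into a lexicon.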

Cite

@inproceedings{artemova-plank-2023-low,
    title = "Low-resource Bilingual Dialect Lexicon Induction with Large Language Models",
    author = "Artemova, Ekaterina  and Plank, Barbara",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023) (NoDaLiDa)",
    year = "2023",
}
