mainlp/dialect-BLI

Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

This repo contains notebooks and other materials used in our NoDaLiDa 2023 paper.

Authors: Ekaterina Artemova and Barbara Plank

Notebooks

  1. Use the scripts from the wikipedia2corpus repository to preprocess German and dialect Wikipedias.

  2. The notebook 02_align_categories aligns page titles from the dialect Wikipedias to German Wikipedia using the Wikipedia-API package and segments the Wikipedia pages into sentences.

  3. The notebook 03_compute_sentence_similarity computes pairwise sentence similarity and applies a number of filters to post-process the aligned sentences.

  4. The notebook 04_output_sentences computes pairwise similarity and applies a number of filters to post-process aligned sentences.

  5. Run awesome-align to perform word alignment. See 05_run_aligner for the parameters used to run the awesome-align script.

  6. The notebook 06_lexicon_induction outputs the final bilingual lexicons.

  7. The notebook sampling exemplifies stratified sampling from a frequency dictionary.
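
The stratified sampling in step 7 can be sketched as follows. This is a minimal illustration, not the code from the sampling notebook: it assumes a frequency dictionary is a plain mapping from word to count, splits the frequency ranking into contiguous bands of near-equal size, and samples uniformly within each band. The function name `stratified_sample`, its parameters, and the toy counts are hypothetical.

```python
import random

def stratified_sample(freq, n_bins=3, per_bin=2, seed=0):
    """Sample up to `per_bin` words from each of `n_bins` frequency bands."""
    rng = random.Random(seed)
    # Rank words from most to least frequent (ties broken alphabetically).
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    # Split the ranking into n_bins contiguous bands of near-equal size.
    size, rem = divmod(len(ranked), n_bins)
    bands, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < rem else 0)
        bands.append(ranked[start:end])
        start = end
    # Draw without replacement inside each band.
    return [rng.sample(band, min(per_bin, len(band))) for band in bands]

# Toy counts, invented for illustration only (not taken from the freq folder).
toy_freq = {"und": 90, "i": 80, "ned": 40, "aa": 35,
            "heid": 7, "gwesn": 5, "dahoam": 2}
sample = stratified_sample(toy_freq, n_bins=3, per_bin=2)
```

Sampling per band rather than from the whole vocabulary keeps rare words in the sample, which would otherwise be dominated by a few high-frequency items.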

Data

  1. The folder labelled_data contains the manually labelled sentence pairs and word pairs.

  2. The folder bli_data contains the extracted bitext and word pairs.

  3. The folder freq contains frequency dictionaries built from Bavarian and Alemannic Wikipedias.
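
To make the relation between bitext and word pairs concrete, here is a small sketch of how word pairs could be derived from word-aligned bitext. It assumes the format awesome-align works with: one sentence pair per line written as `source tokens ||| target tokens`, and alignments in Pharaoh `i-j` format. Whether the files in bli_data follow this exact layout is an assumption, and the most-frequent-link heuristic below is a simple stand-in that may differ from what 06_lexicon_induction does. The function names and the toy sentences are hypothetical.

```python
from collections import Counter

def count_word_pairs(bitext_lines, alignment_lines):
    """Count aligned (source, target) word pairs over a parallel corpus."""
    counts = Counter()
    for sent, align in zip(bitext_lines, alignment_lines):
        # Each bitext line holds both sides, separated by "|||".
        src, tgt = (side.split() for side in sent.split("|||"))
        # Each alignment link "i-j" pairs source token i with target token j.
        for link in align.split():
            i, j = map(int, link.split("-"))
            counts[(src[i], tgt[j])] += 1
    return counts

def induce_lexicon(counts, min_count=1):
    """Keep, for each source word, its most frequently aligned target word."""
    best = {}
    for (s, t), c in counts.items():
        if c >= min_count and c > best.get(s, ("", 0))[1]:
            best[s] = (t, c)
    return {s: t for s, (t, _) in best.items()}

# Toy dialect-German sentence pairs and alignments, invented for illustration.
bitext = ["i bin dahoam ||| ich bin zuhause",
          "i kimm heid ||| ich komme heute"]
alignments = ["0-0 1-1 2-2", "0-0 1-1 2-2"]

counts = count_word_pairs(bitext, alignments)
lexicon = induce_lexicon(counts)
```

Raising `min_count` discards word pairs that were aligned only rarely, trading coverage for precision.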

Cite

@inproceedings{artemova-plank-2023-low,
    title = "Low-resource Bilingual Dialect Lexicon Induction with Large Language Models",
    author = "Artemova, Ekaterina  and Plank, Barbara",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023) (NoDaLiDa)",
    year = "2023",
}
