Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

This repo contains notebooks and other materials used in our NoDaLiDa 2023 paper.

Authors: Ekaterina Artemova and Barbara Plank

Notebooks

  1. Use the scripts from the wikipedia2corpus repository to preprocess German and dialect Wikipedias.

  2. The notebook 02_align_categories aligns page titles from the dialect Wikipedias to German Wikipedia using the Wikipedia-API package and segments the Wikipedia pages into sentences.

  3. The notebook 03_compute_sentence_similarity computes pairwise similarity and applies a number of filters to post-process aligned sentences.

  4. The notebook 04_output_sentences computes pairwise similarity and applies a number of filters to post-process aligned sentences.

  5. Run awesome-align to perform word alignment. See 05_run_aligner for the parameters used to run the awesome-align script.

  6. The notebook 06_lexicon_induction outputs the final bilingual lexicons.

  7. The notebook sampling exemplifies stratified sampling from a frequency dictionary.
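
Steps 5 and 6 can be pictured with a small sketch. This is not the code from 06_lexicon_induction. It only assumes the standard awesome-align formats: one `source ||| target` sentence pair per input line, and one line of `i-j` index pairs per output line. The function name `induce_lexicon`, the `min_count` threshold, the dialect-to-German direction, and the toy sentences are all hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def induce_lexicon(bitext_lines, alignment_lines, min_count=2):
    """Count aligned word pairs and keep those seen at least min_count times.

    Hypothetical helper, not the notebook's implementation.
    """
    pair_counts = Counter()
    for sent_pair, links in zip(bitext_lines, alignment_lines):
        # awesome-align input format: "source tokens ||| target tokens"
        src_text, tgt_text = sent_pair.split(" ||| ")
        src_tokens = src_text.split()
        tgt_tokens = tgt_text.split()
        # awesome-align output format: "0-0 1-2 ..." (source index-target index)
        for link in links.split():
            i, j = map(int, link.split("-"))
            pair_counts[(src_tokens[i].lower(), tgt_tokens[j].lower())] += 1

    # Group surviving pairs by source word, most frequent translation first.
    lexicon = defaultdict(list)
    for (src, tgt), count in pair_counts.most_common():
        if count >= min_count:
            lexicon[src].append((tgt, count))
    return dict(lexicon)


# Made-up toy data (dialect ||| German), not taken from bli_data.
bitext = [
    "i mog di ||| ich mag dich",
    "i mog des ||| ich mag das",
]
alignments = [
    "0-0 1-1 2-2",
    "0-0 1-1 2-2",
]
lexicon = induce_lexicon(bitext, alignments, min_count=2)
print(lexicon)
```

A count threshold is only one possible filter. It trades recall for precision by discarding word pairs that were aligned just once.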

Data

  1. The folder labelled_data contains the manually labelled sentence pairs and word pairs.

  2. The folder bli_data contains the extracted bitext and word pairs.

  3. The folder freq contains frequency dictionaries built from Bavarian and Alemannic Wikipedias.
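
A minimal sketch of stratified sampling from a frequency dictionary, in the spirit of the sampling notebook but not copied from it. It assumes the dictionary maps words to counts. The function name `stratified_sample`, the rank-based equal-size strata, the parameter names, and the toy counts are hypothetical; the file format used in freq is not assumed here.

```python
import random


def stratified_sample(freq, n_bins=3, per_bin=2, seed=0):
    """Sample up to per_bin words from each frequency-rank stratum.

    Hypothetical helper: words are ranked by descending count and cut into
    at most n_bins contiguous strata of (nearly) equal size.
    """
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    if not ranked:
        return []
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    bin_size = -(-len(ranked) // n_bins)  # ceiling division
    strata = []
    for start in range(0, len(ranked), bin_size):
        stratum = ranked[start:start + bin_size]
        strata.append(rng.sample(stratum, min(per_bin, len(stratum))))
    return strata


# Made-up toy counts, not taken from the freq folder.
freq = {
    "und": 900, "des": 750, "is": 600,
    "haus": 80, "bam": 60, "goaß": 40,
    "zwetschgn": 3, "kirta": 2, "oachkatzl": 1,
}
strata = stratified_sample(freq, n_bins=3, per_bin=2)
for label, words in zip(["high", "mid", "low"], strata):
    print(label, words)
```

Sampling per stratum keeps rare words in the sample, whereas uniform sampling over tokens would return mostly high-frequency words.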

Cite

@inproceedings{artemova-plank-2023-low,
    title = "Low-resource Bilingual Dialect Lexicon Induction with Large Language Models",
    author = "Artemova, Ekaterina  and Plank, Barbara",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023)",
    year = "2023",
}