This repo contains notebooks and other materials used in our NoDaLiDa 2023 paper.
Authors: Ekaterina Artemova and Barbara Plank
- Use the scripts from the wikipedia2corpus repository to preprocess the German and dialect Wikipedias.
- The notebook 02_align_categories aligns page titles from the dialect Wikipedias to the German Wikipedia using the Wikipedia API (via the Wikipedia-API package) and segments the Wikipedia pages into sentences.
- The notebook 03_compute_sentence_similarity computes pairwise similarity and applies a number of filters to post-process aligned sentences.
- The notebook 04_output_sentences computes pairwise similarity and applies a number of filters to post-process aligned sentences.
- Run awesome-align to compute word alignments. See 05_run_aligner for the parameters used to run the awesome-align script.
- The notebook 06_lexicon_induction outputs the final bilingual lexicons.
- The notebook sampling exemplifies stratified sampling from a frequency dictionary.
- The folder labelled_data contains the manually labelled sentence pairs and word pairs.
- The folder bli_data contains the extracted bitext and word pairs.
- The folder freq contains frequency dictionaries built from the Bavarian and Alemannic Wikipedias.
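
For orientation, the hand-off between the word-alignment step and lexicon induction can be sketched as follows. This is a minimal illustration, not the code in 05_run_aligner or 06_lexicon_induction: it assumes the usual awesome-align conventions (one tokenized sentence pair per input line, separated by `|||`, and Pharaoh-style `i-j` links in the output). The function names, toy tokens, and hand-written alignment lines are hypothetical placeholders.

```python
from collections import Counter


def to_awesome_align_line(src_tokens, tgt_tokens):
    """Format one tokenized sentence pair as `source ||| target` (hypothetical helper)."""
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)


def count_aligned_pairs(sentence_pairs, alignment_lines):
    """Count (source word, target word) pairs from Pharaoh-style `i-j` links.

    Assumption: alignment_lines[k] holds the links for sentence_pairs[k].
    """
    counts = Counter()
    for (src, tgt), line in zip(sentence_pairs, alignment_lines):
        for link in line.split():
            i, j = map(int, link.split("-"))
            counts[(src[i], tgt[j])] += 1
    return counts


# Toy placeholder data, not taken from the repository.
pairs = [(["a", "b"], ["x", "y"]), (["a", "c"], ["x", "z"])]
# Hand-written links standing in for an aligner's output file.
alignments = ["0-0 1-1", "0-0 1-1"]

input_lines = [to_awesome_align_line(s, t) for s, t in pairs]
pair_counts = count_aligned_pairs(pairs, alignments)
print(input_lines)
print(pair_counts.most_common())
```

Counting how often two words are linked is only one simple way to turn alignments into lexicon candidates; the notebooks may filter or rank the pairs differently.
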
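
As a rough illustration of what stratified sampling from a frequency dictionary can look like, the sketch below splits words into equal-sized frequency bands by rank and draws the same number of words from each band. This is one plausible reading, not necessarily the procedure in the sampling notebook; `stratified_sample`, its parameters, and the toy dictionary are hypothetical.

```python
import random


def stratified_sample(freq, n_strata=3, per_stratum=2, seed=0):
    """Sample words from rank-based frequency bands (hypothetical helper).

    freq maps word -> count. Words are sorted by descending count, cut into
    n_strata contiguous bands of near-equal size, and up to per_stratum
    words are drawn from each band without replacement.
    """
    rng = random.Random(seed)
    # Ties are broken alphabetically so the banding is deterministic.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    size, remainder = divmod(len(ranked), n_strata)
    samples, start = [], 0
    for i in range(n_strata):
        # The first `remainder` bands take one extra word.
        end = start + size + (1 if i < remainder else 0)
        band = ranked[start:end]
        samples.append(rng.sample(band, min(per_stratum, len(band))))
        start = end
    return samples


# Toy placeholder dictionary, not one of the files in the freq folder.
toy_freq = {f"w{i}": 100 - 10 * i for i in range(1, 10)}
bands = stratified_sample(toy_freq, n_strata=3, per_stratum=2, seed=0)
for name, words in zip(["high", "mid", "low"], bands):
    print(name, words)
```

Sampling per band keeps rare words in the sample, which a uniform draw over tokens would mostly miss.
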
@inproceedings{artemova-plank-2023-low,
title = "Low-resource Bilingual Dialect Lexicon Induction with Large Language Models",
author = "Artemova, Ekaterina and Plank, Barbara",
booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa 2023) (NoDaLiDa)",
year = "2023",
}