This repository contains the code and submodules used for my Master's Thesis, Analyzing Slot and Intent Detection for Upper German Dialects via Zero-Shot Transfer Learning, as well as for our related paper, Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study, which builds on the thesis and was co-authored with Verena Blaschke and Prof. Barbara Plank.
The name BaySIDshot is a portmanteau rooted in the creation of a new Bavarian test and validation set, depicting the dialect spoken in the Munich region, built to further analyze zero-shot transfer learning performance on slot and intent detection (SID) for this and other Upper German dialects. Both the de-by (de-muc in the paper!) test and validation .conll files can be found at the root of this directory and are featured in the latest version of xSID. This work thus presents a parallel extension to the Upper Bavarian dataset translated and annotated by Winkler et al. (2024), similarly building on and extending the xSID approach and data format initiated by van der Goot et al. (2021b). Running the baseline and extended experiments requires recursively cloning all submodules, especially MaChAmp by van der Goot et al. (2021a), which can be done with the following command:
git clone https://github.com/XaverKrueckl/BaySIDshot.git --recurse-submodules
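If the repository was already cloned without the --recurse-submodules flag, the submodules can still be fetched afterwards with a standard git command:

```bash
# initialize and fetch all submodules (MaChAmp, data repositories, ...) in an existing clone
git submodule update --init --recursive
```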
As the main approach to analyzing and enhancing zero-shot SID transfer learning to Upper German dialects, auxiliary task training is performed on four different task types from three Bavarian target-language datasets. Concretely, these are a Bavarian Universal Dependencies (UD) set by Blaschke et al. (2024), a Bavarian Named Entity Recognition (NER) set by Peng et al. (2024), and Masked Language Modeling (MLM) data taken as preprocessed sentences from Artemova and Plank (2023). While the UD and MLM data each consist of one overall set, the NER data is split into two subsets, referred to as Wiki and Twitter according to their source. In the experiments below, MaChAmp automatically merges both into one dataset, as they feature the same task type in the configuration files (see the configuration sketch below). Note that the Twitter data is only available for scientific research by contacting the authors. Please always cite the respective data source if used in your work!
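To illustrate how this merging works, here is a minimal sketch of what such a MaChAmp dataset configuration can look like; the dataset names, file paths, and column indices below are illustrative assumptions, not the exact configuration files shipped with this repository:

```json
{
    "barner_wiki": {
        "train_data_path": "data/barner/wiki.train",
        "dev_data_path": "data/barner/wiki.dev",
        "word_idx": 0,
        "tasks": {
            "ner": {"task_type": "seq", "column_idx": 1}
        }
    },
    "barner_twitter": {
        "train_data_path": "data/barner/twitter.train",
        "dev_data_path": "data/barner/twitter.dev",
        "word_idx": 0,
        "tasks": {
            "ner": {"task_type": "seq", "column_idx": 1}
        }
    }
}
```

Because both entries define a task with the same name and task type, MaChAmp trains a single ner decoder on the union of the two subsets.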
In order to recreate both the baseline and the extended experiment results, the respective notebooks have to be run in Google Colaboratory. A Pro subscription to the cloud-based service is recommended for the baseline experiments and required for the extended ones, which need larger and stable GPU resources. Similarly, a Google account with access to Google Drive is suggested in order to save models and outputs outside of the runtime environment.
Starting a notebook mounts Google Drive and then clones this repository recursively. If required, data preparation scripts from the respective scripts directory are run. The created datasets, whose paths are set accordingly in the MaChAmp configurations, are only present during runtime. If their creation fails, the openly available data is provided in a manual data directory. To use it, the data has to be preprocessed, and the paths in the configuration files need to be adjusted to point to this directory.
After installing the required modules for MaChAmp, the notebook checks for GPU access and general operability. For each experiment, the respective configuration and parameter files are inspected before the train command is started. After the fine-tuning process has finished, each resulting model in the log files is saved to Google Drive, carrying the experiment name. Before prediction with the final model starts, the respective evaluation data is prepared via scripts, but it is also available as gold files in the manual data. In a rather complex evaluation cell, a script is assembled that evaluates the final model, which only needs to be loaded once, on each file from the prepared evaluation set. The predicted output files are saved to the respective model's directory on Google Drive. Similarly, a separate evaluation script is run to collect the results in a clear .json file containing three objects, depending on the extent of the evaluation set. After an experiment has been run on multiple random seeds, further scripts can be used to average over these runs, to turn the averages into a .csv document for use in a LaTeX table generator, and to produce confusion matrices for the intent classification results. In their current state, the notebooks run all experiments with random_seed=1234. The two further seeds that were used are 6543 and 8446. These need to be set for each experiment and in the respective model names!
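For orientation, the training and prediction calls assembled by the notebooks follow MaChAmp's standard command-line interface, roughly as sketched below; the configuration file names and the experiment name are illustrative placeholders, and the exact flags may differ slightly between MaChAmp versions:

```bash
# fine-tune on SID plus an auxiliary task (config file names are placeholders)
python3 train.py \
    --dataset_configs configs/sid_aux.json \
    --parameters_config configs/params.json \
    --name experiment_name_seed1234 \
    --device 0

# predict with the final model on one evaluation file
python3 predict.py \
    logs/experiment_name_seed1234/<date>/model.tar.gz \
    data/de-by.test.conll \
    predictions/de-by.test.out \
    --device 0
```

To reproduce the runs with seeds 6543 and 8446, the random_seed value in the parameters file and the experiment name have to be adjusted accordingly.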
Finally, the TeX and BibTeX files as well as the style guide, figures, and the final thesis PDF produced using Overleaf can be found here.
When using this work or the data utilized in it, please cite the respective papers! For questions and for access to the unpublished NER data, please contact me!
Cheers, Xaver