This repository contains the data and code for the paper "Data-Augmentation based Dialectal Adaptation for LLMs": GMUNLP's submission to the Dialect-Copa shared task organized by the VarDial 2024 workshop.
To get started, clone this repository to your local machine using the following commands:

```bash
git clone https://github.com/ffaisal93/dialect_copa.git
cd dialect_copa
```
To install the necessary dependencies, run the following command:

```bash
./install.sh --task vnv_copa
```

This script sets up the required packages and libraries for the `vnv_copa` task.
Before running any experiments, activate the virtual environment by navigating to the `dialect_copa` directory and sourcing the `vnv_copa` environment:

```bash
source vnv/vnv_copa
```
Preprocess the training data to create various data combinations by running the following commands:

```bash
cd scripts
python dataset_preprocess.py
cd ..
```
This operation will create the data settings shown in the table below. The "Original names reported in paper" column corresponds to the names used in Table 4 of the paper, while the "In data directory" column gives the corresponding dataset names in the `data` directory.

| Original names reported in paper | In data directory |
| --- | --- |
| o | all_train |
| otrsl | orgl |
| otrslc | orglc |
| otrsl_{mk-hr-ckm} | orgl_mk_hr_ckm |
| otrsl_{hr-ckm} | orgl_hr_ckm |
| otrsl_{sl-cer} | orgl_sl_cer |
| otrsl_{sr-tor} | orgl_sr_tor |
| otrsl_{mix} | orgl_omix |
| otrslc_{mix} | orglc_omix |
| otrsl_{mix-mk-hr-ckm} | orgl_omix_mk_hr_ckm |
| otrsl_{mix-hr-ckm} | orgl_omix_hr_ckm |
| otrsl_{mix-sr-tor} | orgl_omix_sr_tor |
| otrsl_{mix-sl-cer} | orgl_omix_sl_cer |
| validation dataset | all_val |
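If you want to inspect one of these settings directly, the snippet below is a minimal sketch: it assumes each setting is stored in the Hugging Face `datasets` on-disk format (the same `save_to_disk` layout used for the test split later in this README), and `orgl_omix` is chosen purely as an example:

```python
# Minimal sketch (assumption: each setting is saved via DatasetDict.save_to_disk,
# as the test-data snippet later in this README does).
from datasets import load_from_disk

train_data = load_from_disk("data/orgl_omix")  # e.g., the otrsl_{mix} setting
print(train_data)  # shows the splits and the number of examples per split
```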
Additional Python code used to create the reverse-augmented data (this step has already been performed):
```python
import json
import os

def get_reverse(x):
    # Swap the premise with the correct choice and flip the question type,
    # turning a 'cause' question into an 'effect' question and vice versa.
    label_dict = {0: 'choice1', 1: 'choice2'}
    premise = x[label_dict[x['label']]]
    x[label_dict[x['label']]] = x['premise']
    x['premise'] = premise
    if x['question'] == 'cause':
        x['question'] = 'effect'
    else:
        x['question'] = 'cause'
    return x

DATADIR = "../data"
for f in os.listdir(DATADIR):
    if str(f).startswith('copa') and str(f).endswith("hr-ckm"):
        for f1 in os.listdir(os.path.join(DATADIR, f)):
            if str(f1).startswith('train.jsonl'):
                print(f, f1)
                with open(os.path.join(DATADIR, f, f1), encoding="utf-8") as f2:
                    lines = f2.read().splitlines()
                line_dicts = [json.loads(line) for line in lines]
                reverse_line_dicts = [get_reverse(line.copy()) for line in line_dicts]
                dest_file = os.path.join(DATADIR, f, 'train-reverse.jsonl')
                with open(dest_file, 'w', encoding='utf-8') as output_file:
                    for dic in reverse_line_dicts:
                        json.dump(dic, output_file, ensure_ascii=False)
                        output_file.write("\n")
```
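For illustration, here is what `get_reverse` does to a single instance; the toy example below is hypothetical (and in English), but the field names match the DIALECT-COPA JSONL format used above:

```python
# Hypothetical toy instance, not taken from the actual dataset.
example = {"premise": "The street was wet.",
           "choice1": "It rained.",
           "choice2": "The sun was shining.",
           "question": "cause",
           "label": 0}
print(get_reverse(example.copy()))
# The correct choice ("It rained.") becomes the premise, the old premise
# becomes choice1, the question flips from 'cause' to 'effect',
# and the label still points at choice1.
```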
Translate using the `ctranslate2` package (already performed; not needed to reproduce the experiments):

```bash
cd scripts
python translate.py
cd ..
```
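The actual translation pipeline lives in `scripts/translate.py`; the sketch below only illustrates the general CTranslate2 API, assuming an NLLB model converted with `ct2-transformers-converter` (the model directory, language codes, and example sentence are assumptions, not the script's actual configuration):

```python
import ctranslate2
import transformers

# Assumed paths/names for illustration only.
translator = ctranslate2.Translator("nllb-200-ct2")  # a converted model directory
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="hrv_Latn")

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Dobar dan."))
results = translator.translate_batch([source], target_prefix=[["eng_Latn"]])
target_tokens = results[0].hypotheses[0][1:]  # drop the target-language token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```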
We perform fine-tuning on the `bcms-bertic` model and evaluate the fine-tuned model using the Dialect-Copa validation data located in `data/all_val`. To run the fine-tuning and evaluation, execute the following command:

```bash
./install.sh --task train_predict_all_copa_encoder
```
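For intuition, an encoder like BERTić treats each COPA instance as a two-way multiple-choice problem. The sketch below is illustrative only, assuming the `classla/bcms-bertic` checkpoint and a toy example; the repository's actual training and prediction code is driven by `install.sh`:

```python
from transformers import AutoTokenizer, AutoModelForMultipleChoice

# Assumed checkpoint name; toy example for illustration. The multiple-choice
# head is freshly initialized here, so predictions are arbitrary until fine-tuned.
tokenizer = AutoTokenizer.from_pretrained("classla/bcms-bertic")
model = AutoModelForMultipleChoice.from_pretrained("classla/bcms-bertic")

premise, question = "Ulica je bila mokra.", "cause"
choices = ["Padala je kiša.", "Sunce je sijalo."]

# Encode one (premise+question, choice) pair per choice, then stack them into a
# single example of shape (batch=1, num_choices=2, seq_len).
enc = tokenizer([f"{premise} {question}"] * len(choices), choices,
                return_tensors="pt", padding=True)
logits = model(**{k: v.unsqueeze(0) for k, v in enc.items()}).logits
print(logits.argmax(-1).item())  # index of the predicted choice
```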
If you have access to the test data and want to use it instead of the validation data, follow these steps:
- First, prepare the test-data `DatasetDict` using the following Python snippet:
```python
import os
import json
import pandas as pd
from datasets import Dataset, DatasetDict, load_from_disk

def make_all_test_data(DATADIR):
    # Collect every copa-* test split into one DatasetDict keyed by dataset name.
    datasets_dict = {}
    for f in os.listdir(DATADIR):
        if str(f).startswith('copa'):
            for f1 in os.listdir(os.path.join(DATADIR, f)):
                if str(f1).startswith('test'):
                    print(f, f1)
                    with open(os.path.join(DATADIR, f, f1), encoding="utf-8") as f2:
                        lines = f2.read().splitlines()
                    line_dicts = [json.loads(line) for line in lines]
                    datasets_dict[str(f)] = Dataset.from_pandas(pd.DataFrame(data=line_dicts))
    return DatasetDict(datasets_dict)

DATADIR = '../data/dialect-copa-test'
test_data = make_all_test_data(DATADIR)
test_data.save_to_disk(os.path.join(DATADIR, 'all_test'))
test_data = load_from_disk(os.path.join(DATADIR, 'all_test'))
```
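After this runs, `data/dialect-copa-test/all_test` holds one split per `copa-*` test file, and the evaluation scripts can reload it with `load_from_disk`.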
- Then, change the `TEST_FILE` variable in the `install.sh` file. Locate the line `TEST_FILE="data/all_val"` within the `if [[ "$task" = "train_predict_all_copa_encoder" || "$task" = "all" ]]; then` block and replace it with `TEST_FILE="data/dialect-copa-test/all_test"`.
To perform LoRA fine-tuning on the Aya-101 model, run the following command:

```bash
./install.sh --task train_all_copa_aya_test
```
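For reference, LoRA fine-tuning of a seq2seq model with the `peft` library generally follows the pattern sketched below; the rank, alpha, dropout, and target modules shown are illustrative assumptions, not the hyperparameters used by the training script:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Assumed base checkpoint and illustrative hyperparameters.
base = AutoModelForSeq2SeqLM.from_pretrained("CohereForAI/aya-101")
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM,
                    r=8, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["q", "v"])  # T5-style attention projections
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```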
Modify line 25 in the `test_aya_copa.py` script to specify the location of the LoRA-tuned model:

```python
peft_model_id = "output_models/lora/orgl_mk_hr_ckm_test"
```
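Loading the adapter back on top of the base model follows the standard `peft` pattern; a minimal sketch assuming the `CohereForAI/aya-101` base checkpoint (the actual evaluation logic lives in `test_aya_copa.py`):

```python
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed base checkpoint; the adapter path matches the line shown above.
base = AutoModelForSeq2SeqLM.from_pretrained("CohereForAI/aya-101")
model = PeftModel.from_pretrained(base, "output_models/lora/orgl_mk_hr_ckm_test")
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-101")
```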
Run the evaluation script:

```bash
mkdir results
python scripts/test_aya_copa.py
```
To evaluate all available test and validation data using the base Aya-101 model, the LoRA-tuned models, and the fine-tuned Bertic models, run the following command:

```bash
./install.sh --task run_copa_test_all
```

Note: you need access to the test data and must create the `all_test` file beforehand. Refer to the `if [[ "$task" = "run_copa_test_all" || "$task" = "all" ]]; then` block in the `install.sh` file for the complete command.
If you use the data or code from this repository in your publication or study, please cite the following dataset and code papers:
```bibtex
@inproceedings{faisal-anastasopoulos-2024-augmentation,
    title = "Data-Augmentation based Dialectal Adaptation for LLMs",
    author = {Faisal, Fahim and Anastasopoulos, Antonios},
    booktitle = "Eleventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2024)",
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{chifu-etal-2024-vardial,
    title = "Var{D}ial Evaluation Campaign 2024: Commonsense Reasoning in Dialects and Multi-Label Similar Language Identification",
    author = {Chifu, Adrian and Glava\v{s}, Goran and Ionescu, Radu and Ljube\v{s}i\'{c}, Nikola and Mileti\'{c}, Aleksandra and Mileti\'{c}, Filip and Scherrer, Yves and Vuli\'{c}, Ivan},
    editor = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
    booktitle = "Eleventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2024)",
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
}

@inproceedings{ljubesic-etal-2024-dialect,
    title = "{DIALECT-COPA}: Extending the Standard Translations of the {COPA} Causal Commonsense Reasoning Dataset to South Slavic Dialects",
    author = {Ljube\v{s}i\'{c}, Nikola and Galant, Nada and Ben\v{c}ina, Sonja and \v{C}ibej, Jaka and Milosavljevi\'{c}, Stefan and Rupnik, Peter and Kuzman, Taja},
    editor = {Scherrer, Yves and Jauhiainen, Tommi and Ljube{\v{s}}i{\'c}, Nikola and Nakov, Preslav and Tiedemann, J{\"o}rg and Zampieri, Marcos},
    booktitle = "Eleventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2024)",
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
}
```