Team members: Yuchen Wang (NYU Shanghai), Zhengye Zhu (Peking University).
This is the implementation of the 4th-place solution to the chaii - Hindi and Tamil Question Answering competition on Kaggle.
Our solution write-up: https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/287911.
Datasets we made (not involved in the final submission): parsed Hindi/Tamil Wikipedia, SQuAD 2.0 translated into Tamil, and a cleaned chaii dataset.
- The environment is the same as the Kaggle Docker image. Install dependencies with `pip install -r requirements.txt`. You will need a single RTX 3090 or A10 and at least 16 GB of memory.
- To leverage zero-shot transferability, finetune RemBERT, InfoXLM, MuRIL, and XLM-R on SQuAD 2.0 with `finetune.py`. An example of finetuning MuRIL:

  ```bash
  python -u finetune.py \
      --model_checkpoint google/muril-large-cased \
      --train_path <path to data>/train-v2.0.json \
      --max_length 512 \
      --doc_stride 128 \
      --epochs 2 \
      --batch_size 4 \
      --accumulation_steps 8 \
      --lr 1e-5 \
      --weight_decay 0.01 \
      --warmup_ratio 0.2 \
      --seed 42 \
      --dropout 0.1
  ```
  Substitute `model_checkpoint` with the corresponding Hugging Face pre-trained checkpoint for the other models. Set `epochs` to 3 for RemBERT, InfoXLM, and XLM-R, leaving the other hyper-parameters unchanged; a RemBERT example follows.
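  For reference, a RemBERT run with the same interface might look like the following. The checkpoint name `google/rembert` (and likewise `microsoft/infoxlm-large` and `xlm-roberta-large` for the other two models) is the standard Hugging Face identifier; every other flag simply mirrors the MuRIL command above.

  ```bash
  # Sketch: same flags as the MuRIL run above, with epochs set to 3.
  python -u finetune.py \
      --model_checkpoint google/rembert \
      --train_path <path to data>/train-v2.0.json \
      --max_length 512 \
      --doc_stride 128 \
      --epochs 3 \
      --batch_size 4 \
      --accumulation_steps 8 \
      --lr 1e-5 \
      --weight_decay 0.01 \
      --warmup_ratio 0.2 \
      --seed 42 \
      --dropout 0.1
  ```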
- As described in our solution write-up, we trained models either with cross-validation or with all data. You can train 5-fold models on the chaii + XQuAD + MLQA dataset with `train-cv.py`, or train on all data with `train-all.py`. Please first download our cleaned data here.
- An example of training 5-fold MuRIL (substitute `model_checkpoint` for the other models):

  ```bash
  python -u train-native-stepeval.py \
      --model_checkpoint google/muril-large-cased \
      --train_path <path to data>/merged0917.csv \
      --max_length 512 \
      --doc_stride 128 \
      --epochs 3 \
      --batch_size 4 \
      --accumulation_steps 1 \
      --lr 1e-5 \
      --optimizer adamw \
      --weight_decay 0.0 \
      --scheduler cosann \
      --warmup_ratio 0.1 \
      --dropout 0.1 \
      --eval_steps 1000 \
      --metric nonzero_jaccard_per \
      --downext \
      --seed 42
  ```
- An example of training MuRIL with all data (substitute `model_checkpoint` for the other models):

  ```bash
  python -u train-useall.py \
      --model_checkpoint google/muril-large-cased \
      --train_path <path to data>/merged0917.csv \
      --max_length 512 \
      --doc_stride 128 \
      --epochs 3 \
      --batch_size 4 \
      --accumulation_steps 1 \
      --lr 1e-5 \
      --weight_decay 0.0 \
      --warmup_ratio 0.1 \
      --seed 42 \
      --dropout 0.1 \
      --downsample 0.5
  ```
- Although we didn't find the translated SQuAD dataset useful, you may try training on SQuAD 2.0 in Tamil with `train-enta.py`; a sketch of a possible invocation follows.
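  We have not listed `train-enta.py`'s actual arguments here. Assuming it shares the argparse interface of `finetune.py` (an assumption, not verified against the script) and using a placeholder file name for the Tamil SQuAD data, a run might look like:

  ```bash
  # Hypothetical invocation: the flags and data file name are assumptions
  # mirroring finetune.py; check train-enta.py's argparse definitions first.
  python -u train-enta.py \
      --model_checkpoint google/muril-large-cased \
      --train_path <path to data>/squad2.0-tamil.json \
      --max_length 512 \
      --doc_stride 128 \
      --epochs 2 \
      --batch_size 4 \
      --accumulation_steps 8 \
      --lr 1e-5 \
      --weight_decay 0.01 \
      --warmup_ratio 0.2 \
      --seed 42 \
      --dropout 0.1
  ```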
- Infer with ensembling and post-processing: https://www.kaggle.com/zacchaeus/chaii-infer-blend-postpro-4models. A rough sketch of the blending idea appears below.
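  The notebook holds the exact ensembling and post-processing code. As a self-contained illustration of the core idea only (the function names, uniform weights, and punctuation set below are our own, not taken from the notebook): average the start/end logits of all models, decode the best span, then trim stray punctuation.

  ```python
  import numpy as np

  def blend_logits(start_logits_list, end_logits_list, weights=None):
      """Weighted average of per-model start/end logits for one example."""
      n = len(start_logits_list)
      if weights is None:
          weights = [1.0 / n] * n  # uniform blend by default
      start = sum(w * s for w, s in zip(weights, start_logits_list))
      end = sum(w * e for w, e in zip(weights, end_logits_list))
      return start, end

  def postprocess(span):
      """Trim whitespace and dangling punctuation from a predicted answer."""
      return span.strip().strip('.,;:!?"\'()।')

  # Toy usage with two "models" over a 4-token sequence:
  starts = [np.array([0.1, 2.0, 0.3, 0.1]), np.array([0.2, 1.5, 0.4, 0.1])]
  ends = [np.array([0.0, 0.1, 1.8, 0.2]), np.array([0.0, 0.2, 1.6, 0.3])]
  s, e = blend_logits(starts, ends)
  print(int(s.argmax()), int(e.argmax()))  # -> 1 2 (blended best span)
  ```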