This is the implementation of the paper AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning.
- Overview
- GLUE Benchmark
- Checkpoints
- Run the model
- Notes and Acknowledgments
- Contact Information
- Citation
Our experiments on the GLUE benchmark are run on 16 NVIDIA Tesla V100 GPUs. The results may vary due to different GPU models, drivers, CUDA SDK versions, floating-point precisions, and random seeds.
We release all adapter weight checkpoints so that users can study adapter aggregation.
| Dataset | BERT-base (110M) | RoBERTa-large (355M) |
|---|---|---|
| MNLI | 8.5 MB | 11.7 MB |
| SST2 | 8.5 MB | 11.7 MB |
| MRPC | 8.5 MB | 11.7 MB |
| CoLA | 8.5 MB | 11.7 MB |
| QNLI | 8.5 MB | 11.7 MB |
| QQP | 8.5 MB | 11.7 MB |
| RTE | 8.5 MB | 11.7 MB |
| STSB | 8.5 MB | 11.7 MB |
conda env create -f environment.yml
pip install -e .
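As a quick sanity check of the installation (a sketch; it only assumes that the environment created from environment.yml provides PyTorch and that `pip install -e .` installed the bundled transformers fork):

```python
# verify that the core dependencies resolve and report whether a GPU is visible
import torch
import transformers

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```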
We also provide shell scripts for bert-base and roberta-large; the example below fine-tunes roberta-large on MNLI.
export num_gpus=1
export PYTHONHASHSEED=0
task_name=mnli
model=roberta-large
export output_dir="./models/${model}/${task_name}"
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
examples/text-classification/run_glue.py \
--model_name_or_path $model \
--task_name $task_name \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--learning_rate 3e-4 \
--num_train_epochs 20 \
--output_dir $output_dir/model \
--overwrite_output_dir \
--logging_steps 1000 \
--logging_dir $output_dir/log \
--evaluation_strategy epoch \
--save_strategy epoch \
--warmup_ratio 0.06 \
--apply_expert_soup \
--adapter_size 16 \
--num_experts 4 \
--seed 0 \
--inference_level 3 \
--weight_decay 0.1 \
--sharing_up 1 \
--sharing_down 0 \
--use_consistency_loss 1
Most arguments are inherited from transformers and are easy to understand. We further explain some of AdaMix's arguments:
- `inference_level`: There are two suggested modes. `1`: random routing. `3`: averaging the weights of the adapters for routing (used in AdaMix; see the sketch after this list).
- `num_experts`: Number of adapters in AdaMix.
- `use_consistency_loss`: Two modes. `0`: no consistency loss. `1`: use consistency loss.
- `sharing_up`: There are two modes (`sharing_down` works the same way). `0`: no weight sharing. `1`: share the project-up layer weights in the adapters.
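To make these options concrete, here is a minimal PyTorch sketch of the mixture-of-adaptations idea (an illustration, not the repository's actual module; the class name `MixtureAdapter` and its exact layout are assumptions): one randomly chosen adapter is used per forward pass during training, and the adapter weights are averaged into a single adapter at inference, which is what `inference_level 3` refers to.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureAdapter(nn.Module):
    """Illustrative mixture-of-adaptations layer (not the repository's implementation)."""

    def __init__(self, hidden_size: int, adapter_size: int = 16, num_experts: int = 4):
        super().__init__()
        # one project-down / project-up pair per adapter ("expert");
        # with sharing_up=1 the repository shares the project-up weights instead,
        # this sketch keeps all weights separate for clarity
        self.down = nn.ModuleList([nn.Linear(hidden_size, adapter_size) for _ in range(num_experts)])
        self.up = nn.ModuleList([nn.Linear(adapter_size, hidden_size) for _ in range(num_experts)])
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # random routing: route the batch through one randomly chosen adapter
            i = random.randrange(len(self.down))
            return x + self.up[i](self.act(self.down[i](x)))
        # inference_level 3: average the adapter weights, then apply the merged adapter once
        down_w = torch.stack([m.weight for m in self.down]).mean(dim=0)
        down_b = torch.stack([m.bias for m in self.down]).mean(dim=0)
        up_w = torch.stack([m.weight for m in self.up]).mean(dim=0)
        up_b = torch.stack([m.bias for m in self.up]).mean(dim=0)
        h = self.act(F.linear(x, down_w, down_b))
        return x + F.linear(h, up_w, up_b)


# toy usage: training-mode routing vs. merged inference
adapter = MixtureAdapter(hidden_size=1024, adapter_size=16, num_experts=4)
x = torch.randn(2, 8, 1024)
adapter.train(); _ = adapter(x)   # one randomly selected adapter
adapter.eval();  _ = adapter(x)   # single weight-averaged adapter
```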
Create a `checkpoints` directory and download the checkpoints of the corresponding tasks into it. We use MNLI as an example. Pass your checkpoint path via the `expert_soup_path` argument.
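As an optional check (a sketch, not part of the repository; the file name is taken from the `--expert_soup_path` value in the evaluation command below, and we assume the `pytorch_model_*_expert_soup.bin` file is a standard `torch.save` state dict, as its naming suggests), the downloaded checkpoint can be inspected before running evaluation:

```python
# load the downloaded adapter checkpoint and list a few of its tensors
import torch

state_dict = torch.load(
    "./checkpoints/pytorch_model_mnli_expert_soup.bin", map_location="cpu"
)
print(f"{len(state_dict)} tensors")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```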
export num_gpus=1
export PYTHONHASHSEED=0
task_name=mnli
model=roberta-large
export output_dir="./models/${model}/${task_name}"
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
examples/text-classification/run_glue.py \
--model_name_or_path $model \
--task_name $task_name \
--do_eval \
--expert_soup_path ./checkpoints/pytorch_model_${task_name}_expert_soup.bin \
--max_seq_length 128 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 32 \
--learning_rate 3e-4 \
--num_train_epochs 20 \
--output_dir $output_dir/model \
--overwrite_output_dir \
--logging_steps 1000 \
--logging_dir $output_dir/log \
--evaluation_strategy epoch \
--save_strategy epoch \
--warmup_ratio 0.06 \
--apply_expert_soup \
--adapter_size 16 \
--num_experts 4 \
--seed 0 \
--inference_level 3 \
--weight_decay 0.1 \
--sharing_up 1 \
--sharing_down 0 \
--use_consistency_loss 1
The implementation is based on https://github.com/huggingface/transformers
We also used some code from: https://github.com/microsoft/LoRA
For personal communication related to this package, please contact Yaqing Wang (wang5075@purdue.edu), Sahaj Agarwal (sahagar@microsoft.com), Subhabrata (Subho) Mukherjee (submukhe@microsoft.com) or Xiaodong Liu (xiaodl@microsoft.com).
@article{wang2022adamix,
title={AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning},
author={Wang, Yaqing and Agarwal, Sahaj and Mukherjee, Subhabrata and Liu, Xiaodong and Gao, Jing and Awadallah, Ahmed Hassan and Gao, Jianfeng},
journal={arXiv preprint arXiv:2205.12410},
year={2022}
}