Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation
Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu
Our Model "AlpaCaR" is pronounced as "/ˈælpəˈkɑːr/". The logo is generated by DALL·E 3.
- [2024.02] We release our 📄paper. If you have any questions about our project, please send an email to geyuanqaq@gmail.com.
- [2024.09] CaR has been accepted to EMNLP 2024 Main! 🎉🎉🎉
conda create --name car python=3.8
conda activate car
pip install poetry
poetry install
Download the IQS or COMET model from the Hugging Face link and save it under /CaR/Ranking/lightning_logs/.
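If you prefer to script the download, the sketch below uses `huggingface_hub`; the repository ID is a placeholder, so replace it with the one behind the Hugging Face link above.

```python
# Sketch: fetch the pretrained scoring model into the directory the ranking
# scripts expect. NOTE: "your-org/IQS" is a placeholder repo id -- substitute
# the actual id from the Hugging Face link in this README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/IQS",               # placeholder repo id
    local_dir="Ranking/lightning_logs/",  # run from the repo root (/CaR)
)
```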
Default setting
python Ranking/split_IQS.py --batch_size=128
Using another instruction file
python Ranking/split_IQS.py --input='XX.json'
'XX.json' must follow the same format as 'alpaca_data.json', illustrated below.
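For reference, `alpaca_data.json` is a JSON list of instruction records with `instruction`, `input`, and `output` fields; the values below are illustrative only.

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ..."
  }
]
```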
Default setting
python Clustering/cluster.py
Using another instruction file with scores
python Clustering/cluster.py --input='XX.json'
'XX.json' must follow the same format as './data/ranking_IQS_data.json', i.e., scored instruction data; see the sketch below.
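As a rough sketch of that format, the records are Alpaca-style instruction pairs with a quality score attached; the `score` field name and value below are assumptions, so check `./data/ranking_IQS_data.json` for the authoritative schema.

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ...",
    "score": 2.73
  }
]
```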
Instead of using the pretrained models, you can train your own model with the following command:
comet-train --cfg configs/models/{your_model_config}.yaml
IQS-specific YAML parameters:
instruction_metric:
  class_path: comet.models.InstructionMetric
  init_args:
    nr_frozen_epochs: 0.3
    keep_embeddings_frozen: True
    optimizer: AdamW
    encoder_learning_rate: 1.0e-06
    learning_rate: 1.5e-05
    layerwise_decay: 0.95
    encoder_model: XLM-RoBERTa
    pretrained_model: xlm-roberta-large
    pool: avg
    layer: mix
    layer_transformation: sparsemax
    layer_norm: False
    loss: mse
    dropout: 0.1
    batch_size: 8
    train_data:
      - data/APE_score_train.csv
    validation_data:
      - data/APE_score_valid.csv
    hidden_sizes:
      - 2048
      - 1024
    activations: Tanh

trainer: ../trainer.yaml
early_stopping: ../early_stopping.yaml
model_checkpoint: ../model_checkpoint.yaml
The training data format for IQS can be found under /CaR/Ranking/data/expert-revised, and for COMET under /CaR/Ranking/data/expert-revised-comet.
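Once a checkpoint is trained, it can be used for scoring through the COMET API the model builds on. The sketch below assumes the standard `load_from_checkpoint`/`predict` interface; the checkpoint path and the input field names are placeholders (the stock COMET regression metric expects `src`/`mt`/`ref`, so adapt the keys to whatever `InstructionMetric` and `split_IQS.py` actually consume).

```python
# Sketch: score instruction pairs with a trained checkpoint via the COMET API.
# ASSUMPTIONS: the checkpoint path is a placeholder, and the "instruction"/
# "input"/"output" keys mirror the Alpaca schema; adjust them to the fields
# your InstructionMetric checkpoint was trained on.
from comet import load_from_checkpoint

model = load_from_checkpoint("Ranking/lightning_logs/version_0/checkpoints/best.ckpt")

samples = [
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet ...",
    }
]

prediction = model.predict(samples, batch_size=8, gpus=1)
print(prediction.scores)  # one quality score per instruction pair
```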
If you find our paper useful, please consider citing:
@inproceedings{ge-etal-2024-clustering,
title = "Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation",
author = "Ge, Yuan and
Liu, Yilun and
Hu, Chi and
Meng, Weibin and
Tao, Shimin and
Zhao, Xiaofeng and
Ma, Hongxia and
Zhang, Li and
Chen, Boxing and
Yang, Hao and
Li, Bei and
Xiao, Tong and
Zhu, JingBo",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.28",
pages = "464--478",
abstract = "With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step involves ranking instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25{\%}). The second step involves preserving dataset diversity through a clustering process. In our experiment, CaR selected a subset containing only 1.96{\%} of Alpaca{'}s IT data, yet the underlying AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1{\%} in GPT-4 evaluations. Furthermore, our method utilizes small models (550M parameters) and requires only 11.2{\%} of the monetary cost compared to existing methods, making it easily deployable in industrial scenarios.",
}