Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation
Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Boxing Chen, Hao Yang, Bei Li, Tong Xiao, Jingbo Zhu
Our Model "AlpaCaR" is pronounced as "/ˈælpəˈkɑːr/". The logo is generated by DALL·E 3.
- [2024.02] We release our 📄paper. If you have any questions about our project, please send an email to geyuanqaq@gmail.com.
- [2024.09] CaR has been accepted to EMNLP 2024 Main! 🎉🎉🎉
conda create --name car python=3.8
conda activate car
pip install poetry
poetry install
Download the IQS or COMET model from the Hugging Face link and save it under /CaR/Ranking/lightning_logs/.
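If you prefer to script the download, the sketch below uses `huggingface_hub`; the repository ID is a placeholder, so replace it with the one behind the Hugging Face link above.

```python
# Sketch: fetch the pretrained scoring model into the directory the ranking
# scripts expect. NOTE: "your-org/IQS" is a placeholder repo id -- substitute
# the actual id from the Hugging Face link in this README.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="your-org/IQS",               # placeholder repo id
    local_dir="Ranking/lightning_logs/",  # run from the repo root (/CaR)
)
```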
Default setting
python Ranking/split_IQS.py --batch_size=128
Using another instruction file
python Ranking/split_IQS.py --input='XX.json'
'XX.json' must follow the same format as 'alpaca_data.json', illustrated below.
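For reference, `alpaca_data.json` is a JSON list of instruction records with `instruction`, `input`, and `output` fields; the values below are illustrative only.

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ..."
  }
]
```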
Default setting
python Clustering/cluster.py
Using another instruction file with scores
python Clustering/cluster.py --input='XX.json'
'XX.json' must follow the same format as './data/ranking_IQS_data.json', i.e., scored instruction data; see the sketch below.
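As a rough sketch of that format, the records are Alpaca-style instruction pairs with a quality score attached; the `score` field name and value below are assumptions, so check `./data/ranking_IQS_data.json` for the authoritative schema.

```json
[
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet ...",
    "score": 2.73
  }
]
```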
Instead of using the pretrained models, you can train your own model with the following command:
comet-train --cfg configs/models/{your_model_config}.yaml
IQS-specific YAML parameters:
instruction_metric:
  class_path: comet.models.InstructionMetric
  init_args:
    nr_frozen_epochs: 0.3
    keep_embeddings_frozen: True
    optimizer: AdamW
    encoder_learning_rate: 1.0e-06
    learning_rate: 1.5e-05
    layerwise_decay: 0.95
    encoder_model: XLM-RoBERTa
    pretrained_model: xlm-roberta-large
    pool: avg
    layer: mix
    layer_transformation: sparsemax
    layer_norm: False
    loss: mse
    dropout: 0.1
    batch_size: 8
    train_data:
      - data/APE_score_train.csv
    validation_data:
      - data/APE_score_valid.csv
    hidden_sizes:
      - 2048
      - 1024
    activations: Tanh

trainer: ../trainer.yaml
early_stopping: ../early_stopping.yaml
model_checkpoint: ../model_checkpoint.yaml
The training data format for IQS can be found under /CaR/Ranking/data/expert-revised, and for COMET under /CaR/Ranking/data/expert-revised-comet.
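Once a checkpoint is trained, it can be used for scoring through the COMET API the model builds on. The sketch below assumes the standard `load_from_checkpoint`/`predict` interface; the checkpoint path and the input field names are placeholders (the stock COMET regression metric expects `src`/`mt`/`ref`, so adapt the keys to whatever `InstructionMetric` and `split_IQS.py` actually consume).

```python
# Sketch: score instruction pairs with a trained checkpoint via the COMET API.
# ASSUMPTIONS: the checkpoint path is a placeholder, and the "instruction"/
# "input"/"output" keys mirror the Alpaca schema; adjust them to the fields
# your InstructionMetric checkpoint was trained on.
from comet import load_from_checkpoint

model = load_from_checkpoint("Ranking/lightning_logs/version_0/checkpoints/best.ckpt")

samples = [
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet ...",
    }
]

prediction = model.predict(samples, batch_size=8, gpus=1)
print(prediction.scores)  # one quality score per instruction pair
```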
If you find our paper useful, please consider citing:
@inproceedings{ge-etal-2024-clustering,
title = "Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation",
author = "Ge, Yuan and
Liu, Yilun and
Hu, Chi and
Meng, Weibin and
Tao, Shimin and
Zhao, Xiaofeng and
Ma, Hongxia and
Zhang, Li and
Chen, Boxing and
Yang, Hao and
Li, Bei and
Xiao, Tong and
Zhu, JingBo",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.28",
pages = "464--478",
abstract = "With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step involves ranking instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25{\%}). The second step involves preserving dataset diversity through a clustering process. In our experiment, CaR selected a subset containing only 1.96{\%} of Alpaca{'}s IT data, yet the underlying AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1{\%} in GPT-4 evaluations. Furthermore, our method utilizes small models (550M parameters) and requires only 11.2{\%} of the monetary cost compared to existing methods, making it easily deployable in industrial scenarios.",
}