(NeurIPS 2024 Oral) Aligner: Efficient Alignment by Learning to Correct

This repository contains the source code for our NeurIPS 2024 paper Aligner: Efficient Alignment by Learning to Correct.

Jiaming Ji*, Boyuan Chen*, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu and Yaodong Yang

Work done by PKU-Alignment Team

Abstract

With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios necessitates the development of a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that learns the correctional residuals between preferred and dispreferred answers using a small model. Designed as a model-agnostic, plug-and-play module, Aligner can be directly applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream model. Moreover, it can even iteratively bootstrap the upstream models using corrected responses as synthetic human preference data, breaking through the model's performance ceiling. Our experiments demonstrate performance improvements by deploying the same Aligner model across 11 different LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and honesty). Specifically, Aligner-7B has achieved an average improvement of 68.9% in helpfulness and 23.8% in harmlessness across the tested LLMs while also effectively reducing hallucination. On the AlpacaEval leaderboard, stacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0% to 58.3%, surpassing GPT-4 Omni's 57.5% Win Rate (community report).

See our website for more details: https://pku-aligner.github.io/

Aligner: Efficient Alignment by Learning to Correct

Architecture of the Aligner module.

As a plug-and-play module, Aligner stacks on top of an upstream LLM. It redistributes the initial answers from the upstream model into more helpful and harmless answers, thus aligning the responses of the composed system with human intentions.

Illustration of its behavior in architecture and semantic space.

Like a residual block that adds modifications via a shortcut without altering the base structure, the Aligner employs a copy-and-correct method to improve the original answer. This analogy highlights the Aligner's dual role: it preserves the parameters of the upstream model while enhancing its outputs to align with desired outcomes.
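To make the stacking concrete, here is a minimal shell sketch of inference-time composition. `upstream_generate`, `aligner_correct`, and `aligned_answer` are hypothetical stand-ins, not functions from this repository; a real deployment would call the upstream LLM and the trained Aligner instead. The Aligner stub passes the upstream answer through unchanged, which corresponds to the identity shortcut in the residual analogy: the correction model only has to learn the edit on top of that copy.

```bash
#!/usr/bin/env bash
# Sketch of stacking an Aligner on an upstream LLM.
# All three functions are hypothetical stand-ins for real model calls.

# Stand-in for any upstream LLM (open-source or API-based); it is never modified.
upstream_generate() {
  local question="$1"
  printf 'Draft answer to: %s' "$question"
}

# Stand-in for the Aligner. It is conditioned on BOTH the question and the
# upstream answer. With nothing to fix it copies the answer through unchanged,
# mirroring the identity shortcut of a residual block; a trained Aligner
# would return a corrected answer instead.
aligner_correct() {
  local question="$1" answer="$2"
  : "$question"  # a real Aligner would use the question as context
  printf '%s' "$answer"
}

# Composed system: question -> upstream draft -> Aligner correction.
aligned_answer() {
  local question="$1" draft
  draft="$(upstream_generate "$question")"
  aligner_correct "$question" "$draft"
}

aligned_answer "What is a residual block?"
echo
```

Swapping in a different upstream model only changes `upstream_generate`; the Aligner side stays the same, which is the model-agnostic property described above.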

Performance of Aligner Models

Aligner achieves significant performance gains in all settings. Every assessment in this table was conducted by stacking an Aligner on each upstream model and comparing the composed system against the original model, quantifying the percentage increase along the 3H dimensions. Across all of these upstream models, the Aligner requires only a single training session (i.e., the same Aligner operates in a zero-shot manner on each upstream model and enhances the performance of all of them).
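As a rough illustration of how pairwise comparisons between an upstream model with and without the Aligner can be collapsed into a single percentage, the shell sketch below computes a net win rate. Both the helper name `net_win_rate` and the formula (wins minus losses, divided by all comparisons) are assumptions for illustration; the paper defines the exact metric behind the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper: net win rate of (upstream + Aligner) versus the upstream
# model alone, from pairwise judge counts. ASSUMED formula:
#   (wins - losses) / (wins + ties + losses) * 100
net_win_rate() {
  local wins="$1" ties="$2" losses="$3"
  local total=$(( wins + ties + losses ))
  if (( total == 0 )); then
    echo "no comparisons" >&2
    return 1
  fi
  # awk handles the floating-point division that bash arithmetic cannot.
  awk -v w="$wins" -v l="$losses" -v n="$total" \
    'BEGIN { printf "%.1f\n", (w - l) / n * 100 }'
}

# 60 wins, 25 ties, 15 losses out of 100 comparisons
net_win_rate 60 25 15
```

Ties count toward the denominator but not the numerator, so a composed system that mostly ties with the original scores close to zero.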

More Details

For more details, please refer to our website.

Installation

Clone the source code from GitHub:

```bash
git clone https://github.com/cby-pku/aligner.git
cd aligner
```

Native Runner: Set up a conda environment using conda / mamba:

```bash
conda env create --file conda-recipe.yaml  # or `mamba env create --file conda-recipe.yaml`
```

Training

This repository (aligner) supports a complete pipeline for Aligner residual-correction training.

Follow the instructions in the Installation section to set up the training environment properly.

```bash
conda activate aligner
export WANDB_API_KEY="..."  # your W&B API key here
```

Supervised Fine-Tuning (SFT)

```bash
bash scripts/sft-correction.sh \
    --train_datasets <your-correction-dataset> \
    --model_name_or_path <your-model-name-or-checkpoint-path> \
    --output_dir output/sft
```

NOTE:

- You may need to update some of the parameters in the script according to your machine setup, such as the number of GPUs for training and the training batch size.
- Your dataset format should be consistent with aligner/template-dataset.json.
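For a first smoke test, a tiny correction dataset can be written by hand. The sketch below assumes each record holds a `question`, the upstream `answer`, and its improved `correction`, stored as a JSON list; these field names and the top-level layout are assumptions, so compare them against aligner/template-dataset.json before training. The file name `my-correction-dataset.json` is hypothetical, and the validation step assumes `python3` is on your PATH.

```bash
#!/usr/bin/env bash
set -euo pipefail
# Writes a one-record correction dataset for a smoke test.
# ASSUMPTION: field names (question / answer / correction) and the JSON-list
# layout; aligner/template-dataset.json is the authoritative format.
out_file="my-correction-dataset.json"  # hypothetical file name

cat > "$out_file" <<'EOF'
[
  {
    "question": "How can I get rid of a headache quickly?",
    "answer": "Just take a lot of painkillers.",
    "correction": "Rest in a quiet, dark room, drink some water, and if needed take an over-the-counter pain reliever at the dose on the label. See a doctor if headaches are severe or frequent."
  }
]
EOF

# Abort if the file is not valid JSON (assumes python3 is on PATH).
python3 -m json.tool "$out_file" > /dev/null
echo "wrote $out_file"
```

Whether `--train_datasets` accepts this file path directly depends on scripts/sft-correction.sh, so check the script before launching a run.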

Dataset & Models

We have open-sourced a 20K training dataset and a 7B Aligner model. Further datasets and models will be released soon.

Citation

Please cite our work if you find it useful and meaningful.

```bibtex
@article{ji2024aligner,
  title={Aligner: Achieving efficient alignment through weak-to-strong correction},
  author={Ji, Jiaming and Chen, Boyuan and Lou, Hantao and Hong, Donghai and Zhang, Borong and Pan, Xuehai and Dai, Juntao and Yang, Yaodong},
  journal={arXiv preprint arXiv:2402.02416},
  year={2024}
}
```

Acknowledgment

This repository benefits from LLaMA, Stanford Alpaca, DeepSpeed, DeepSpeed-Chat and Safe-RLHF.

We thank them for their wonderful work and their efforts to advance LLM research. Aligner and its related assets are built and open-sourced with love and respect ❤️.

This work is supported and funded by Peking University.

License

Aligner is released under Apache License 2.0.
