
Releases: huggingface/trl

v0.10.1

29 Aug 14:34

We are excited to introduce the v0.10.1 release, which brings many exciting new features and post-training algorithms. The highlights are as follows:

Online DPO


Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:

  • Eliminates the need for a pre-collected preference dataset (it's generated online)
  • Enables continuous model improvement
  • Yields better results than traditional DPO

To train models with this method, use the OnlineDPOTrainer.
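
A minimal sketch of what an Online DPO run can look like; the reward model and dataset names below are placeholders, and the exact OnlineDPOTrainer / OnlineDPOConfig argument names should be checked against the docs for this release:

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import OnlineDPOConfig, OnlineDPOTrainer

# Placeholder checkpoints: any causal LM plus a compatible scalar reward model.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
reward_model = AutoModelForSequenceClassification.from_pretrained("my-org/my-reward-model", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# A prompt-only dataset: completions are generated on the fly during training.
train_dataset = load_dataset("my-org/my-prompt-dataset", split="train")

training_args = OnlineDPOConfig(output_dir="online-dpo")
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()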

Liger Triton kernels for supercharged SFT


  • We've integrated LinkedIn's Liger Triton kernels into the SFTTrainer for higher throughput and lower memory usage. To use them, set use_liger_kernel in the SFTConfig
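
A minimal sketch, assuming a plain-text dataset; the dataset name is a placeholder, and the flag name is taken from the note above (double-check it against SFTConfig in your installed version):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("my-org/my-text-dataset", split="train")  # placeholder dataset with a "text" column

training_args = SFTConfig(
    output_dir="sft-liger",
    dataset_text_field="text",
    use_liger_kernel=True,  # enable LinkedIn's Liger Triton kernels
)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # SFTTrainer also accepts a model id string
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()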

DPO for VLMs

  • We've added support for aligning vision-language models with DPO, covering the LLaVA-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the dpo_visual.py script as follows:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16

WinRate callback for LLM as a judge

  • We've added support for computing win rates against the reference model for methods like DPO. To do so, configure the callback to point to an LLM-as-a-judge API (OpenAI or the Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
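
Extending the snippet above, the judge itself might be wired up along these lines; HfPairwiseJudge is an assumption here (an OpenAI-backed judge is the other option mentioned above), so check the docs of this release for the exact class names:

from trl import HfPairwiseJudge, WinRateCallback

# Assumption: HfPairwiseJudge asks a model served via the Hugging Face Inference API
# to pick the preferred completion between the trained model and the reference model.
judge = HfPairwiseJudge()
win_rate_callback = WinRateCallback(judge=judge, trainer=trainer)
trainer.add_callback(win_rate_callback)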

Anchored Preference Optimisation (APO) for fine-grained human/AI feedback

  • Added the APO method, an "anchored" version of the alignment objective with two variants: apo_zero and apo_down. The apo_zero loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. The apo_down loss decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs; this variant is more effective when the model is already better than the winning outputs. To use these losses, set loss_type="apo_zero" or loss_type="apo_down" in the DPOConfig.
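
A minimal sketch of selecting an APO loss; model, tokenizer, and the preference dataset are assumed to be defined as for a regular DPO run:

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="dpo-apo",
    loss_type="apo_zero",  # use "apo_down" when the model already beats the winning outputs
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # preference dataset with prompt / chosen / rejected
    tokenizer=tokenizer,
)
trainer.train()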

What's Changed


v0.9.6 release

08 Jul 13:51
314e8eb

We are excited to introduce the v0.9.6 release, which brings many new features and algorithms. The highlights are as follows:

  • Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set loss_type="simpo" and cpo_alpha=0 in the CPOConfig and train with the CPOTrainer, as in the sketch below.
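
A minimal sketch, assuming model, tokenizer, and a preference dataset are already defined; the two settings come from the note above:

from trl import CPOConfig, CPOTrainer

training_args = CPOConfig(
    output_dir="simpo",
    loss_type="simpo",  # SimPO loss
    cpo_alpha=0.0,      # drop the extra CPO regularization term, as noted above
)
trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()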

We also included many important fixes and improvements, such as a fix for CLI prints in GCP containers by @alvarobartt. Enjoy the release!

What's Changed

New Contributors

Full Changelog: v0.9.4...v0.9.6

v0.9.4

06 Jun 14:17
974b0d3

This release mainly contains backward-compatibility fixes for the SFTTrainer.

What's Changed

New Contributors

Full Changelog: v0.9.3...v0.9.4

v0.9.3 RLOO / PPOv2 Trainer, RM Visualization

05 Jun 16:08
c0819ee

We are excited to introduce the v0.9.3 release, which brings many new features and algorithms. The highlights are as follows:

  1. RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out the docs to get started; a sketch of wiring it up is shown after this list
  2. PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer, which is more closely aligned with OpenAI's original PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out the docs to get started
  3. Reward model visualization: reward model training now includes visualization on the eval dataset
  4. New losses in the DPO Trainer: DPOTrainer now includes losses / support for Self-play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment
  5. New losses in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO)
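
As referenced in item 1, a sketch of wiring up the new RLOO trainer; the constructor arguments below (policy, ref_policy, reward_model) follow the example script of this release and should be double-checked against the docs, and the models, tokenizer, and dataset are assumed to be defined elsewhere:

from trl import RLOOConfig, RLOOTrainer

training_args = RLOOConfig(output_dir="rloo")
trainer = RLOOTrainer(
    config=training_args,
    tokenizer=tokenizer,          # shared by policy and reward model
    policy=policy,                # causal LM being trained
    ref_policy=ref_policy,        # frozen copy used for the KL penalty
    reward_model=reward_model,    # sequence-classification reward model
    train_dataset=train_dataset,  # prompts only; completions are sampled online
)
trainer.train()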

What's Changed

New Contributors

Full Changelog: v0.8.6...v0.9.2

v0.8.6: Fixes for CLI

22 Apr 08:59
e90e8d9

What's Changed

Full Changelog: v0.8.5...v0.8.6

v0.8.5: Important fixes for CLIs

18 Apr 11:58
3595eb0

What's Changed

Full Changelog: v0.8.4...v0.8.5

v0.8.4: CLI / CPO / KTO important fixes

17 Apr 15:22
a5788ac

This patch release includes important fixes for the CLI and the KTO & CPO trainers.

What's Changed

New Contributors

Full Changelog: v0.8.3...v0.8.4

v0.8.3: Patch release for CLI

12 Apr 10:25
9822647

This is a patch release that includes an import fix for the CLIs.

What's Changed

Full Changelog: v0.8.2...v0.8.3

v0.8.2: ORPO & CPO Trainer / Vision LLMs support for `SFTTrainer`, KTO fixes

11 Apr 13:51
143e111

ORPO Trainer & Vision LLMs support for SFTTrainer, KTO fixes

This release includes two new trainers: ORPO from KAIST and CPO.
The release also adds support for Vision LLMs such as LLaVA in the SFTTrainer; see https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.

ORPO Trainer
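
A minimal sketch of how the new ORPOTrainer can be used on a preference dataset; model, tokenizer, and dataset are assumed to be defined elsewhere, and the ORPOConfig fields should be checked against the ORPO docs:

from trl import ORPOConfig, ORPOTrainer

# ORPO is reference-model-free: the odds-ratio preference term is added on top of the SFT loss.
training_args = ORPOConfig(output_dir="orpo")
trainer = ORPOTrainer(
    model=model,                  # causal LM, e.g. loaded with AutoModelForCausalLM
    args=training_args,
    train_dataset=train_dataset,  # preference dataset with prompt / chosen / rejected
    tokenizer=tokenizer,
)
trainer.train()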

CPO Trainer

Vision LLMs support for SFTTrainer

You can now use the SFTTrainer to fine-tune Vision LLMs such as LLaVA!
See https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.

KTO Fixes

Many fixes were introduced for the KTOTrainer:

  • Update KTO example to use better model and ChatML support by @lewtun in #1485
  • [KTO] Use batching to speed up data processing by @lewtun in #1470
  • Update KTO example with good dataset & chat format by @lewtun in #1481
  • [KTO] fix interleaving, reporting, and hanging bugs by @kawine and @claralp in #1499
  • [KTO] fix metric logging by @claralp in #1514

10x PPO!

Other fixes

New Contributors

Full Changelog: v0.8.1...v0.8.2

v0.8.1: Patch release for CLIs

20 Mar 10:39
8534f0e

This patch release includes some important fixes for the CLIs.

What's Changed

Full Changelog: v0.8.0...v0.8.1