Releases: huggingface/trl
v0.10.1
We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:
Online DPO
Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:
- Eliminates the need for a pre-collected preference dataset (it's generated online)
- Enables continuous model improvement
- Yields better results than traditional DPO
To train models with this method, use the OnlineDPOTrainer
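The data flow described above can be sketched without any TRL dependency. The snippet below is a toy illustration only: `generate`, `reward`, and `build_online_pair` are hypothetical stand-ins for the policy's sampler and the reward model, not part of the `OnlineDPOTrainer` API.

```python
import random

def generate(prompt, rng):
    # Hypothetical stand-in for sampling a completion from the policy being trained.
    return prompt + " " + " ".join(rng.choice(["good", "bad", "ok"]) for _ in range(3))

def reward(completion):
    # Hypothetical stand-in for a reward model: counts "good" minus "bad" words.
    words = completion.split()
    return words.count("good") - words.count("bad")

def build_online_pair(prompt, rng):
    # Sample two completions on the fly, then let the reward model rank them.
    a, b = generate(prompt, rng), generate(prompt, rng)
    chosen, rejected = (a, b) if reward(a) >= reward(b) else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

rng = random.Random(0)
pairs = [build_online_pair(p, rng) for p in ["Review:", "Summary:"]]
```

Each resulting dictionary plays the role of one preference example that a DPO-style loss could consume, with no pre-collected preference dataset involved.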
Liger Triton kernels for supercharged SFT
- We've integrated LinkedIn's Liger Triton kernels into the `SFTTrainer` for faster throughput and lower memory usage. To use them, set `use_liger_kernel` in `SFTConfig`.
DPO for VLMs
- We've added support to align vision-language models with DPO, now covering the LLaVa-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the `dpo_visual.py` script as follows:
accelerate launch examples/scripts/dpo_visual.py \
--dataset_name HuggingFaceH4/rlaif-v_formatted \
--model_name_or_path google/paligemma-3b-pt-224 \
--trust_remote_code \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--output_dir dpo_paligemma_rlaif-v \
--bf16 \
--torch_dtype bfloat16
WinRate callback for LLM as a judge
- We've added support to compute win rates over the reference model for methods like DPO. To do so, configure the callback to point to the LLM as judge API (OpenAI or Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
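As a rough illustration of the quantity being reported, a win rate is the fraction of prompts for which the judge prefers the trained model's completion over the reference model's. The sketch below assumes a hypothetical encoding of the judge's decisions (0 means the trained model won, 1 means the reference won); it does not reproduce the `WinRateCallback` internals.

```python
def win_rate(judge_choices):
    # judge_choices[i] is 0 if the judge preferred the trained model's completion
    # for prompt i, and 1 if it preferred the reference model's completion.
    # This encoding is an assumption made for the example.
    if not judge_choices:
        raise ValueError("need at least one judged pair")
    wins = sum(1 for choice in judge_choices if choice == 0)
    return wins / len(judge_choices)

example_choices = [0, 0, 1, 0]
example_rate = win_rate(example_choices)
```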
Anchored Preference Optimisation (APO) for fine-grained human/AI feedback
- Added the APO method, which is an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the `DPOConfig`.
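To make the difference between the two variants concrete, here is a minimal per-example sketch of both losses in plain Python. It assumes the formulation from the APO paper as we understand it, where each log-ratio is the policy log-probability minus the reference log-probability of a completion; the function names and the `beta` default are illustrative, and this is not the `DPOTrainer` code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero(chosen_logratio, rejected_logratio, beta=1.0):
    # Pushes the winning log-ratio up and the losing log-ratio down.
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)
    loss_rejected = sigmoid(beta * rejected_logratio)
    return loss_chosen + loss_rejected

def apo_down(chosen_logratio, rejected_logratio, beta=1.0):
    # Pushes the winning log-ratio down, and the losing one further down
    # relative to the winning one.
    loss_chosen = sigmoid(beta * chosen_logratio)
    loss_rejected = 1.0 - sigmoid(beta * (chosen_logratio - rejected_logratio))
    return loss_chosen + loss_rejected
```

Under these assumed definitions, `apo_zero` is lowered by raising the winning log-ratio or lowering the losing one, while `apo_down` is lowered when both log-ratios fall and the losing one falls faster.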
What's Changed
- Set dev version by @vwxyzjn in #1817
- Upgrade GitHub actions by @qgallouedec in #1818
- DPO Llava 1.5 and PaliGemma support by @qgallouedec in #1797
- Delete unused benchmark.yml workflow by @AdnaneKhan in #1822
- Consistent use of trust_remote_code by @qgallouedec in #1806
- Fix: authentication token kwarg not passed when loading PEFT adapters by @mkopecki in #1825
- refactor trainer callbacks by @kashif in #1826
- Uniform `model_ref` naming by @qgallouedec in #1835
- fix ppov2_trainer tensorboard logging bug by @DZ9 in #1836
- Fix issues of KTOTrainer by @MAOJIASONG in #1840
- add link to DPO datasets collection by @davanstrien in #1845
- fix arg parsing in chat.py by @lvwerra in #1846
- DPO for VLM blog post in doc by @qgallouedec in #1844
- Add WinRateCallback and Judges by @lewtun in #1598
- Remove `CI_HUB_USER_TOKEN` by @qgallouedec in #1852
- Online DPO and Online trainer refactor by @vwxyzjn in #1809
- [online-DPO] online dpo cleanups by @kashif in #1864
- arXiv to HF Papers by @qgallouedec in #1870
- fix fsdp & qlora support by @eliebak in #1863
- Import missing `setup_chat_format` by @Rishav-hub in #1862
- Bug Fix while training using SFTTrainer with DataCollatorForCompletionOnlyLM by @Rishav-hub in #1861
- Small fixes to online dpo example by @edbeeching in #1879
- Skip BigBird save and load test until next transformers version by @qgallouedec in #1874
- Llama in modelling value head tests by @qgallouedec in #1878
- Improve judges by @qgallouedec in #1856
- [Do not merge] Re-add BigBird Pegasus save/load test by @qgallouedec in #1876
- Re-add BigBird Pegasus save/load test by @qgallouedec in #1882
- Move BCO to separate BCOTrainer with fixes by @claralp in #1869
- Update example overview documentation section by @qgallouedec in #1883
- fix dpo_trainer bug for LLMs without bos_token in config by @DZ9 in #1885
- Fix SFT for VLM example by @qgallouedec in #1865
- `evaluation_strategy` -> `eval_strategy` by @qgallouedec in #1894
- fix serialization of RunningMoments on multiple GPUs by @claralp in #1892
- [WIP] Fix CI by @qgallouedec in #1897
- Drop `setUpClass` in reward tester by @qgallouedec in #1895
- Support `IterableDataset` for `SFTTrainer` by @qgallouedec in #1899
- Fix data processing in ORPO example script by @qgallouedec in #1903
- [RPO] use loss from v3 of paper by @kashif in #1904
- Support Rank Stabilized LoRA in the ModelConfig/LoraConfig by @JohnGiorgi in #1877
- [Online-DPO] num_generation_per_prompt is fixed by @kashif in #1898
- Fix GPT2 sentiment notebook reward by @cemiu in #1738
- Fix `AlignPropTrainer` import by @qgallouedec in #1908
- Various args and test fix by @qgallouedec in #1909
- `lr_scheduler.step()` after `optimizer.step()` by @qgallouedec in #1918
- `torch.cuda.amp.autocast()` -> `torch.amp.autocast("cuda")` by @qgallouedec in #1921
- Fix orpo trainer loss device by @SunMarc in #1919
- Add transformers library name for TRL repos by @lewtun in #1922
- Standardize `dataset_num_proc` usage by @qgallouedec in #1925
- `PartialState().local_main_process_first()` when map in examples by @qgallouedec in #1926
- minor BCO fixes by @claralp in #1923
- Improve DPO/loss doc by @qgallouedec in #1929
- feat: anchored pref optimization by @karel-contextual in #1928
- Add tests for DPO for VLM by @qgallouedec in #1935
- fix model to save in ppov2 by @mnoukhov in #1776
- Optional Additional Loss to Center Reward Models' Outputs by @RylanSchaeffer in #1932
- Properly label all models when pushed to the hub by @qgallouedec in #1940
- Skip token in `push_to_hub` by @qgallouedec in #1945
- Fix model wrapping for online DPO by @lewtun in #1946
- Don't mark issues as stale if nobody answered by @qgallouedec in #1949
- Add a simple-to-understand example for online DPO by @vwxyzjn in #1947
- Log WandB tables on main process by @lewtun in #1951
- [ODPO] Fix global step for consistent checkpointing with global updates by @lewtun in #1950
- "help wanted" in label to exempt from stale by @qgallouedec in #1956
- Fix response truncation in examples/notebooks/gpt2-sentiment.ipynb by @qgallouedec in #1957
- [ODPO] Refactor training script to use messages API by @lewtun in #1958
- Support LLaVA-NeXT in Vision SFT by @qgallouedec in #1959
- Add i...
v0.9.6 release
We are excited to introduce the new v0.9.6 release, with many new exciting features and algorithms. The highlights are as follows:
- Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set `loss_type="simpo"` and `cpo_alpha=0` in the `CPOConfig` and use it with the `CPOTrainer`.
- Added AlignProp by @mihirp1998, a method for fine-tuning Stable Diffusion models using reward gradients.
- Added Efficient Exact Optimization (EXO) by @haozheji
We also included many important fixes and improvements such as fixing prints in the CLI with GCP containers by @alvarobartt. Enjoy the release!
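For readers unfamiliar with SimPO, the sketch below shows what "reference-free" and "regularizes output length" mean in practice: the implicit reward is the length-averaged log-probability under the policy alone, and the loss asks the winning reward to exceed the losing one by a margin. This follows the SimPO paper's formulation as we understand it; the function name, argument names, and default values are illustrative and are not the `CPOTrainer` implementation.

```python
import math

def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=1.0):
    # Length-normalized implicit rewards; no reference model is involved.
    chosen_reward = beta * chosen_logp / chosen_len
    rejected_reward = beta * rejected_logp / rejected_len
    # gamma is the target margin between the two rewards (an assumed default here).
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

Because the summed log-probability is divided by the completion length, a completion that is twice as long with twice the total log-probability receives the same reward, so length alone does not change the loss.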
What's Changed
- set dev version by @younesbelkada in #1710
- Add a variant of CPO, SimPO by @fe1ixxu in #1703
- [RPO] fix nll loss by @kashif in #1705
- fix yaml parser for derived config classes by @mnoukhov in #1713
- Fix default padding_value in dpo_config.py by @mnoukhov in #1692
- feat(ci): add trufflehog secrets detection by @McPatate in #1721
- ktotrainer: Refuse datasets which contain only one class of labels by @jetlime in #1724
- adds AOT by @imelnyk in #1701
- Workflow: Notify tests results on slack channel by @younesbelkada in #1744
- better trl parser with yaml config by @mnoukhov in #1739
- CI / core: Pin `numpy` to `!=2.0.0` for CI and to users by @younesbelkada in #1747
- `TrlParser`: Add ignore extra args option by @younesbelkada in #1748
- small KTO fixes by @kawine in #1734
- CPO / DPO: Fix red CI by @younesbelkada in #1749
- prepare deepspeed accomodate fp16 and bf16 by @mnoukhov in #1728
- CI / `KTOTrainer`: Remove old tests by @younesbelkada in #1750
- change the `process` function in the example of DPO by @AIR-hl in #1753
- Integrate f-divergence to DPO (Follow up) by @1485840691 in #1610
- Support for returning past_key_values from the model by @idanshen in #1742
- Fix masking of response tokens by @mertsayar8 in #1718
- Support num_train_epochs by @vwxyzjn in #1743
- Fix: Add dataset_text_field in examples/scripts/sft.py by @scottsuk0306 in #1758
- New sentiment and descriptiveness dataset by @vwxyzjn in #1757
- Add CPO-SimPO method by @fe1ixxu in #1760
- Added Reward Backpropogation Support by @mihirp1998 in #1585
- MoE Models: option to add load balancing loss by @claralp in #1765
- `evaluation_strategy` to `eval_strategy` by @qgallouedec in #1771
- add Efficient Exact Optimization (EXO) by @haozheji in #1735
- Remove the leading space in the tldr preference dataset by @vwxyzjn in #1773
- Fix Documentation Overflow Issues for Long URLs in SFTConfig by @Mubin17 in #1774
- Visual DPO by @qgallouedec in #1647
- [DOCS] fix docs and cli example script by @kashif in #1780
- Fixed typo in SFT trainer docs by @detsutut in #1788
- [SFT] add model_init_kwargs to training_args by @kashif in #1787
- Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig by @noahlt in #1794
- Clean examples by @qgallouedec in #1791
- Remove extra print in reward_trainer.py by @mnoukhov in #1799
- Fix `torch_dtype` handling in `{DPO,SFT}Trainer` when provided via CLI by @alvarobartt in #1807
- Fix `TRL_USE_RICH` environment variable handling by @alvarobartt in #1808
- 0.9.6 release by @vwxyzjn in #1816
New Contributors
- @McPatate made their first contribution in #1721
- @jetlime made their first contribution in #1724
- @imelnyk made their first contribution in #1701
- @AIR-hl made their first contribution in #1753
- @1485840691 made their first contribution in #1610
- @idanshen made their first contribution in #1742
- @mertsayar8 made their first contribution in #1718
- @scottsuk0306 made their first contribution in #1758
- @mihirp1998 made their first contribution in #1585
- @haozheji made their first contribution in #1735
- @Mubin17 made their first contribution in #1774
- @detsutut made their first contribution in #1788
- @noahlt made their first contribution in #1794
Full Changelog: v0.9.4...v0.9.6
v0.9.4
Mainly backward compatibility fixes with SFTTrainer.
What's Changed
- Fixed doc string and related docs for the SFTConfig update by @GuilhermeFreire in #1706
- SFTTrainer: Fix backward Compatibility issue with `TrainingArguments` by @younesbelkada in #1707
- 0.9.4 release by @vwxyzjn in #1708
New Contributors
- @GuilhermeFreire made their first contribution in #1706
Full Changelog: v0.9.3...v0.9.4
v0.9.3 RLOO / PPOv2 Trainer, RM Visualization
We are excited to introduce the new v0.9.3 release, with many new exciting features and algorithms. The highlights are as follows:
- RLOO Trainer: RLOO (Reinforce Leave-one-out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al from Cohere. Check out our docs here to get started
- PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer which is more aligned with OpenAI's PPO implementation based on https://arxiv.org/abs/2403.17031. Check out our docs here to get started
- Reward model visualization: the reward model training now includes visualization on the eval dataset.
- New losses in the DPO Trainer: DPOTrainer now includes losses / support for Self-play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment
- New losses in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO)
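The "leave-one-out" idea behind RLOO can be illustrated in a few lines: for several completions sampled from the same prompt, each completion's baseline is the mean reward of the other completions. The helper below is a hypothetical, dependency-free sketch of that baseline and is not the RLOO trainer's code.

```python
def rloo_advantages(rewards):
    # rewards: scalar rewards for k completions sampled from the same prompt.
    k = len(rewards)
    if k < 2:
        raise ValueError("leave-one-out needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    return [r - (total - r) / (k - 1) for r in rewards]

example_advantages = rloo_advantages([1.0, 2.0, 6.0])
```

With three samples, the completion with reward 6.0 is compared against the mean of 1.0 and 2.0, so it receives a positive advantage while the other two receive negative ones.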
What's Changed
- set dev version by @younesbelkada in #1568
- fix add_special_tokens issue for data with template by @edixiong in #1509
- [DPO] add 'bco_pair' loss_type by @seanexp in #1524
- [DPO] DPOConfig class by @kashif in #1554
- [SFT] add SFT Trainer Config dataclass by @kashif in #1530
- FIX: Fix CI on transformers main by @younesbelkada in #1576
- [`SFTTrainer`] Add warning in SFTTrainer when dataset already processed by @younesbelkada in #1577
- Fix typo detoxifying doc by @qgallouedec in #1594
- Core: removed unexisting `SftArgumentParser` by @younesbelkada in #1602
- [`KTOTrainer`] add BCO (reward shift and underlying distribution matching) by @seanexp in #1599
- [CLI] Use auto device map for model load by @lewtun in #1596
- Removing `tests/` from package data by @jamesbraza in #1607
- Docs: Fix build main documentation by @younesbelkada in #1604
- support loss function for Self-play Preference Optimization by @winglian in #1612
- Update HH dataset on helpful only subset by @vwxyzjn in #1613
- corrects loss function for Self-play Preference Optimization hard label version by @angelahzyuan in #1615
- Fix ZeRO-3 generation context manager by @lewtun in #1617
- fixed adding bos and eos token unconditionally by @jasonyux in #1591
- visualize rm prediction by @vwxyzjn in #1636
- [ORPO] Correct label mask for pad tokens by @IlyaGusev in #1625
- Update sft_llama2.py to work with the latest API by @xianbaoqian in #1637
- Fixed wrong logs prefixes in KTOTrainer by @bartoszzuk in #1641
- Pairwise Noise Contrastive Alignment by @winglian in #1632
- don't cast the trainable lora layers to half precision by @pacman100 in #1644
- PPO / Reinforce Trainers by @vwxyzjn in #1540
- Apply deprecated `evaluation_strategy` by @muellerzr in #1559
- FEAT: Add support for training collator in PPOTrainer by @younesbelkada in #1658
- Correct Documentation for cDPO Usage by @AliBakly in #1655
- Fix inheritance order in PPOv2Config by @Nicolinho in #1659
- [DPO] Add 'robust' loss_type by @Abilityguy in #1653
- 🤫 TR-DPO implementation by @syrn1k in #1593
- Do not upcast adapters when using FSDP+QLoRA by @pacman100 in #1654
- [Tests] update eval_strategy API by @kashif in #1662
- Fix ppov2 test case by @vwxyzjn in #1661
- FIX / PPO: Fix `enable_input_require_grads` issues with PPO models by @younesbelkada in #1664
- fix dataset load error by @sywangyi in #1670
- FIX / SFTTrainer: Fix SFTTrainer with `args=None` by @younesbelkada in #1678
- Fix max_completion_length for encoder_decoder models in KTO Trainer by @samuki in #1588
- intial RPO loss by @kashif in #1686
- Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig by @alexisrozhkov in #1690
- Skip packing validation by @alex-jw-brooks in #1673
- Fix typo in DPOTrainer's warnings by @qgallouedec in #1688
- Quick fix on GPT4-eval by @vwxyzjn in #1696
- Release 0.9.2 by @vwxyzjn in #1697
New Contributors
- @edixiong made their first contribution in #1509
- @seanexp made their first contribution in #1524
- @jamesbraza made their first contribution in #1607
- @winglian made their first contribution in #1612
- @angelahzyuan made their first contribution in #1615
- @jasonyux made their first contribution in #1591
- @IlyaGusev made their first contribution in #1625
- @xianbaoqian made their first contribution in #1637
- @bartoszzuk made their first contribution in #1641
- @muellerzr made their first contribution in #1559
- @AliBakly made their first contribution in #1655
- @Nicolinho made their first contribution in #1659
- @Abilityguy made their first contribution in #1653
- @syrn1k made their first contribution in #1593
- @alexisrozhkov made their first contribution in #1690
- @alex-jw-brooks made their first contribution in #1673
Full Changelog: v0.8.6...v0.9.2
v0.8.6: Fixes for CLI
What's Changed
- set dev version by @younesbelkada in #1556
- [CLI] Update `__init__.py` imports by @kashif in #1557
- CLI: Add warning when ignored params are passed + parse config file if config if passed by @younesbelkada in #1565
- Release: v0.8.6 by @younesbelkada in #1567
Full Changelog: v0.8.5...v0.8.6
v0.8.5: Important fixes for CLIs
What's Changed
- set dev version by @younesbelkada in #1548
- FIX: make the train / test fields modulable by @younesbelkada in #1551
- enable multiple eos tokens by @lvwerra in #1553
- Release: v0.8.5 by @younesbelkada in #1555
Full Changelog: v0.8.4...v0.8.5
v0.8.4: CLI / CPO / KTO important fixes
This patch release includes important fixes for the CLI and KTO & CPO trainers
What's Changed
- set dev version by @younesbelkada in #1529
- [CPO] fix memory leak due to retained value by @kashif in #1531
- VSFT hotfix - adds gen prompt to template and processor to hub by @edbeeching in #1532
- save_model -> save_pretrained in ppo_trainer.mdx by @ejmejm in #1537
- [KTO] support to load the adapter twice by @claralp in #1542
- CLI: Set `dataset_text_field` to `None` to allow ChatML automatic template by @younesbelkada in #1545
- FIX: Fix slow test by @younesbelkada in #1546
- Fixed ref model not used in PPO generation by @ejmejm in #1534
- Release: v0.8.4 by @younesbelkada in #1547
New Contributors
Full Changelog: v0.8.3...v0.8.4
v0.8.3: Patch release for CLI
What's Changed
This is a patch release that includes an import fix for CLIs
- set dev version by @younesbelkada in #1523
- [CLI] fix imports by @kashif in #1527
- Release: v0.8.3 by @younesbelkada in #1528
Full Changelog: v0.8.2...v0.8.3
v0.8.2: ORPO & CPO Trainer / Vision LLMs support for `SFTTrainer`, KTO fixes
ORPO Trainer & Vision LLMs support for SFTTrainer, KTO fixes
This release includes two new trainers: ORPO from KAIST and CPO.
The release also includes support for Vision LLMs such as Llava in `SFTTrainer`; please see https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.
ORPO Trainer
CPO Trainer
- Add CPOTrainer by @fe1ixxu in #1382
- Add `use_cache=False` in `{ORPO,CPO}Trainer.concatenated_forward` by @alvarobartt in #1478
- [ORPO] Update NLL loss to use `input_ids` instead by @alvarobartt in #1516
VLLMs support for SFTTrainer
You can now use `SFTTrainer` to fine-tune VLLMs such as Llava!
See: https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details
- Adds VLM Training support to SFTTrainer + VSFT script by @edbeeching in #1518
KTO Fixes
Many fixes were introduced for the KTOTrainer:
- Update KTO example to use better model and ChatML support by @lewtun in #1485
- [KTO] Use batching to speed up data processing by @lewtun in #1470
- Update KTO example with good dataset & chat format by @lewtun in #1481
- [KTO] fix interleaving, reporting, and hanging bugs by @kawine and @claralp in #1499
- [KTO] fix metric logging by @claralp in #1514
10x PPO !
Other fixes
- set dev version by @younesbelkada in #1463
- Use the standard dataset for DPO CLI by @vwxyzjn in #1456
- [peft] Update test_reward_trainer.py to fix tests by @kashif in #1471
- Fix hyperparameters in KTO example by @lewtun in #1474
- docs: add missing Trainer classes and sort alphabetically by @anakin87 in #1479
- hackey update to ModelConfig to allow lora_target_modules="all-linear" by @galtay in #1488
- Ignore chat files by @lewtun in #1486
- Add DPO link in README by @qgallouedec in #1502
- Fix typo in how_to_train.md by @ftorres16 in #1503
- Fix DPO Unsloth example in Docs by @arnavgarg1 in #1494
- Correct ppo_epochs usage by @muhammed-shihebi in #1480
- Fix `RichProgressCallback` by @eggry in #1496
- Change the device index to device:index by @yuanwu2017 in #1490
- FIX: use kwargs for RMTrainer by @younesbelkada in #1515
- Allow streaming (datasets.IterableDataset) by @BramVanroy in #1468
- Allow pre-tokenized datasets in SFTTrainer by @BramVanroy in #1520
- [DOC] Add data description for sfttrainer doc by @BramVanroy in #1521
- Release: v0.8.2 by @younesbelkada in #1522
New Contributors
- @fe1ixxu made their first contribution in #1382
- @anakin87 made their first contribution in #1479
- @galtay made their first contribution in #1488
- @qgallouedec made their first contribution in #1502
- @ftorres16 made their first contribution in #1503
- @arnavgarg1 made their first contribution in #1494
- @muhammed-shihebi made their first contribution in #1480
- @eggry made their first contribution in #1496
- @claralp made their first contribution in #1514
Full Changelog: v0.8.1...v0.8.2
v0.8.1: Patch release for CLIs
This patch release includes some important fixes for CLIs
What's Changed
- set dev version by @younesbelkada in #1454
- Fix chat CLI for model revisions by @lewtun in #1458
- [chat] add eos token to generate by @lvwerra in #1459
- Release: v0.8.1 by @younesbelkada in #1462
Full Changelog: v0.8.0...v0.8.1