This repository contains the source code for reproducing CD-RLHF. Our implementation is built on top of DeepSpeed-Chat.
We provide the environment used in our experiments in `requirements.txt`, which can be installed with:

```bash
cd CD-RLHF
pip install -r requirements.txt
```
We use `transformers==4.46.3` when training Llama-3.2 models.

Install our modified DeepSpeed-Chat package:

```bash
cd CD-RLHF/applications/DeepSpeed-Chat
pip install -e .
```
We conduct experiments on two datasets: OpenAI TL;DR and UltraFeedback.

We split each dataset into three parts to carry out Supervised Fine-Tuning (SFT), reward modeling, and Reinforcement Learning from Human Feedback (RLHF) fine-tuning. The data is partitioned as follows: 20% for SFT, 40% for reward modeling, and 40% for RLHF fine-tuning. This partitioning is handled automatically by the DeepSpeed-Chat code with the random seed set to 1234.
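The split itself is produced by DeepSpeed-Chat's data utilities; purely as an illustration (not the code the repository calls), a seeded 20/40/40 partition can be sketched as:

```python
# Illustrative sketch of a seeded 20/40/40 split; DeepSpeed-Chat's own
# data utilities may shuffle and slice differently.
import random

def split_dataset(num_samples: int, seed: int = 1234):
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    sft_end = int(0.2 * num_samples)            # 20% for SFT
    rm_end = sft_end + int(0.4 * num_samples)   # next 40% for reward modeling
    return {
        "sft": indices[:sft_end],
        "reward": indices[sft_end:rm_end],
        "rlhf": indices[rm_end:],               # remaining 40% for RLHF fine-tuning
    }

print({k: len(v) for k, v in split_dataset(10000).items()})
```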
```bash
cd applications/DeepSpeed-Chat/training/step1_supervised_finetuning && bash training_scripts/tldr/run_gemma_2b.sh
```

This command trains the SFT model with Gemma-2B on the TL;DR dataset.
```bash
cd applications/DeepSpeed-Chat/training/step2_reward_model_finetuning && bash training_scripts/tldr/run_gemma_wb.sh
```

This command trains the reward model with Gemma-2B on the TL;DR dataset.
```bash
cd applications/DeepSpeed-Chat/training/step3_rlhf_finetuning && bash training_scripts/tldr/run_gemma_2b.sh
```

This command trains the RLHF model with Gemma-2B on the TL;DR dataset using vanilla RLHF.
```bash
cd applications/DeepSpeed-Chat/training/step3_rlhf_finetuning && bash training_scripts/tldr/run_gemma_2b_cdrlhf.sh
```

This command trains the RLHF model with Gemma-2B on the TL;DR dataset using CD-RLHF.
In the scripts, the `--icm_learning_rate` and `--cdrlhf_topk` arguments are the CD-RLHF-specific hyper-parameters, controlling the learning rate of the intrinsic curiosity module (ICM) and the top-k selection used by CD-RLHF, respectively.
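The `--icm_learning_rate` flag implies that an intrinsic curiosity module is optimized alongside the policy. Purely as a generic illustration of the idea (this is not the repository's code; shapes and module layout are assumptions), a forward-dynamics curiosity signal can be computed as follows:

```python
# Generic ICM-style curiosity sketch (forward-dynamics prediction error).
# NOT the repository's implementation; dimensions and names are assumptions.
import torch
import torch.nn as nn

class ForwardDynamicsICM(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        # Embed the taken action (generated token) and predict the next hidden state.
        self.action_embed = nn.Embedding(vocab_size, hidden_dim)
        self.forward_model = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def curiosity(self, state, action, next_state):
        # Intrinsic reward = prediction error of the forward model on the next state.
        pred = self.forward_model(torch.cat([state, self.action_embed(action)], dim=-1))
        return ((pred - next_state) ** 2).mean(dim=-1)

icm = ForwardDynamicsICM(hidden_dim=2048, vocab_size=256000)
opt = torch.optim.Adam(icm.parameters(), lr=1e-5)  # controlled via --icm_learning_rate in the scripts
```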
We provide a Python script for inference at `CD-RLHF/evaluation/reward_model_score.py`. It randomly selects 2000 instances from the validation set (using seed 1234) for inference, and the generated responses are then scored with the reward model.
You can run the inference script with the following command:
```bash
cd CD-RLHF
python evaluation/reward_model_score.py \
    --dataset-path openai/summarize_from_feedback \
    --model-path ${actor_model_path} \
    --model-name gemma-2b-tldr-rlhf \
    --reward-model ${reward_model_path} \
    --gpus 0,1,2,3,4,5,6,7 \
    --batch-size 16
```
The `--dataset-path` argument can be either `openai/summarize_from_feedback` or `HuggingFaceH4/ultrafeedback_binarized`, corresponding to the TL;DR and UltraFeedback datasets, respectively. The inference results will be saved to `./results/rewards/summarize_from_feedback/gemma-2b-tldr-rlhf.jsonl`.
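Each output line is a JSON record. A minimal sketch for aggregating the scores, assuming each record stores its reward-model score under a `reward` key (check the actual field names written by `reward_model_score.py`), is:

```python
# Minimal sketch: average the reward-model scores stored in the output jsonl.
# The "reward" field name is an assumption.
import json

path = "./results/rewards/summarize_from_feedback/gemma-2b-tldr-rlhf.jsonl"
scores = []
with open(path) as f:
    for line in f:
        record = json.loads(line)
        scores.append(float(record["reward"]))

print(f"{len(scores)} samples, mean RM score = {sum(scores) / len(scores):.4f}")
```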
In addition, we provide `CD-RLHF/evaluation/generate_samples.py` for the diversity metric evaluation, which randomly selects 500 samples and generates 10 responses per prompt.
```bash
cd CD-RLHF
python evaluation/generate_samples.py \
    --dataset-path openai/summarize_from_feedback \
    --model-path ${actor_model_path} \
    --model-name gemma-2b-tldr-rlhf \
    --gpus 0,1,2,3,4,5,6,7
```
The arguments are the same as those used for `reward_model_score.py`. The results will be saved to `./results/generated/summarize_from_feedback/gemma-2b-tldr-rlhf.jsonl`. The RM scores have already been computed during the inference stage.
We implement the four diversity metrics used in our paper: Diversity, EAD, SelfBLEU, and SentBERT. These metrics can be calculated using `./evaluation/diversity_eval.py`:
```bash
python evaluation/diversity_eval.py \
    --file results/generated/summarize_from_feedback/gemma-2b-tldr-rlhf.jsonl,results/generated/summarize_from_feedback/gemma-2b-tldr-cdrlhf.jsonl
```
The `--file` argument accepts multiple files separated by commas.
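For reference, the Diversity metric is commonly computed as the product of distinct-n-gram ratios over several values of n. A small standalone sketch (which may differ from what `diversity_eval.py` does internally) is:

```python
# Rough sketch of a distinct-n style Diversity score: product over n of
# (# unique n-grams / # total n-grams) across a set of responses.
# Tokenization and aggregation in diversity_eval.py may differ.
def diversity_score(responses, ns=(2, 3, 4)):
    score = 1.0
    for n in ns:
        unique, total = set(), 0
        for resp in responses:
            tokens = resp.split()
            grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            unique.update(grams)
            total += len(grams)
        score *= len(unique) / max(total, 1)
    return score

print(diversity_score(["the cat sat on the mat", "the cat sat on the rug"]))
```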
For GPT-4 evaluations, we randomly select 50 instances from the inference results. This can be done with the following command:
```bash
python evaluation/sample_from_dataset.py \
    --data-path ./results/generated/${dataset}/${actor_model_name}.jsonl \
    --save-path ./results/generated/${dataset}/${actor_model_name}-sampled.jsonl \
    --dataset summarize
```
Specifically, we select 50 instances from the TL;DR dataset based on the provided SubReddit information with `--dataset summarize`, and 50 instances from the UltraFeedback dataset at random with `--dataset ultrafeedback`.
The GPT-4 evaluation results can be obtained using:
```bash
python evaluation/gpt4-eval.py \
    --model_name_a ./results/generated/${dataset}/${actor_model_name_a}-sampled.jsonl \
    --model_name_b ./results/generated/${dataset}/${actor_model_name_b}-sampled.jsonl \
    --output ${PROJ_PATH}/results/${dataset}/${actor_model_name_a}-v.s.-${actor_model_name_b}.jsonl \
    --sk ${OPENAI_SK}
```
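The evaluation issues pairwise comparisons between the two models' sampled outputs. A minimal sketch of such a judge call with the standard OpenAI Python client, assuming a simple "A or B" prompt format (the actual template and parsing in `gpt4-eval.py` may differ), is:

```python
# Minimal sketch of a pairwise GPT-4 comparison; the judging prompt and
# response parsing here are assumptions, not the repository's exact template.
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # corresponds to the --sk argument

def judge(prompt: str, response_a: str, response_b: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Which response answers the prompt better? Reply with 'A' or 'B'.\n\n"
                f"Prompt:\n{prompt}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}"
            ),
        }],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()
```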
To cite this work:

```bibtex
@misc{sun2025curiositydrivenreinforcementlearninghuman,
      title={Curiosity-Driven Reinforcement Learning from Human Feedback},
      author={Haoran Sun and Yekun Chai and Shuohuan Wang and Yu Sun and Hua Wu and Haifeng Wang},
      year={2025},
      eprint={2501.11463},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.11463},
}
```