This project focuses on fine-tuning large language models via Reinforcement Learning from Human Feedback (RLHF). Our primary objective is to enhance Google's Text-to-Text Transfer Transformer (T5) model using the OpenHermesPreferences dataset. For the optimization process, we employ Proximal Policy Optimization (PPO) to refine the model's ability to generate text that aligns more closely with human preferences and values, with PairRM serving as the reward model.
Ref: https://huggingface.co/blog/rlhf
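The end-to-end flow is: query the policy (T5) for responses, score them with the reward model (PairRM), and update the policy with PPO while penalizing divergence from a frozen reference model. The snippet below is a minimal sketch of one such PPO step using the classic TRL `PPOTrainer` API (TRL ≤ 0.7.x); the prompt, hyperparameters, and the placeholder `reward_fn` are illustrative assumptions, not this repository's actual implementation.

```python
# Minimal PPO/RLHF sketch (classic TRL PPOTrainer API, <= v0.7.x).
# The placeholder reward_fn stands in for PairRM scoring; main.py may differ.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForSeq2SeqLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(
    model_name="google-t5/t5-small",
    learning_rate=1e-5,   # assumed value; set via --lr in this repo
    batch_size=1,         # tiny batch just for this sketch
    mini_batch_size=1,
)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name)      # policy
ref_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(config.model_name)  # frozen reference for the KL penalty

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

def reward_fn(prompts, responses):
    # Placeholder: in this project, PairRM scores each response against its prompt.
    return [torch.tensor(0.0) for _ in responses]

prompts = ["Summarize: The quick brown fox jumps over the lazy dog."]
query_tensors = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# One PPO step: generate responses, score them, update the policy.
response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=64)
responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
rewards = reward_fn(prompts, responses)
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```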
- Clone this repository:
  ```bash
  git clone https://github.com/gtamer2/rl_final_project.git
  ```
- Install the dependencies:
  ```bash
  pip install -r requirements.txt
  ```
Execute the training script with the following command:

```bash
python main.py --model_name="google-t5/t5-small" --batch_size=32 --epochs=200 --mode="train"
```
- `batch_size`: Batch size for training.
- `epochs`: Number of training epochs.
- `model_name`: LLM model.
- `lr`: Learning rate for the optimizer.
- `model_save_path`: Path to save the trained model.
- `rewards_save_path`: Path to save the rewards.
- `dataset_size`: Number of data samples (use `-1` to train on the entire dataset).
- `seed`: Random seed.
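For reference, the following is a hypothetical sketch of how `main.py` might wire up these flags with `argparse`; the flag names mirror the list above, while the defaults shown are assumptions.

```python
# Hypothetical CLI wiring for the flags listed above (defaults are assumptions).
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="RLHF fine-tuning of T5 with PPO")
    parser.add_argument("--model_name", type=str, default="google-t5/t5-small", help="LLM model")
    parser.add_argument("--batch_size", type=int, default=32, help="Batch size for training")
    parser.add_argument("--epochs", type=int, default=200, help="Number of training epochs")
    parser.add_argument("--lr", type=float, default=1e-5, help="Learning rate for the optimizer")
    parser.add_argument("--model_save_path", type=str, default="my_ppo_model", help="Path to save the trained model")
    parser.add_argument("--rewards_save_path", type=str, default="reward.npy", help="Path to save the rewards")
    parser.add_argument("--dataset_size", type=int, default=-1, help="Number of samples; -1 uses the entire dataset")
    parser.add_argument("--seed", type=int, default=42, help="Random seed")
    parser.add_argument("--mode", type=str, choices=["train", "predict", "visualize"], default="train")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```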
Execute the prediction script with the following command:

```bash
python main.py --model_name="my_ppo_model" --batch_size=32 --mode="predict"
```
- `batch_size`: Batch size for prediction.
- `model_name`: LLM model (here, the fine-tuned model saved during training).
- `dataset_size`: Number of data samples (use `-1` to generate predictions for the entire test set).
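Conceptually, prediction amounts to loading the saved checkpoint and generating responses for the test prompts, roughly as in the sketch below; the checkpoint path matches the command above, while the prompt and generation settings are illustrative.

```python
# Illustrative prediction step: load the fine-tuned checkpoint and generate.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "my_ppo_model"  # directory saved after training
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

prompts = ["Explain reinforcement learning in one sentence."]  # example prompt
inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```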
To visualize the reward curve, use the following command:

```bash
python main.py --rewards_save_path="reward.npy" --mode="visualize"
```
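This mode reads the saved rewards from `reward.npy` and plots them. A minimal equivalent, assuming the file holds a 1-D array with one reward per PPO step:

```python
# Minimal reward-curve plot, assuming reward.npy stores one value per PPO step.
import numpy as np
import matplotlib.pyplot as plt

rewards = np.load("reward.npy")
running_avg = np.cumsum(rewards) / np.arange(1, len(rewards) + 1)

plt.plot(rewards, alpha=0.3, label="reward")
plt.plot(running_avg, label="running average")
plt.xlabel("PPO step")
plt.ylabel("reward")
plt.legend()
plt.show()
```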
| Model | Avg. Reward | Avg. BLEU Score | Avg. BERT Score |
|---|---|---|---|
| T5 Original | -11.2550 | 0.0024 | 0.0273 |
| T5 with RLHF | -4.7752 | 0.0143 | 0.0339 |
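The BLEU and BERTScore columns can be reproduced with the Hugging Face `evaluate` library, roughly as sketched below; the toy prediction/reference pair is illustrative, and the exact evaluation pipeline in this repository may differ.

```python
# Hedged sketch of the BLEU / BERTScore computation (toy data, not repo outputs).
import evaluate

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

bleu_score = bleu.compute(predictions=predictions, references=references)["bleu"]
bert_f1 = bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]

print(bleu_score, sum(bert_f1) / len(bert_f1))
```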
Average Reward Curve: