Paper implementation and adaptation of Self-Rewarding Language Models.
This project explores Self-Rewarding Language Models (Yuan et al., 2024), using LLM-as-a-Judge prompting to let a model score its own generations and self-improve. It integrates Low-Rank Adaptation (LoRA, Hu et al., 2021) so the model can be adapted efficiently without full fine-tuning.
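For context, the snippet below is an illustrative sketch of how LoRA adapters are typically attached with the `peft` library. The model name, rank, and target modules are placeholders and not necessarily what this project uses; the actual settings live in `config.yaml` under `peft_config`.

```python
# Illustrative only: attaching LoRA adapters with the `peft` library.
# The model name and hyperparameters below are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-dependent)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```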
```bash
./setup.sh
```
Note: This will create a virtual environment, install the required packages, and download the data.
In the `config.yaml` file, you can set the following parameters (an illustrative example follows the list):

- `cuda_visible_devices`: The GPU(s) to use (`0` for the first GPU, `1` for the second, or `0,1` for both)
- `model_name`: The name of the model to use, chosen from the Hugging Face Hub
- `tokenizer_name`: The name of the tokenizer to use, chosen from the Hugging Face Hub
- `wandb_enable`: `True` or `False`. If `True`, logs are sent to wandb
- `wandb_project`: The name of the wandb project
- `peft_config`: The PEFT configuration; adapt it to your needs
- `iterations`: The number of iterations used to self-improve the model
- `sft_training`: SFT training hyperparameters
- `dpo_training`: DPO training hyperparameters
- `generate_prompts`: The number of prompts to generate in each iteration
- `generate_responses`: The number of responses to generate per prompt in each iteration
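For orientation, here is a minimal sketch of what a `config.yaml` following this schema could look like. All values are placeholders rather than the project's defaults, and the exact keys inside `peft_config`, `sft_training`, and `dpo_training` depend on the implementation.

```yaml
# Illustrative example only -- values are placeholders, not project defaults.
cuda_visible_devices: "0"              # or "1", or "0,1" for both GPUs
model_name: "meta-llama/Llama-2-7b-hf" # any causal LM from the Hugging Face Hub
tokenizer_name: "meta-llama/Llama-2-7b-hf"
wandb_enable: False
wandb_project: "self-rewarding-lm"
peft_config:                           # LoRA settings passed to PEFT
  r: 16
  lora_alpha: 32
  lora_dropout: 0.05
iterations: 2                          # number of self-improvement iterations
sft_training:
  learning_rate: 2.0e-5
  num_train_epochs: 1
dpo_training:
  learning_rate: 5.0e-7
  beta: 0.1
generate_prompts: 32                   # prompts generated per iteration
generate_responses: 4                  # responses sampled per prompt
```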
To run the training, execute the following command:
```bash
python -m src.train.train
```
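Conceptually, each iteration follows the Self-Rewarding recipe from Yuan et al., 2024: the current model generates new prompts and candidate responses, scores those responses with an LLM-as-a-Judge prompt, and the highest- and lowest-scored responses form preference pairs for DPO training. The sketch below illustrates that loop only; the callables it receives (`generate_prompts`, `generate_responses`, `judge_score`) are hypothetical placeholders, not this repository's actual API.

```python
from typing import Callable, Dict, List


def self_rewarding_iteration(
    generate_prompts: Callable[[int], List[str]],         # hypothetical: model writes new prompts
    generate_responses: Callable[[str, int], List[str]],  # hypothetical: sample candidate answers
    judge_score: Callable[[str, str], float],             # hypothetical: LLM-as-a-Judge rubric score
    cfg: Dict,
) -> List[Dict[str, str]]:
    """One conceptual self-rewarding iteration: build DPO preference pairs
    from the model's own generations and its own judgments."""
    preference_pairs = []

    # 1. The current model writes new instruction prompts.
    prompts = generate_prompts(cfg["generate_prompts"])

    for prompt in prompts:
        # 2. Sample several candidate responses per prompt.
        responses = generate_responses(prompt, cfg["generate_responses"])

        # 3. The same model scores each response via an LLM-as-a-Judge prompt.
        scored = sorted(
            ((r, judge_score(prompt, r)) for r in responses),
            key=lambda pair: pair[1],
            reverse=True,
        )

        # 4. Highest- vs. lowest-scored response becomes a DPO preference pair;
        #    skip prompts where the judge cannot separate the candidates.
        (chosen, best), (rejected, worst) = scored[0], scored[-1]
        if best > worst:
            preference_pairs.append(
                {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            )

    # 5. These pairs are then used for DPO training of the next model iteration.
    return preference_pairs
```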