Add Chunked SimPO Loss #386
Conversation
Great PR @pramodith! Could you rebase?
Good to go now!
lgtm
```python
        beta (float): Weight for the odds ratio loss.
        gamma (float): The simpo gamma, margin term.
    """
    logits = beta * (chosen_logps - rejected_logps) - gamma
```
Shouldn't this be normalized by length, as you said in the description?
Nvm, the logps are already averaged
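For context, here is a minimal sketch of how the quoted line fits into the full loss, assuming `chosen_logps` and `rejected_logps` are already length-averaged per-sequence log-probabilities (the function name and default values are illustrative, not the PR's actual API):

```python
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,    # length-averaged log p(y_w | x), shape (batch,)
    rejected_logps: torch.Tensor,  # length-averaged log p(y_l | x), shape (batch,)
    beta: float = 2.0,             # illustrative default, not the PR's
    gamma: float = 0.5,            # illustrative default, not the PR's
) -> torch.Tensor:
    # The margin term gamma requires the chosen response to beat the
    # rejected one by at least gamma in beta-scaled average-logp units.
    logits = beta * (chosen_logps - rejected_logps) - gamma
    # Negative log-sigmoid of the margin, averaged over the batch.
    return -F.logsigmoid(logits).mean()
```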
Summary
This PR adds the Simple Preference Optimization (SimPO) loss function. The only difference between SimPO and CPO is a margin term `gamma`, which specifies that the preferred response should be at least gamma logits better than the rejected response. Note that SimPO explicitly requires that $$\pi_\theta(y|x)$$ be normalized by length, unlike DPO. This corresponds to Eq. 6 in the paper.
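For reference, my reading of that objective (Eq. 6 of the SimPO paper), with length-normalized log-probabilities:

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)\right]$$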
Testing Done
- GPU A100-80G-SXM
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence