
Direct Preference Optimization #530

Merged: 48 commits into main from max/dpo2 on Dec 14, 2023

Conversation

maxjeblick (Contributor):

This PR adds DPO (Direct Preference Optimization, https://github.com/eric-mitchell/direct-preference-optimization) as a new problem type. IPO can also be selected via the associated loss function.
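
For reference, a minimal sketch of the two loss variants (not the code merged in this PR; the function and argument names are my own), operating on the summed log-probabilities of the chosen and rejected answers under the trained policy and the frozen reference model:

import torch
import torch.nn.functional as F

def preference_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    loss_type: str = "dpo",
) -> torch.Tensor:
    # Log-ratios of the policy vs. the frozen reference model
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    margin = chosen_logratios - rejected_logratios
    if loss_type == "dpo":
        # DPO: maximize the log-sigmoid of the beta-scaled margin
        losses = -F.logsigmoid(beta * margin)
    elif loss_type == "ipo":
        # IPO: regress the margin towards 1 / (2 * beta)
        losses = (margin - 1 / (2 * beta)) ** 2
    else:
        raise ValueError(f"Unknown loss_type: {loss_type}")
    return losses.mean()

The beta-scaled log-ratios double as implicit rewards, which is what the reward margin plot mentioned under follow-up work would visualize.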

Apart from adding the new problem type, the following changes have been made:

  • A new default dataset for DPO is loaded: https://huggingface.co/datasets/Intel/orca_dpo_pairs (see the loading sketch after this list)
  • Code to create an HH DPO dataset compatible with the LLM Studio format has been added
  • Default dataset creation now shows a progress pop-up that informs the user
  • Some small refactoring in various places
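
For illustration, a minimal sketch of pulling the default dataset and flattening it into a prompt / chosen / rejected table (the target column names below are assumptions, not necessarily the exact columns the DPO problem type expects):

from datasets import load_dataset

# Each row of Intel/orca_dpo_pairs contains a system message, a question,
# a preferred ("chosen") answer and a dispreferred ("rejected") answer.
ds = load_dataset("Intel/orca_dpo_pairs", split="train")
df = ds.to_pandas()

# Rename to generic prompt / answer columns (assumed names; adjust to the
# columns configured in the DPO experiment settings).
df = df.rename(
    columns={
        "question": "prompt",
        "chosen": "chosen_response",
        "rejected": "rejected_response",
    }
)
df.to_parquet("orca_dpo_pairs_train.pq", index=False)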

Possible follow-up work:

  • Add better insights; Validation Insights and Train Data Insights do not show the rejected answer yet. I left this out to keep the PR slimmer.
  • Add a reward margin plot to the charts.
  • Add a problem type dropdown selection to the dataset import.

@pascal-pfeiffer (Collaborator) left a comment:

Thanks a LOT @maxjeblick

Very high quality PR, and it works flawlessly on the many different setups and datasets that I tested.
One thing we should change in the future is the dataset import: currently, only the default settings for Causal modeling can be set during import, so one always needs to change them when starting an experiment for, e.g., DPO training.

Also, rewards are logged but never displayed (only when using Neptune); a potentially good new feature, as you mentioned.

While it is still not easy to get better results than with standard fine-tuning, DPO is far more user friendly, so I am rooting for fully replacing RLHF (via PPO) with it in a subsequent PR.

Minor change needed:

  • The Rejected Answer column is missing a tooltip

experiment_name: str = field(default_factory=generate_experiment_name)
_parent_experiment: str = ""
# 7b model may be unstable (NaN loss)
llm_backbone: str = "h2oai/h2ogpt-4096-llama2-13b-chat"
Collaborator:

Nitpick:
Maybe we should replace the default with an "h2ogpt-gm" fine-tune that already uses the same prompting style.

maxjeblick (Contributor, Author):

Do you have a specific model in mind? I could also change the default prompt style values.

Collaborator:

Yes, probably just change the default values to something that works, such as mistralai/Mistral-7B-Instruct-v0.1 and its prompting style.
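
A rough sketch of what such a default change could look like, modeled on the config excerpt above (the class name, prompt-template fields, and template values are illustrative assumptions, not the defaults that ended up being merged):

from dataclasses import dataclass

@dataclass
class ConfigDPOExperiment:  # hypothetical name, mirroring the excerpt above
    # Mistral instruct fine-tune as the suggested default backbone
    llm_backbone: str = "mistralai/Mistral-7B-Instruct-v0.1"
    # Matching prompting style (template values are an assumption)
    text_prompt_start: str = "<s>[INST] "
    text_answer_separator: str = " [/INST]"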

maxjeblick merged commit db63693 into main on Dec 14, 2023; 5 checks passed.
maxjeblick then deleted the max/dpo2 branch on December 14, 2023 at 09:04.