
Fix misleading variable "epoch" in the training loop of the PPOTrainer doc #1171

Merged: 2 commits into huggingface:main on Jan 8, 2024

Conversation

@Jfhseh (Contributor) commented Jan 2, 2024

The usage of the variable "epoch" is misleading (incorrect) in the original doc: the dataloader does not contain the data for ALL epochs, but for one epoch only. Thus
"for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader))"
is misleading and does not actually store the epoch number.

The correct version comes from the TRL PPO notebook tutorial (https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment-control.ipynb), which uses an outer loop to capture the epochs.

I also posted the question on the forum: https://discuss.huggingface.co/t/confusing-and-possibly-misleading-ppo-trainer-code-from-trl-api-doc-tutorial/67531
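For concreteness, a minimal sketch of the two versions (`ppo_trainer` is assumed to be a configured `PPOTrainer`; the loop bodies are elided):

```python
from tqdm import tqdm

# Original doc version: "epoch" is really just the batch index within a
# single pass over the dataloader.
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    ...  # generation, rewards, ppo_trainer.step, ppo_trainer.log_stats

# Notebook version: an outer loop counts the epochs; the inner loop
# iterates over the batches of one epoch.
for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
    for batch_id, batch in tqdm(enumerate(ppo_trainer.dataloader)):
        ...
```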

```python
stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
ppo_trainer.log_stats(stats, batch, rewards)
for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
    for batch_id, batch in tqdm(enumerate(ppo_trainer.dataloader)):
```
A Member commented on the diff above:

Since we don't use the batch_id, we could remove the enumerate and batch_id altogether.

@Jfhseh (author) replied:
Got it. Yes, I removed the batch_id.
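With `enumerate` and `batch_id` dropped, as agreed above, the loop header becomes:

```python
for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
    for batch in tqdm(ppo_trainer.dataloader):
```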

@lvwerra (Member) commented Jan 4, 2024

Sounds good to me. Note that the reason people referred to this part as an epoch is because it constitutes a PPO optimization epoch. But I agree that it's a bit confusing. Left a minor comment.
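To unpack the terminology in lvwerra's note (my reading, assuming TRL's `PPOConfig`; not an excerpt from the docs):

```python
# Two senses of "epoch" around this loop:
#
# 1. A dataset epoch: one full pass over ppo_trainer.dataloader, which is
#    what the outer loop in the fixed doc counts.
#
# 2. A PPO optimization epoch: each ppo_trainer.step() call internally
#    optimizes the policy for config.ppo_epochs passes over the same
#    rollout batch. This is the sense in which the old inner loop was
#    called an "epoch".
```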

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@younesbelkada (Contributor) left a review:

Thanks for clarifying the docs!

@younesbelkada merged commit ad597db into huggingface:main on Jan 8, 2024
9 checks passed
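For reference, a sketch of the doc's full training loop after this change. The `generation_kwargs`, `tokenizer`, and `reward_fn` names are illustrative placeholders (the batch field names follow TRL's sentiment examples); only the loop structure and the `step`/`log_stats` calls come from this PR:

```python
from tqdm import tqdm
import torch

for epoch in tqdm(range(ppo_trainer.config.ppo_epochs), "epoch: "):
    for batch in tqdm(ppo_trainer.dataloader):
        query_tensors = batch["input_ids"]

        # Generate responses from the policy model; generation_kwargs is a
        # placeholder dict of sampling settings.
        response_tensors = ppo_trainer.generate(query_tensors, **generation_kwargs)
        batch["response"] = tokenizer.batch_decode(response_tensors)

        # Score each (query, response) pair; reward_fn is a hypothetical
        # reward-model wrapper returning one float per text.
        texts = [q + r for q, r in zip(batch["query"], batch["response"])]
        rewards = [torch.tensor(score) for score in reward_fn(texts)]

        # PPO optimization step and logging, as in the snippet under review.
        stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
        ppo_trainer.log_stats(stats, batch, rewards)
```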
jondurbin pushed a commit to jondurbin/trl that referenced this pull request on Jan 8, 2024:
Fix misleading variable "epoch" from PPOTrainer Doc. (huggingface#1171)
lapp0 pushed a commit to lapp0/trl that referenced this pull request on May 10, 2024:
Fix misleading variable "epoch" from PPOTrainer Doc. (huggingface#1171)