
Make training_epoch_end behave like validation_epoch_end #1357

Merged (4 commits) Apr 3, 2020

Conversation

@jbschiratti (Contributor) commented Apr 3, 2020

What does this PR do?

This PR fixes #914.

The LightningModule class currently implements a validation_epoch_end method but no training_epoch_end method. The CHANGELOG.md file suggests that, from version 0.7.0, the training_end method (now training_step_end) is to be renamed to training_epoch_end. However, this is not satisfactory, as training_epoch_end would still not behave like validation_epoch_end.

This PR ensures that training_epoch_end and validation_epoch_end have the same behavior (for instance, allowing the user to log metrics only once, at the end of each epoch).
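To illustrate the behavior the PR enables, here is a hedged sketch of the kind of reduction such an epoch-end hook performs: it receives the list of outputs returned by training_step over the epoch and collapses them into a single logged metric. This is plain Python (no torch or pytorch_lightning imports), and the function and key names are illustrative, not the library's exact API.

```python
# Hypothetical sketch of what an epoch-end hook does: reduce the list of
# per-step outputs into one epoch-level metric dict. In Lightning (at the
# time of this PR), values under the 'log' key are sent to the logger.

def training_epoch_end(outputs):
    """Average the per-step 'loss' values into a single epoch metric."""
    losses = [out["loss"] for out in outputs]
    avg_loss = sum(losses) / len(losses)
    return {"log": {"train_loss_epoch": avg_loss}}

step_outputs = [{"loss": 1.0}, {"loss": 0.5}, {"loss": 0.0}]
print(training_epoch_end(step_outputs))
# {'log': {'train_loss_epoch': 0.5}}
```

The point of the PR is that this reduction runs once per epoch over the accumulated step outputs, mirroring what validation_epoch_end already did for validation steps.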

PR review

Anyone in the community is free to review the PR once the tests have passed 👍

@pep8speaks commented Apr 3, 2020

Hello @jbschiratti! Thanks for updating this PR.

Line 575:111: E501 line too long (118 > 110 characters)

Comment last updated at 2020-04-03 11:15:16 UTC

@mergify mergify bot requested a review from a team April 3, 2020 08:20
@Borda Borda added bug Something isn't working feature Is an improvement or enhancement labels Apr 3, 2020
@Borda Borda added this to the 0.7.2 milestone Apr 3, 2020
@Borda (Member) left a comment

Great catch! thx 🚀 pls add a note to changelog...

Review threads: pytorch_lightning/trainer/training_loop.py (2, resolved)
@Borda Borda requested review from jeffling, MattPainter01 and a team April 3, 2020 08:43
@justusschock justusschock requested a review from a team April 3, 2020 08:48
@mergify bot (Contributor) commented Apr 3, 2020

This pull request is now in conflict... :(

@ethanwharris (Member) left a comment

Thanks for the PR. Unfortunately, there's a potential memory leak here: the process_output function doesn't detach the tensors, so when I run your code, each tensor in the OrderedDict given to on_training_epoch_end still has an associated grad_fn. To resolve this, we need a way to detach any tensors in the output, perhaps based on process_output.
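The fix the reviewer suggests can be sketched as a recursive walk that detaches every tensor-like leaf in a (possibly nested) output structure. The sketch below is an assumption about the shape of such a helper, not the code the PR actually merged; a tiny stand-in class mimics a tensor's .detach() so the example is self-contained (with real torch, the duck-typed hasattr check would match torch.Tensor).

```python
# Hypothetical recursive-detach helper: returns a copy of `value` with
# .detach() called on every tensor-like leaf, so no grad_fn (and hence no
# autograd graph) is kept alive across the epoch.

def recursive_detach(value):
    if hasattr(value, "detach"):          # tensor-like leaf
        return value.detach()
    if isinstance(value, dict):           # recurse into dict values
        return {k: recursive_detach(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):  # recurse into sequences
        return type(value)(recursive_detach(v) for v in value)
    return value                          # plain Python value: keep as-is

class FakeTensor:
    """Stand-in for torch.Tensor; detached tensors have grad_fn = None."""
    def __init__(self, grad_fn="AddBackward"):
        self.grad_fn = grad_fn
    def detach(self):
        return FakeTensor(grad_fn=None)

out = {"loss": FakeTensor(), "log": {"acc": FakeTensor()}}
clean = recursive_detach(out)
print(clean["loss"].grad_fn, clean["log"]["acc"].grad_fn)
# None None
```

Detaching matters because a tensor that still carries a grad_fn keeps its whole backward graph reachable, so accumulating such tensors over an epoch can grow memory unboundedly.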

Review threads: pytorch_lightning/core/lightning.py (resolved), pytorch_lightning/trainer/training_loop.py (resolved)
@Borda Borda requested a review from ethanwharris April 3, 2020 11:26
@Borda Borda added the ready PRs ready to be merged label Apr 3, 2020
@jbschiratti (Contributor, Author) commented

CircleCI failed on test_ddp_all_dataloaders_passed_to_fit; however, AFAIK, this does not seem to be related to the PR. Is it?

@Borda Borda merged commit 868b172 into Lightning-AI:master Apr 3, 2020
@jbschiratti (Contributor, Author) commented

Thanks @Borda, @ethanwharris and @justusschock 👍🎉!

@awaelchli (Contributor) left a comment

This is great! One obvious thing that was missing is now finally here :))

I would like to point out some things that were overlooked.

Review threads: pytorch_lightning/core/lightning.py (4, resolved)
@Borda (Member) commented Apr 3, 2020

@awaelchli thank you for your careful reading. May we apply your suggestions in a follow-up PR?

@awaelchli (Contributor) commented

Yeah, sure :)
By the way, is it not possible to add more commits to this branch and merge again?

@Borda (Member) commented Apr 3, 2020

Well, you would need to fork jbschiratti:fix_914 (another forked repo), so I would suggest starting from the actual master, as this is already merged...

jbschiratti added a commit to jbschiratti/pytorch-lightning that referenced this pull request Apr 3, 2020
@jbschiratti jbschiratti mentioned this pull request Apr 3, 2020
williamFalcon pushed a commit that referenced this pull request Apr 3, 2020
* Doc fixes from #1357 (awaelchli's comments) + changelog.

* Fix indentation.

* Add blank line to fix doc build?
alexeykarnachev pushed a commit to alexeykarnachev/pytorch-lightning that referenced this pull request Apr 4, 2020
…I#1357)

* Make training_epoch_end behave like validation_epoch_end + minor fixes in docstrings.

* Minor fixes (Borda's comments).

* Detach tensors in batch_output (to avoid possible memory leak) + doc fix.

Co-authored-by: Jean-Baptiste SCHIRATTI <jean-baptisteschiratti@MacBook-Pro-de-Jean-Baptiste.local>
alexeykarnachev pushed a commit to alexeykarnachev/pytorch-lightning that referenced this pull request Apr 4, 2020
* Doc fixes from Lightning-AI#1357 (awaelchli's comments) + changelog.

* Fix indentation.

* Add blank line to fix doc build?
@Borda Borda mentioned this pull request Apr 4, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Jun 7, 2020
* Doc fixes from Lightning-AI#1357 (awaelchli's comments) + changelog.

* Fix indentation.

* Add blank line to fix doc build?
@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
Successfully merging this pull request may close these issues.

Log training metrics for each epoch
6 participants