
Refactor setup_training and remove test_mode #5388

Merged: 14 commits merged into master from ref/setup_training on Jan 13, 2021
Conversation

@rohitgr7 (Contributor) commented Jan 6, 2021

What does this PR do?

Clean up and split setup_training into setup_trainer & setup_training to avoid calling the on_pretrain_routine_* hooks while testing. Removed the redundant test_mode flag; use trainer.testing instead. Also a few minor updates to make sure tests pass locally.
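To illustrate the shape of the change, a minimal sketch (the names setup_trainer, setup_training, trainer.testing, and the on_pretrain_routine_* hooks come from this PR; the SimpleTrainer class and the method bodies are hypothetical):

```python
class SimpleTrainer:
    """Hypothetical, stripped-down trainer illustrating the split."""

    def __init__(self):
        self.testing = False  # replaces the removed test_mode flag
        self.model = None

    def setup_trainer(self, model):
        # Setup shared by .fit() and .test(): attach the model,
        # configure loggers, callbacks, etc.
        self.model = model

    def setup_training(self, model):
        # Reached only from .fit(), so the pretrain hooks no longer
        # fire when running .test()
        self.setup_trainer(model)
        model.on_pretrain_routine_start()
        # ... restore weights, run sanity checks ...
        model.on_pretrain_routine_end()
```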

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Jan 6, 2021

Codecov Report

Merging #5388 (2e65f6a) into master (83b1ff4) will decrease coverage by 0%.
The diff coverage is 98%.

@@           Coverage Diff           @@
##           master   #5388    +/-   ##
=======================================
- Coverage      93%     93%    -0%     
=======================================
  Files         135     135            
  Lines       10015    9863   -152     
=======================================
- Hits         9338    9173   -165     
- Misses        677     690    +13     

pep8speaks commented Jan 7, 2021

Hello @rohitgr7! Thanks for updating this PR.

Line 148:13: W503 line break before binary operator
Line 152:13: W503 line break before binary operator

Comment last updated at 2021-01-08 20:02:13 UTC
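For reference, W503 fires when a continuation line begins with a binary operator; an illustrative snippet (not from this PR):

```python
tests_passed, docs_updated = True, True

# Flagged by W503: the binary operator starts the continuation line
ready = (tests_passed
         and docs_updated)

# Not flagged by W503: break after the operator instead
ready = (tests_passed and
         docs_updated)
```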

@rohitgr7 marked this pull request as ready for review January 7, 2021 20:30
@rohitgr7 changed the title from "Refactor setup_training" to "Refactor setup_training and remove test_mode" Jan 7, 2021
@rohitgr7 added the refactor and bug (Something isn't working) labels Jan 7, 2021
@rohitgr7 added this to the 1.1.x milestone Jan 7, 2021
@Borda added the design (Includes a design discussion), checkpointing (Related to checkpointing), and feature (Is an improvement or enhancement) labels and removed the checkpointing label Jan 7, 2021
@awaelchli (Contributor) commented:

@rohitgr7 regarding accelerators, please have a look at #4510 and #5385; we are working on some large-scale refactoring.

@rohitgr7 (Contributor, Author) commented Jan 8, 2021

@awaelchli I see. I'll remove the other changes in the accelerators except the setup_training -> setup_trainer rename. Will that be OK?

@tchaton (Contributor) left a comment:

Great work! Could you check self.logger_connector.set_stage? It duplicates this information; it would be better to have a higher-level state for it. set_stage was introduced because validation can be performed during training, so the store needs to change accordingly.

It would also be good to have a trainer.training = True attribute and not rely on the model for it.

Review threads (resolved):
  pytorch_lightning/trainer/trainer.py
  pytorch_lightning/trainer/training_loop.py (outdated)
  pytorch_lightning/accelerators/accelerator.py
  pytorch_lightning/trainer/trainer.py
@rohitgr7 (Contributor, Author) commented Jan 8, 2021

> add internal phase for the trainer as enum for Train/Valid/Test as we seem to have LoggerStages

@Borda need help with this

@rohitgr7 requested a review from Borda January 8, 2021 20:07
@Borda (Member) commented Jan 8, 2021

> add internal phase for the trainer as enum for Train/Valid/Test as we seem to have LoggerStages
>
> @Borda need help with this

I made a draft for it here #5419 🐰
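For a rough idea of what such an enum could look like, a hypothetical sketch (the real draft lives in #5419; the names here are illustrative only):

```python
from enum import Enum

class TrainerPhase(Enum):
    """Hypothetical internal phase enum, analogous to LoggerStages."""
    TRAIN = "train"
    VALID = "valid"
    TEST = "test"

# A single trainer-level source of truth, e.g.
#     trainer.phase = TrainerPhase.TEST
# instead of scattered booleans such as trainer.testing.
```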

@Borda added the ready (PRs ready to be merged) label Jan 11, 2021
@awaelchli mentioned this pull request Jan 11, 2021
@rohitgr7 (Contributor, Author) commented:

@SeanNaren @tchaton mind reviewing?

@rohitgr7 enabled auto-merge (squash) January 13, 2021 19:48
@rohitgr7 merged commit d916973 into master Jan 13, 2021
@rohitgr7 deleted the ref/setup_training branch January 13, 2021 20:30
@@ -412,6 +412,46 @@ def __init__(
# Callback system
self.on_init_end()

def setup_trainer(self, model: LightningModule):
Copy link
Contributor

@ananthsub ananthsub Jan 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this refactor broke this functionality:

    trainer = Trainer(resume_from_checkpoint=ckpt)
    trainer.test()

I don't see where self.trainer.checkpoint_connector.restore_weights() is called in the evaluation loop or in the trainer. Before this change it happened as part of the TrainLoop's setup_training.

In the Trainer's run_evaluation I don't see any calls to the checkpoint connector either. Is that intentional?

@rohitgr7 (Contributor, Author) replied:

This is done in setup_training here: https://github.com/PyTorchLightning/pytorch-lightning/pull/5388/files#diff-6b21474ed45079f01dfa45ee3b9d40d23efe693c9c005ce897e25384ef425349

As of now the restore is done only while calling .fit(). Previously it was happening during .test() too, reloading the resume_from_checkpoint state and model weights, which was wrong. For your use case, I opened an issue about optionally reloading the callback states.
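In other words, the restore now lives only on the fit path; extending the earlier sketch (hypothetical structure, with checkpoint_connector.restore_weights taken from the comment above):

```python
class SimpleTrainer:
    def fit(self, model):
        # setup_training (fit path only) is where
        # checkpoint_connector.restore_weights() now runs
        self.setup_training(model)
        ...

    def test(self, model=None):
        # setup_trainer only: .test() no longer restores the
        # resume_from_checkpoint state or model weights
        self.setup_trainer(model)
        ...
```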

@justusschock mentioned this pull request Feb 2, 2021
Borda pushed a commit that referenced this pull request Feb 4, 2021
* ref and fix call for on_pretrained_routine

* avoid failing tests

* unnecessary_call

* unnecessary call in accelerators

* tmpdir

* rm test_mode

* pep

* updates

* more ref

* Revert "more ref"

This reverts commit 5d9e95f.

* more refac

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Labels
bug (Something isn't working), design (Includes a design discussion), feature (Is an improvement or enhancement), ready (PRs ready to be merged), refactor
Projects
None yet

10 participants