
[Fix] Ensure we set the eval/train flag correctly on accelerator model #6877

Merged
merged 10 commits into from
Apr 8, 2021

Conversation

SeanNaren
Contributor

@SeanNaren SeanNaren commented Apr 7, 2021

What does this PR do?

Fixes #6876.

This PR is ready, with more information underneath.

We also require facebookresearch/fairscale#587 to be merged and included in the next release, but since we do not rely on the upstream release yet, this fix isn't going to change any FairScale-related tests on CI. This will probably also block #6152, as we move from our own fork to the latest FairScale release.

In terms of testing, once #6152 is merged to deprecate the current Pipe implementation, this fix will come to fruition, as the FairScale version will rely on the upstream release.

Overall, however, I noticed that in many places we call train/eval on just the LightningModule, not the accelerator-wrapped model. In the case of custom implementations like SDP (ShardedDataParallel), this is an issue. I swapped this to use self.model, which recursively sets the flag on the LightningModule when it is wrapped.
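A minimal pure-Python sketch of why this matters, mimicking torch.nn.Module's recursive train()/eval() behaviour (the Module class below is a simplified stand-in, not the real nn.Module, and the wrapper setup is hypothetical):

```python
class Module:
    """Simplified stand-in for torch.nn.Module's training-flag logic."""
    def __init__(self):
        self.training = True
        self._children = []

    def train(self, mode=True):
        self.training = mode
        for child in self._children:   # recurse like nn.Module.train()
            child.train(mode)
        return self

    def eval(self):
        return self.train(False)


lightning_module = Module()            # stand-in for the LightningModule
wrapped = Module()                     # stand-in for e.g. an SDP wrapper
wrapped._children.append(lightning_module)

wrapped.eval()                         # flag recurses into the inner module
assert lightning_module.training is False

lightning_module.train()               # inner-only call misses the wrapper
assert wrapped.training is False       # the wrapper's own flag is stale
```

Calling eval() on the wrapper updates every submodule, while calling it on the inner module alone leaves the wrapper's flag out of sync, which is the bug this PR fixes.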

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@SeanNaren SeanNaren added this to the 1.2.x milestone Apr 7, 2021
@SeanNaren SeanNaren changed the title Fix/sharded eval [Fix] Ensure we set the eval/train flag correctly on accelerator model Apr 7, 2021
@SeanNaren SeanNaren self-assigned this Apr 7, 2021
@SeanNaren SeanNaren added the bug Something isn't working label Apr 7, 2021
@SeanNaren SeanNaren requested a review from a team April 7, 2021 20:41
Contributor

@ananthsub ananthsub left a comment

great catch!!! let's add comments here for when to use self.model vs when to directly use self.lightning_module

@SeanNaren
Contributor Author

great catch!!! let's add comments here for when to use self.model vs when to directly use self.lightning_module

Thanks @ananthsub!

The difference between the names can be quite confusing, but in most cases the choice is clear: if you want to access LightningModule internals/functions, use self.trainer.lightning_module. For FSDP, or wherever you'd like to access the wrapped model, use self.trainer.model (or self.model inside the trainer) :)
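A rough illustration of that convention, using stub classes in place of the real Trainer, LightningModule, and accelerator wrapper (all names below are simplified stand-ins, not actual Lightning internals):

```python
class StubLightningModule:
    """Stands in for the user's LightningModule."""
    learning_rate = 0.01              # a user-defined internal attribute


class StubWrapper:
    """Stands in for an accelerator wrapper such as ShardedDataParallel."""
    def __init__(self, module):
        self.module = module
        self.training = True

    def eval(self):
        # real wrappers recurse into self.module as well
        self.training = False


class StubTrainer:
    def __init__(self):
        self.lightning_module = StubLightningModule()
        self.model = StubWrapper(self.lightning_module)


trainer = StubTrainer()

# Accessing LightningModule internals: go through trainer.lightning_module
lr = trainer.lightning_module.learning_rate

# Toggling train/eval flags: use trainer.model so wrappers are included
trainer.model.eval()
assert trainer.model.training is False
```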

@SeanNaren
Contributor Author

Ah great point, forgot lightning_module.eval() and lightning_module.train() are hooks. What's advised here? I could change the default to self.trainer.model.eval() and self.trainer.model.train() within the hooks or I could add this additional logic outside the hook. cc @awaelchli @tchaton

@tchaton
Contributor

tchaton commented Apr 8, 2021

Ah great point, forgot lightning_module.eval() and lightning_module.train() are hooks. What's advised here? I could change the default to self.trainer.model.eval() and self.trainer.model.train() within the hooks or I could add this additional logic outside the hook. cc @awaelchli @tchaton

Sounds good to me !

@SeanNaren
Contributor Author

Thanks @tchaton! Just a slight mistake: I should be referring to the module hooks on_test_model_train, on_test_model_eval and on_predict_model_eval, which will change.

Contributor

@awaelchli awaelchli left a comment

looks good <3

Contributor

@tchaton tchaton left a comment

LGTM !

@SeanNaren SeanNaren added the _Will label Apr 8, 2021
@carmocca
Contributor

carmocca commented Apr 8, 2021

@SeanNaren the test is failing

@SeanNaren
Contributor Author

The test showed that we called the predict hook twice; I've remedied this, and it now follows the same behaviour as run_evaluation and run_train.

@codecov

codecov bot commented Apr 8, 2021

Codecov Report

Merging #6877 (b8d473f) into master (19e67d1) will decrease coverage by 6%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #6877    +/-   ##
=======================================
- Coverage      91%     86%    -6%     
=======================================
  Files         193     193            
  Lines       12299   12539   +240     
=======================================
- Hits        11252   10779   -473     
- Misses       1047    1760   +713     

@lexierule lexierule disabled auto-merge April 8, 2021 18:04
@lexierule lexierule merged commit 742c48e into master Apr 8, 2021
@lexierule lexierule deleted the fix/sharded_eval branch April 8, 2021 18:04
@SeanNaren SeanNaren mentioned this pull request Apr 12, 2021
SeanNaren pushed a commit that referenced this pull request Apr 13, 2021
[Fix] Ensure we set the eval/train flag correctly on accelerator model (#6877)

* Ensure we move the model to eval mode before running evaluation

* Ensure we set the flag appropriately across all stages

* Add test, move hooks logic

* Apply same fix to the validate loop

* Update pytorch_lightning/trainer/trainer.py

* Fix function name

* Fix order, add predict

* Shorten the name

* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function

* Use hook, remove double call

(cherry picked from commit 742c48e)
lexierule pushed a commit that referenced this pull request Apr 14, 2021
[Fix] Ensure we set the eval/train flag correctly on accelerator model (#6877)

* Ensure we move the model to eval mode before running evaluation

* Ensure we set the flag appropriately across all stages

* Add test, move hooks logic

* Apply same fix to the validate loop

* Update pytorch_lightning/trainer/trainer.py

* Fix function name

* Fix order, add predict

* Shorten the name

* Fix input dm, drop duplicate on predict start hook call, as it's called in the setup function

* Use hook, remove double call

(cherry picked from commit 742c48e)
Development

Successfully merging this pull request may close these issues.

Latest FairScale + Sharded Training crashes using default trainer parameters
7 participants