
[Fix] Move log value to cpu. #4592

Merged: 19 commits merged into master from bugfix/4556_gpu_memory_log on Nov 10, 2020
Conversation

tchaton (Contributor) commented Nov 9, 2020

What does this PR do?

This PR fixes a GPU memory leak.
It adds a move_metrics_to_cpu parameter to the Trainer that forces all result objects to be moved to the CPU.

Fixes #4556
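
A minimal usage sketch of the new flag (the LightningModule below is a hypothetical stand-in; move_metrics_to_cpu is the parameter this PR adds, and gpus=1 is the Trainer device API of this Lightning version):

    import pytorch_lightning as pl
    import torch

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(10, 1)

        def training_step(self, batch, batch_idx):
            loss = self.layer(batch).mean()
            # The logged tensor lives on the GPU; with move_metrics_to_cpu=True
            # the accumulated result objects are moved off the device.
            self.log("train_loss", loss, on_epoch=True)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    trainer = pl.Trainer(gpus=1, max_epochs=1, move_metrics_to_cpu=True)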

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified; bugfixes should be included in bug-fix release milestones (m.f.X) and features should be included in (m.X.b) releases.

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton changed the title move value to cpu to save memory [Fix] Move log value to cpu. Nov 9, 2020
@tchaton tchaton self-assigned this Nov 9, 2020
@tchaton tchaton added the priority: 0 High priority task label Nov 9, 2020
@tchaton tchaton modified the milestones: 1.0.x, 1.1 Nov 9, 2020
@tchaton tchaton added the logger Related to the Loggers label Nov 9, 2020
@tchaton tchaton marked this pull request as ready for review November 9, 2020 18:58
@@ -136,6 +136,10 @@ def log(
         if sync_dist and isinstance(value, (torch.Tensor, numbers.Number)):
             value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)
 
+        # no need to keep on gpu
+        if isinstance(value, torch.Tensor) and value.is_cuda:
+            value = value.cpu()
A reviewer (Contributor) commented:

Shouldn't there also be a detach()?

tchaton (Contributor, Author) replied:

        # no metrics should be logged with graphs
        if not enable_graph and isinstance(value, torch.Tensor):
            value = value.detach()

        # sync across workers when using distributed training
        sync_fn = sync_fn or sync_ddp_if_available
        if sync_dist and isinstance(value, (torch.Tensor, numbers.Number)):
            value = sync_fn(value, group=sync_dist_group, reduce_op=sync_dist_op)

        # no need to keep on gpu
        if isinstance(value, torch.Tensor) and value.is_cuda:
            value = value.cpu()

detach is called just before.
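
For context, a standalone sketch (not code from this PR) of why both detach() and cpu() matter when logged values are cached across steps:

    import torch

    logged = []
    for step in range(100):
        x = torch.randn(32, 10, device="cuda", requires_grad=True)
        loss = (x ** 2).mean()
        # detach() drops the reference to the autograd graph so it can be
        # freed; cpu() releases the GPU memory held by the cached value.
        # Appending `loss` directly would keep both alive for the whole epoch.
        logged.append(loss.detach().cpu())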

pytorch_lightning/trainer/training_loop.py (review thread resolved, outdated)
Vozf (Contributor) commented Nov 9, 2020

Have you also checked the same thing with validation loop?

tchaton (Contributor, Author) commented Nov 9, 2020

Hey @Vozf,

Did you join Slack? If not, please do, so we can quickly resolve this bug :)

Best,
T.C

Vozf (Contributor) commented Nov 9, 2020

Joined now as Vozf. I don't have any more questions, except whether it also works for validation and testing, i.e. whether it fixes the bug in general :)

codecov bot commented Nov 9, 2020

Codecov Report

Merging #4592 (5be63ce) into master (7e08b0d) will decrease coverage by 0%.
The diff coverage is 75%.

@@          Coverage Diff           @@
##           master   #4592   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         116     116           
  Lines        8883    8907   +24     
======================================
+ Hits         8278    8291   +13     
- Misses        605     616   +11     

s-rog (Contributor) commented Nov 10, 2020

Do we need a test for this as well? (assert tensor device cpu)
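
A hypothetical test along those lines (the test name and skip decorator are assumptions, not code from the PR; it mirrors the two checks in the diff above):

    import pytest
    import torch

    @pytest.mark.skipif(not torch.cuda.is_available(), reason="requires GPU")
    def test_log_value_moved_to_cpu():
        value = torch.tensor(1.0, device="cuda", requires_grad=True)
        # Same two steps as in the log() diff above.
        value = value.detach()
        if isinstance(value, torch.Tensor) and value.is_cuda:
            value = value.cpu()
        assert value.device == torch.device("cpu")
        assert not value.requires_grad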

Borda (Member) commented Nov 10, 2020

Let's try the .detach() instead?

justusschock (Member) left a comment:

This looks fine to me, just some minor questions.

pytorch_lightning/trainer/trainer.py (review thread resolved)
pytorch_lightning/utilities/memory.py (review thread resolved, outdated)
tchaton and others added 2 commits November 10, 2020 18:34
Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
tchaton and others added 2 commits November 10, 2020 19:38
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
@SeanNaren SeanNaren merged commit 514cb22 into master Nov 10, 2020
@SeanNaren SeanNaren deleted the bugfix/4556_gpu_memory_log branch November 10, 2020 21:13
@edenlightning edenlightning modified the milestones: 1.1, 1.0.x Nov 10, 2020
@SeanNaren SeanNaren modified the milestones: 1.0.x, 1.1 Nov 11, 2020
SeanNaren (Contributor) commented:

@edenlightning swapping this to 1.1 since it's tied to the logging refactor.

rohitgr7 pushed a commit that referenced this pull request Nov 21, 2020
* move value to cpu to save memory

* update

* move to cpu

* try something

* update

* update

* add back out_dict.update({k: v})

* add move_metrics_to_cpu

* update

* Update pytorch_lightning/utilities/memory.py

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>

* resolve comments

* Update pytorch_lightning/core/step_result.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

Co-authored-by: Justus Schock <12886177+justusschock@users.noreply.github.com>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Sean Naren <sean.narenthiran@gmail.com>
Labels: logger (Related to the Loggers), priority: 0 (High priority task)
Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • Gpu memory leak with self.log on_epoch=True (#4556)

8 participants