
Simple fix for memory leak on GPU0 #1094

Closed
wants to merge 7 commits

Conversation

shubhamagarwal92
Contributor

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #958, which is related to a memory leak on GPU0 caused by a fairly old PyTorch issue around setting the CUDA device.

  1. Set the device directly while finding the root GPU in determine_root_gpu_device in trainer.distrib_parts.py (see the sketch after this list).
  2. Also remove the hack from trainer.evaluation_loop.
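
For context, a minimal sketch of the idea behind item 1, with a simplified signature rather than the exact diff: resolve the root GPU index and immediately pin the CUDA context to it, so later allocations do not silently land on GPU 0.

    import torch

    def determine_root_gpu_device(gpus):
        # Simplified sketch of the change described in item 1: take the first
        # requested GPU as the root device and set the CUDA device right away.
        root_gpu = gpus[0]
        torch.cuda.set_device(root_gpu)
        return root_gpu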

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃 lol. Yeah I did!

@pep8speaks

pep8speaks commented Mar 8, 2020

Hello @shubhamagarwal92! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-31 16:50:39 UTC

@shubhamagarwal92
Contributor Author

@williamFalcon Could you suggest why these tests are failing?

@awaelchli
Contributor

@shubhamagarwal92 likely because you are calling torch.cuda.* when CUDA is not available, or the test expects to run on CPU.

@shubhamagarwal92
Contributor Author

shubhamagarwal92 commented Mar 9, 2020

> @shubhamagarwal92 likely because you are calling torch.cuda.* when CUDA is not available, or the test expects to run on CPU.

@awaelchli I am assuming that we would call determine_root_gpu_device only when CUDA is available. Anyway, I have added a simple check root_gpu >= 0 before setting the device with torch.cuda.set_device().

Omitting this usually causes the memory leak on GPU0, as suggested by Soumith here. I am not sure if this is the right place to set the device, though.
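
For reference, a sketch of the guard described above, assuming root_gpu holds the resolved GPU index:

    import torch

    root_gpu = 0  # e.g. the value resolved for gpus=[0]

    # Only touch the CUDA API when a valid GPU index was found. Note that this
    # still assumes CUDA is available, which is what the CI question above is about.
    if root_gpu >= 0:
        torch.cuda.set_device(torch.device("cuda", root_gpu))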

Any suggestions?

@Borda
Member

Borda commented Mar 11, 2020

hey there, we have added a GPU CI test, so could we kindly ask you to rebase/merge master, which will trigger these tests so we do not need to run them manually... Thx for your understanding 🤖

@Borda Borda added this to the 0.7.2 milestone Mar 12, 2020
@Borda Borda added the bug Something isn't working label Mar 12, 2020
@shubhamagarwal92
Contributor Author

shubhamagarwal92 commented Mar 12, 2020

@Borda I have rebased/merged master! Could you suggest any reason for the failures?

@shubhamagarwal92 shubhamagarwal92 mentioned this pull request Mar 12, 2020
@Borda
Member

Borda commented Mar 14, 2020

@shubhamagarwal92 could you please rebase on master instead, because right now the PR shows all recent master edits and it is hard to check your changes...
It seems that there is some issue when asking for the number of GPUs on a no-GPU machine.

@shubhamagarwal92
Contributor Author

@Borda I don't know why I am getting multiple replicas of the same commit when I try to merge this repo's master. For now, I have force-pushed the commit from before the merge.

Basically, in this PR, I am trying to set the device here

@Borda Borda requested a review from jeffling March 17, 2020 13:54
# set cuda device to root gpu
# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
Contributor

We can remove this now

# related to https://github.com/PyTorchLightning/pytorch-lightning/issues/958
# Refer solution: https://github.com/pytorch/pytorch/issues/9871#issuecomment-408304190
# root_device = torch.device("cuda", root_gpu)
root_device = (torch.device("cuda", root_gpu) if root_gpu >= 0 else torch.device("cpu"))
Contributor

What would root_device be set to if the user wants CPU? None? -1? Maybe we should check for that explicitly.

Contributor Author

If the user wants CPU, the function determine_root_gpu_device should not be called (?)

Contributor

In this case it is getting called with gpus=None and returns None (see the first lines of determine_root_gpu_device), so your else torch.device("cpu") branch is never relevant.
I don't see anything wrong with just
root_device = torch.device("cuda", root_gpu)
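
As a rough illustration of that point (hypothetical code mirroring the described behavior, not the actual source):

    def determine_root_gpu_device(gpus):
        # For a CPU run (gpus=None) the function returns None right away,
        # so the proposed else-branch producing torch.device("cpu") never runs.
        if gpus is None:
            return None
        return gpus[0]

    root_gpu = determine_root_gpu_device(None)  # -> None, not -1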

Contributor

but I think the device should be set outside this function anyway

Contributor Author

@awaelchli where do you suggest?

Contributor

I would search the code base for occurrences of self.root_gpu and check whether the device needs to be set in each case.
Maybe consider setting it directly after here, but I am not sure if that's the best place.

Contributor

better to ask the core team on this :)

@williamFalcon
Contributor

@Borda is this dead?

@Borda
Member

Borda commented Mar 30, 2020

@shubhamagarwal92 @awaelchli how is it going?

@awaelchli
Contributor

If I understand correctly, the fix to the memory leak is simply to add a torch.cuda.set_device(root_device) call before training. The question is where the best place to put it is. I made a suggestion (see the comment above).
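
Roughly, the shape of the fix being discussed (a sketch with a hypothetical helper, since the right location is exactly what is still open):

    import torch

    def pin_root_device(root_gpu):
        # Pin the CUDA context to the root GPU before any training-time CUDA
        # work, so the context and its memory are not created on GPU 0.
        if root_gpu is not None:
            torch.cuda.set_device(root_gpu)

    if torch.cuda.is_available():
        pin_root_device(root_gpu=0)  # e.g. for Trainer(gpus=[0])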

@shubhamagarwal92
Contributor Author

@awaelchli @Borda @williamFalcon

I have incorporated the suggestions. See the latest commit 6cba621

Could you please have a look? Thanks.

@Borda
Member

Borda commented Mar 31, 2020

Seems to fail on the returned None:

        root_device = (torch.device("cuda", self.root_gpu)
>                      if self.root_gpu >= 0 else torch.device("cpu"))
E       TypeError: '>=' not supported between instances of 'NoneType' and 'int'

@williamFalcon
Contributor

@shubhamagarwal92 this is great. Mind fixing the issue so we can get it into 0.7.2? :)

@shubhamagarwal92
Contributor Author

@williamFalcon I updated the if-check in 2c7b802

but I do not understand why the automated checks are failing; for example, this one requires a CUDA device:
https://github.com/PyTorchLightning/pytorch-lightning/pull/1094/checks?check_run_id=549408374#step:9:252

I can only tell that this is probably not the right place to set the device (?) Any suggestions?

if isinstance(self.data_parallel_device_ids, list):
    root_gpu = self.data_parallel_device_ids[0]
    root_device = (torch.device("cuda", root_gpu)
                   if root_gpu else torch.device("cpu"))
    torch.cuda.set_device(root_device)
Contributor

also need to add tpu device...
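
As an aside, a None-safe variant of the guard in the hunk above (a sketch, not the PR's final diff): GPU index 0 is a valid device but falsy, so an explicit None check is safer than a bare if root_gpu.

    import torch

    def set_root_device(data_parallel_device_ids):
        # Sketch: check for None explicitly so that GPU index 0 is not
        # mistaken for "no GPU" by a truthiness test.
        if isinstance(data_parallel_device_ids, list) and data_parallel_device_ids:
            root_gpu = data_parallel_device_ids[0]
            if root_gpu is not None:
                torch.cuda.set_device(torch.device("cuda", root_gpu))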

@williamFalcon
Contributor

williamFalcon commented Apr 2, 2020

@shubhamagarwal92 that's not where the device should be set. I put it in the correct place in #1349.

Let's just merge that one. I think it will still credit you as a co-author

@shubhamagarwal92
Contributor Author

> @shubhamagarwal92 that's not where the device should be set. I put it in the correct place in #1349.
>
> Let's just merge that one. I think it will still credit you as a co-author

Thanks for doing this. Ahh, now I see where the issue was. Should we close this PR then?

williamFalcon pushed a commit that referenced this pull request Apr 3, 2020
williamFalcon added a commit that referenced this pull request Apr 3, 2020
* SA: for #958: set torch cuda device when finding root

* SA: for #958: removing root gpu hack in trainer/evaluation_loop

* SA: setting torch cuda device

* comment line too long

* check if root gpu exists or available

* Incorporating suggestions on #1094

* since root gpu returns none instead of -1 for cpu

* undo changes

* fixed dp memory thing

Co-authored-by: Shubham Agarwal <shubhamagarwal92@gmail.com>
alexeykarnachev pushed a commit to alexeykarnachev/pytorch-lightning that referenced this pull request Apr 4, 2020
tullie pushed a commit to tullie/pytorch-lightning that referenced this pull request Jun 7, 2020
@Borda Borda modified the milestones: v0.7., v0.7.x Apr 18, 2021
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

Process runs on more GPUs than specified
7 participants