
fix: nb is set total number of devices, when nb is -1. #4209

Merged
merged 13 commits into from
Oct 29, 2020

Conversation

ssaru
Contributor

@ssaru ssaru commented Oct 17, 2020

What does this PR do?

Fixes #4207 (issue)

Before
auto_select_gpus

If auto_select_gpus is enabled and gpus is an integer, available GPUs are picked automatically.

However, if gpus is set to -1, a MisconfigurationException("GPUs requested but none are available.") is raised.

./pytorch_lightning/tuner/auto_gpu_select.py

def pick_multiple_gpus(nb):    
    picked = []
    # Even though the total number of devices is 1 or more, the for loop returns an empty list when `nb` is -1.
    for _ in range(nb):
        picked.append(pick_single_gpu(exclude_gpus=picked))

    return picked
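The failure mode is just Python's range semantics: `range(-1)` is empty, so the loop body never runs and an empty list is returned. A minimal illustration (pure Python, no GPU needed; `pick_multiple` is a simplified stand-in for the real function):

```python
# range() over a negative count produces zero iterations,
# so the "pick a GPU" loop silently does nothing when nb == -1.
def pick_multiple(nb):
    picked = []
    for _ in range(nb):
        picked.append(f"gpu:{len(picked)}")
    return picked

print(list(range(-1)))    # → []
print(pick_multiple(-1))  # → []  (even if devices exist)
print(pick_multiple(2))   # → ['gpu:0', 'gpu:1']
```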

After
When nb is -1, nb is now set to the total number of devices.

./pytorch_lightning/tuner/auto_gpu_select.py

def pick_multiple_gpus(nb):
    nb = torch.cuda.device_count() if nb == -1 else nb
    picked = []
    for _ in range(nb):
        picked.append(pick_single_gpu(exclude_gpus=picked))

    return picked
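The effect of the one-line change can be seen without real GPUs by stubbing the device count. In this sketch, `device_count` mimics `torch.cuda.device_count()` and `pick_single_gpu` is simplified to return the lowest non-excluded index (both are illustrative stand-ins, not the library code):

```python
# Hypothetical stand-ins so the fixed logic runs without CUDA.
def device_count():
    return 4  # pretend the machine has 4 GPUs

def pick_single_gpu(exclude_gpus):
    # Simplified: just return the lowest index not yet picked.
    for i in range(device_count()):
        if i not in exclude_gpus:
            return i
    raise RuntimeError("No GPUs available.")

def pick_multiple_gpus(nb):
    # The fix: -1 now expands to "all devices" before the loop runs.
    nb = device_count() if nb == -1 else nb
    picked = []
    for _ in range(nb):
        picked.append(pick_single_gpu(exclude_gpus=picked))
    return picked

print(pick_multiple_gpus(-1))  # → [0, 1, 2, 3]
print(pick_multiple_gpus(2))   # → [0, 1]
```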

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

(cc. @inmoonlight)

@mergify mergify bot requested a review from a team October 17, 2020 17:28
@codecov

codecov bot commented Oct 17, 2020

Codecov Report

Merging #4209 into master will decrease coverage by 0%.
The diff coverage is n/a.

@@          Coverage Diff           @@
##           master   #4209   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         111     111           
  Lines        8067    8127   +60     
======================================
+ Hits         7488    7536   +48     
- Misses        579     591   +12     

@Borda Borda added the bug Something isn't working label Oct 17, 2020
@lezwon
Contributor

lezwon commented Oct 18, 2020

@ssaru Nice find :) I was wondering if we could just ignore auto_select_gpus when gpus is set to -1. That way we can avoid calling pick_multiple_gpus. What do you think?

@ssaru
Contributor Author

ssaru commented Oct 18, 2020

@lezwon
Actually, I don't know why auto_select_gpus and gpus are separate options, because all available GPUs are already picked when gpus=-1 is passed to the Trainer's gpus parameter.

Moreover, I think the two options auto_select_gpus and gpus confuse people. If we want to pick all available GPUs automatically in PyTorch Lightning, which of the 10 cases should we choose, out of the combinations of 5 kinds of gpus values and 2 kinds of auto_select_gpus values?

It's too ambiguous.

My suggestion is to integrate the auto_select_gpus option into the gpus option. Exceptional cases could then be resolved with warning messages or interactively.

P.S.
I can't find the logic that detects GPUs already in use by other processes in /pytorch_lightning/tuner/auto_gpu_select.py.
I think pick_multiple_gpus below should detect GPUs that are in use, but it doesn't.
Is this a feature to be released in the next version?

Thank you for your effort 👍

import torch


def pick_multiple_gpus(nb):
    nb = torch.cuda.device_count() if nb == -1 else nb
    picked = []
    for _ in range(nb):
        picked.append(pick_single_gpu(exclude_gpus=picked))

    return picked


def pick_single_gpu(exclude_gpus: list):
    for i in range(torch.cuda.device_count()):
        if i in exclude_gpus:
            continue
        # Try to allocate on device:
        device = torch.device(f"cuda:{i}")
        try:
            torch.ones(1).to(device)
        except RuntimeError:
            continue
        return i
    raise RuntimeError("No GPUs available.")

(cc. @inmoonlight)

@lezwon
Contributor

lezwon commented Oct 19, 2020

auto_select_gpus comes in handy when some GPUs might not be available. In that scenario, your change does make sense, as attempting to pick all GPUs might throw an error. Mind adding a test for this change?

Also, how do you propose we combine auto_select_gpus and gpus? As I see it, auto_select_gpus serves a specific purpose. @rohitgr7 any thoughts on this?

@inmoonlight

inmoonlight commented Oct 19, 2020

@lezwon

As stated in the document (described below), how about this way?

When specifying number of gpus as an integer gpus=k, setting the trainer flag auto_select_gpus=True will automatically help you find k gpus that are not occupied by other processes. This is especially useful when GPUs are configured to be in “exclusive mode”, such that only one process at a time can access them.

# specifies all GPUs regardless of their availability
Trainer(gpus=-1, auto_select_gpus=False)

# specifies all "available" GPUs; if only one GPU is not occupied, uses that one GPU
Trainer(gpus=-1, auto_select_gpus=True)

(cc. @ssaru)

@lezwon
Contributor

lezwon commented Oct 19, 2020

@inmoonlight yep. This PR would do that. If @ssaru can add a test, we can go ahead and merge it. :)

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
Contributor

@s-rog s-rog left a comment


LGTM, just needs a test!

Edit:
#2852 Might have to look at this pending PR as well to make sure everything works

@inmoonlight

inmoonlight commented Oct 20, 2020

@s-rog @lezwon Sounds great! In addition to the test code, I suggest updating the documentation as well, since the term "available" might be misleading.


     1. test combinations of the `auto_select_gpus` and `gpus` options using Trainer
     2. test the `pick_multiple_gpus` function directly

Refs: #4207
@pep8speaks

pep8speaks commented Oct 20, 2020

Hello @ssaru! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-29 09:25:05 UTC

@ssaru
Contributor Author

ssaru commented Oct 20, 2020

@lezwon

I finished my work as below:

  1. Added a ResourceWarning to avoid confusion when auto_select_gpus=True and gpus=0 are set together
  2. Added the test code
  3. Updated the Select GPU devices section of the MULTI-GPU TRAINING docs

(cc. @inmoonlight )

@ssaru
Contributor Author

ssaru commented Oct 22, 2020

@s-rog

You suggested:

    I think for now, add a test with:
    - gpus=-1, auto_select_gpus=True
    - @pytest.mark.skipif(torch.cuda.device_count() > 0, reason="test requires no GPUs be available")
    - a pytest.raises similar to your test_pick_multiple_gpus test
    and you should be good to go! 👍

The above test scenario throws MisconfigurationException("GPUs requested but none are available.") at pytorch_lightning/utilities/device_parser.py:75 (shown below), because the machine doesn't have a GPU.

pytorch_lightning/utilities/device_parser.py:75

gpus = _normalize_parse_gpu_string_input(gpus)
gpus = _normalize_parse_gpu_input_to_list(gpus)
if not gpus:
    raise MisconfigurationException("GPUs requested but none are available.")

For more exact testing, it would be better to instantiate Trainer with auto_select_gpus=True and gpus=-1 while another process is training.

What do you think?
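The failure path in device_parser.py can be sketched in a self-contained way. Here `MisconfigurationException` and `parse_gpu_ids` are simplified stand-ins for the real pytorch_lightning code (the actual library also handles strings and lists), showing only why an empty GPU list raises on a GPU-less machine:

```python
# Simplified stand-ins for pytorch_lightning's device parsing, for illustration.
class MisconfigurationException(Exception):
    pass

def parse_gpu_ids(gpus, available_gpus):
    # gpus=-1 means "all available"; with zero devices this becomes [].
    if gpus == -1:
        gpus = list(range(available_gpus))
    elif isinstance(gpus, int):
        gpus = list(range(gpus))
    if not gpus:
        raise MisconfigurationException("GPUs requested but none are available.")
    return gpus

# With no CUDA devices, requesting all GPUs raises, as described above.
try:
    parse_gpu_ids(-1, available_gpus=0)
except MisconfigurationException as err:
    print(err)  # → GPUs requested but none are available.

# With devices present, -1 expands to every index.
print(parse_gpu_ids(-1, available_gpus=2))  # → [0, 1]
```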

     1. improve code and test code readability
     2. move the example in multi_gpu.rst up to trainer/__init__.py

 Refs: #4207
@s-rog
Contributor

s-rog commented Oct 22, 2020

@ssaru sounds good!

@ssaru ssaru requested review from ydcjeff and removed request for a team October 24, 2020 07:30
Contributor

@ydcjeff ydcjeff left a comment


Thank you for the PR. LGTM!

@ydcjeff ydcjeff added the ready PRs ready to be merged label Oct 26, 2020
Contributor

@awaelchli awaelchli left a comment


great extension, would be nice to have it in the changelog :)

@SkafteNicki SkafteNicki merged commit b459fd2 into Lightning-AI:master Oct 29, 2020
Borda pushed a commit that referenced this pull request Nov 4, 2020
* fix: `nb` is set total number of devices, when nb is -1.

 Refs: #4207

* feat: add test code
     1. test combination `auto_select_gpus`, `gpus` options using
Trainer
     2. test `pick_multiple_gpus` function directly

Refs: #4207

* docs: modify contents in `Select GPU devices`

 Refs: #4207

* refactor: reflect the result of review

 Refs: #4207

* refactor: reflect the result of review

 Refs: #4207

* Update CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <55400948+s-rog@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
(cherry picked from commit b459fd2)
Labels
bug Something isn't working ready PRs ready to be merged
Development

Successfully merging this pull request may close these issues.

<auto_select_gpus=True, gpus=-1> raise MisconfigurationException("GPUs requested but none are available.")