fix: `nb` is set to the total number of devices when `nb` is -1 #4209
Conversation
Codecov Report

@@           Coverage Diff            @@
##           master   #4209    +/-  ##
=======================================
- Coverage      93%     93%     -0%
=======================================
  Files         111     111
  Lines        8067    8127    +60
=======================================
+ Hits         7488    7536    +48
- Misses        579     591    +12
@ssaru Nice find :) I was wondering if we could just ignore
@lezwon Moreover, I think having two options is too ambiguous. My suggestion is to integrate them.

P.S. Thank you for your effort 👍

import torch

def pick_multiple_gpus(nb):
    # -1 means "use every device", so expand it to the total device count
    nb = torch.cuda.device_count() if nb == -1 else nb
    picked = []
    for _ in range(nb):
        picked.append(pick_single_gpu(exclude_gpus=picked))
    return picked

def pick_single_gpu(exclude_gpus: list):
    for i in range(torch.cuda.device_count()):
        if i in exclude_gpus:
            continue
        # Try to allocate on device:
        device = torch.device(f"cuda:{i}")
        try:
            torch.ones(1).to(device)
        except RuntimeError:
            continue
        return i
    raise RuntimeError("No GPUs available.")

(cc. @inmoonlight)
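The exclude-list loop above can be illustrated without a GPU. The sketch below is torch-free: `FakeCuda` is a hypothetical stand-in for `torch.cuda` (not part of any real library), used only to show how `-1` expands to the device count and how already-picked or occupied devices are skipped:

```python
class FakeCuda:
    """Hypothetical stand-in for torch.cuda, for illustration only."""

    def __init__(self, total, busy):
        self.total = total      # number of devices on the machine
        self.busy = set(busy)   # indices whose allocation fails (occupied)

    def device_count(self):
        return self.total

    def try_allocate(self, i):
        # Mimics torch.ones(1).to(device) failing on an occupied GPU.
        if i in self.busy:
            raise RuntimeError("CUDA out of memory")


def pick_single_gpu(cuda, exclude_gpus):
    for i in range(cuda.device_count()):
        if i in exclude_gpus:
            continue
        try:
            cuda.try_allocate(i)
        except RuntimeError:
            continue
        return i
    raise RuntimeError("No GPUs available.")


def pick_multiple_gpus(cuda, nb):
    # The fix discussed here: -1 now means "all devices" instead of
    # iterating range(-1), which would select nothing.
    nb = cuda.device_count() if nb == -1 else nb
    picked = []
    for _ in range(nb):
        picked.append(pick_single_gpu(cuda, exclude_gpus=picked))
    return picked


# 4 devices, device 1 occupied: asking for 2 skips the busy one.
print(pick_multiple_gpus(FakeCuda(total=4, busy=[1]), 2))   # [0, 2]
# 3 free devices, nb=-1 expands to all of them.
print(pick_multiple_gpus(FakeCuda(total=3, busy=[]), -1))   # [0, 1, 2]
```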
Also, how do you propose we combine them?
As stated in the document (described below), how about this way?

# specifies all GPUs regardless of their availability
Trainer(gpus=-1, auto_select_gpus=False)

# specifies all "available" GPUs. If only one GPU is not occupied, uses that one.
Trainer(gpus=-1, auto_select_gpus=True)

(cc. @ssaru)
@inmoonlight yep. This PR would do that. If @ssaru can add a test, we can go ahead and merge it. :)
LGTM, just needs a test!
Edit:
#2852 Might have to look at this pending PR as well to make sure everything works
1. test the combination of the `auto_select_gpus` and `gpus` options using Trainer
2. test the `pick_multiple_gpus` function directly

Refs: #4207
Hello @ssaru! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-10-29 09:25:05 UTC
I finished my work as below.

(cc. @inmoonlight)
The above test scenario throws a MisconfigurationException("GPUs requested but none are available.") at pytorch_lightning/utilities/device_parser.py:75 (quoted below), because the machine doesn't have a GPU.

pytorch_lightning/utilities/device_parser.py:75

gpus = _normalize_parse_gpu_string_input(gpus)
gpus = _normalize_parse_gpu_input_to_list(gpus)
if not gpus:
    raise MisconfigurationException("GPUs requested but none are available.")

For more exact testing, it's better to instantiate Trainer with

What do you think?
@ssaru sounds good!
Thank you for the PR. LGTM!
great extension, would be nice to have it in the changelog :)
* fix: `nb` is set to the total number of devices when `nb` is -1. Refs: #4207
* feat: add test code
  1. test the combination of the `auto_select_gpus` and `gpus` options using Trainer
  2. test the `pick_multiple_gpus` function directly
  Refs: #4207
* docs: modify contents in `Select GPU devices`. Refs: #4207
* refactor: reflect the result of review. Refs: #4207
* refactor: reflect the result of review. Refs: #4207
* Update CHANGELOG.md

Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Roger Shieh <55400948+s-rog@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>
(cherry picked from commit b459fd2)
What does this PR do?
Fixes #4207 (issue)
Before
With `auto_select_gpus`, if you set `gpus=-1`, a MisconfigurationException("GPUs requested but none are available.") is raised in ./pytorch_lightning/tuner/auto_gpu_select.py.
After
`nb` is set to the total number of devices when `nb` is -1, in ./pytorch_lightning/tuner/auto_gpu_select.py.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃
(cc. @inmoonlight)