
GPUs requested but none are available #3542

Closed
shanhaidexiamo opened this issue Sep 18, 2020 · 12 comments
Labels: question (Further information is requested), waiting on author (Waiting on user action, correction, or update), won't fix (This will not be worked on)

Comments

@shanhaidexiamo

My server has 8 GPUs, but when I use the Trainer class with gpus=-1 I get the runtime error GPUs requested but none are available. When I check with torch directly, the GPU count is 8 and cuda.is_available is True. Can anyone tell me what's wrong?

@shanhaidexiamo shanhaidexiamo added the question label Sep 18, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@Borda
Member

Borda commented Sep 18, 2020

Mind checking whether you installed the CUDA build of PyTorch, e.g. via torch.cuda.is_available()?

@Borda Borda added the waiting on author label Sep 18, 2020
@awaelchli
Contributor

awaelchli commented Sep 19, 2020

> cuda.is_available is true

@shanhaidexiamo what does torch.cuda.device_count() return? cuda.is_available alone does not tell us if the GPUs are visible to torch.
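One common way GPUs become invisible to torch is a mangled CUDA_VISIBLE_DEVICES. As a rough illustration (a hypothetical helper, not part of torch or Lightning), the function below mimics how the CUDA runtime interprets that variable:

```python
import os

def visible_gpu_indices(physical_count, env=None):
    """Return the GPU indices the CUDA runtime would expose, given the
    machine's physical GPU count and an environment mapping.

    An unset CUDA_VISIBLE_DEVICES exposes all GPUs. An empty or malformed
    value hides them, which makes torch.cuda.device_count() return 0 even
    though nvidia-smi still lists the devices.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return list(range(physical_count))
    indices = []
    for token in value.split(","):
        token = token.strip()
        if not token.isdigit():    # empty or garbage token: enumeration stops
            break
        idx = int(token)
        if idx >= physical_count:  # invalid device id also stops enumeration
            break
        indices.append(idx)
    return indices

print(visible_gpu_indices(8, {}))                               # [0, 1, ..., 7]
print(visible_gpu_indices(8, {"CUDA_VISIBLE_DEVICES": ""}))     # []
print(visible_gpu_indices(8, {"CUDA_VISIBLE_DEVICES": "2,5"}))  # [2, 5]
```

Note that an empty string or a stray invalid index silently hides devices, so checking the variable is worth doing alongside device_count().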

@kyoungrok0517

kyoungrok0517 commented Oct 1, 2020

> @shanhaidexiamo what does torch.cuda.device_count() return? cuda.is_available alone does not tell us if the GPUs are visible to torch.

Sorry to interrupt, but I'm experiencing the same issue. device_count() returns 2 in my case, and I'm running on a GCP instance with two V100s. I had no problem on my own server, so this is strange (though the GPU model is different). pytorch-lightning==0.9.0

This is the env

* CUDA:
        - GPU:
                - Tesla V100-SXM2-16GB
                - Tesla V100-SXM2-16GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.18.5
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 0.9.0
        - tqdm:              4.47.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.3
        - version:           #24-Ubuntu SMP Sat Sep 5 02:07:13 UTC 2020

GCP command to create a similar instance:

gcloud beta compute --project <project> instances create <instance-name> \
    --zone=us-central1-a \
    --machine-type=n1-standard-16 \
    --subnet=default \
    --network-tier=PREMIUM \
    --maintenance-policy=TERMINATE \
    --service-account=<service_account> \
    --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/trace.append \
    --accelerator=type=nvidia-tesla-v100,count=2 \
    --image=ubuntu-2004-focal-v20200917 \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=200GB \
    --boot-disk-type=pd-standard \
    --boot-disk-device-name=thesis-1 \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --reservation-affinity=any

@Borda
Member

Borda commented Oct 1, 2020

@kyoungrok0517 mind sharing the output of the following, just to check that PyTorch and the drivers are properly installed?
python -c "import torch ; print(torch.cuda.device_count())"

@kyoungrok0517

kyoungrok0517 commented Oct 1, 2020

@Borda That returns 2 as expected. If I use the gpus=-1 argument, Lightning fails as I described, but if I give the exact number of GPUs (e.g. gpus=2) it works fine. I'm using ddp as the backend.
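For context, here is a minimal sketch of the kind of normalization the gpus argument needs (hypothetical code, not Lightning's actual implementation). One plausible way gpus=-1 can break while gpus=2 works is if the value reaches a DDP child process as the string "-1" and the coercion step is missed, so the "all GPUs" branch never fires:

```python
def normalize_gpus(gpus, available):
    """Normalize a Trainer-style `gpus` argument into an explicit list of
    device indices. Hypothetical sketch, not Lightning's real parser.

    gpus=-1 or "-1" -> all available GPUs
    gpus=N          -> the first N GPUs
    gpus=[0, 2]     -> exactly those indices
    """
    if isinstance(gpus, str):
        # Values forwarded to a spawned child (env var / CLI) arrive as
        # strings; without this coercion, "-1" == -1 is False in Python
        # and the "all GPUs" branch below is silently skipped.
        gpus = int(gpus)
    requested = gpus != 0 and gpus != []
    if gpus == -1:
        gpus = list(range(available))
    elif isinstance(gpus, int):
        gpus = list(range(gpus))
    if requested and not available:
        # The symptom reported in this issue: a non-empty request in a
        # process that sees no CUDA devices.
        raise RuntimeError("GPUs requested but none are available")
    return gpus

print(normalize_gpus(-1, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
print(normalize_gpus("2", 8))  # [0, 1]
```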

@Borda
Member

Borda commented Oct 1, 2020

@kyoungrok0517 good catch, mind sending a PR?

@awaelchli
Contributor

@williamFalcon is working on the parsing of gpus for DDP. The error is most likely because they are not correctly passed to or parsed in the child process.

@kyoungrok0517

> @kyoungrok0517 good catch, mind sending a PR?

Hmm... am I too late to send a PR? I've never done this before, so I'd be grateful if you could guide me through the process. Should I open a pull request even though I don't know how to fix it? Please let me know, I'd like to help.

@stale

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix label Nov 1, 2020
@stale stale bot closed this as completed Nov 8, 2020
@fishbotics

fishbotics commented Apr 5, 2021

Hi all,

I am now having the same issue. I'm running my job on a server with 8 GPUs. When I run python -c "import torch ; print(torch.cuda.device_count())" I get 8, but when I run with gpus=-1 ([EDIT] corrected from gpus=1) and auto_select_gpus=True, I get an error saying that there are no GPUs available.

@Borda , @awaelchli : do you know if this was ever fixed?

Thanks!
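For what it's worth, auto_select_gpus has to probe each device and skip any it cannot allocate on, so it can report no available GPUs even when device_count() is 8 — for instance when other jobs hold every device in exclusive compute mode. A hypothetical sketch of that selection logic (is_free stands in for the real allocation probe, which tries to place a small tensor on each device):

```python
def auto_select_gpus(n, device_count, is_free):
    """Pick the first `n` free GPU indices. Hypothetical sketch of what an
    auto-selection step must roughly do; not Lightning's implementation.

    `is_free(idx)` stands in for probing the device: the real check
    attempts a tiny allocation and treats failure as "busy".
    """
    picked = []
    for idx in range(device_count):
        if is_free(idx):
            picked.append(idx)
        if len(picked) == n:
            return picked
    # Every device was probed and too few were free: the request fails
    # even though device_count() is nonzero, matching the symptom above.
    raise RuntimeError(
        f"requested {n} GPUs but only {len(picked)} of {device_count} are free"
    )

print(auto_select_gpus(3, 8, lambda i: i % 2 == 0))  # [0, 2, 4]
```

If that is what is happening here, checking what else is running on the node (e.g. with nvidia-smi) should show the occupied devices.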

@awaelchli
Contributor

@fishbotics What else is running on the GPUs?

The original issue reported here was fixed by #4209, I believe. Adding gpus=1 and auto_select_gpus=True works for me with the pl_examples.
