Limit GPU binding with CUDA_VISIBLE_DEVICES or so #12

Open
d355 opened this issue Dec 7, 2020 · 3 comments

d355 commented Dec 7, 2020

Hello, and first of all I'd like to thank you for this project; it's still the best way we've found to work around NVIDIA cooling issues.

To the point: since the latest NVIDIA driver updates, nvidia-smi lists every context that has been created rather than only the primary contexts. So where we previously got output like this:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3541      G   /usr/libexec/Xorg                   8MiB |
|    1   N/A  N/A      3543      G   /usr/libexec/Xorg                   8MiB |
|    2   N/A  N/A      3544      G   /usr/libexec/Xorg                   8MiB |
|    3   N/A  N/A      3546      G   /usr/libexec/Xorg                   8MiB |
|    4   N/A  N/A      3548      G   /usr/libexec/Xorg                   8MiB |
|    5   N/A  N/A      3549      G   /usr/libexec/Xorg                   8MiB |
|    6   N/A  N/A      3550      G   /usr/libexec/Xorg                   8MiB |
|    7   N/A  N/A      3552      G   /usr/libexec/Xorg                   8MiB |
+-----------------------------------------------------------------------------+

...now we have:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    400553      G   /usr/libexec/Xorg                   8MiB |
|    0   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    0   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400554      G   /usr/libexec/Xorg                   8MiB |
|    1   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    1   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400555      G   /usr/libexec/Xorg                   8MiB |
|    2   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    2   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400556      G   /usr/libexec/Xorg                   8MiB |
|    3   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    3   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400557      G   /usr/libexec/Xorg                   8MiB |
|    4   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    4   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400558      G   /usr/libexec/Xorg                   8MiB |
|    5   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    5   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    6   N/A  N/A    400559      G   /usr/libexec/Xorg                   8MiB |
|    6   N/A  N/A    400560      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400553      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400554      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400555      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400556      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400557      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400558      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400559      G   /usr/libexec/Xorg                   0MiB |
|    7   N/A  N/A    400560      G   /usr/libexec/Xorg                   8MiB |
+-----------------------------------------------------------------------------+

Is it possible to limit each Xorg process to a single GPU with something like the CUDA_VISIBLE_DEVICES environment variable (https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/)?

I guess some minor changes are needed somewhere around this line so that each Xorg instance runs like CUDA_VISIBLE_DEVICES=1 Xorg ... .

andyljones (Owner) commented

Lawd, that's a mess.

To triage this properly, are there any consequences other than nvidia-smi being very tall?

Also I'm unlikely to personally upgrade the drivers any time soon, and I don't like to fix bugs blind. I think the fix should be as simple as

p = Popen(xorgargs, env={'CUDA_VISIBLE_DEVICES': display[1:]})

Would you be able to make this change yourself and test it out for a few days? If this particular change fails, try adding a breakpoint() immediately before the line; it'll drop you into pdb and you can have a poke around.
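
A minimal sketch of the suggested change, offered with some caution. Passing env= to Popen replaces the child's whole environment instead of adding to it, so this variant copies the parent environment first and then sets the one variable. The helper name xorg_env is hypothetical, and display is assumed to be an X display string such as ':3'. Whether Xorg honours CUDA_VISIBLE_DEVICES at all is not established in this thread.

```python
import os


def xorg_env(display, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    `display` is assumed to look like ':3'; the leading colon is
    stripped with display[1:], as in the suggestion above.
    """
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env['CUDA_VISIBLE_DEVICES'] = display[1:]
    return env


# Hypothetical usage inside the launcher (xorgargs as in the suggestion above):
#     p = Popen(xorgargs, env=xorg_env(display))
if __name__ == '__main__':
    print(xorg_env(':3', base_env={'PATH': '/usr/bin'}))
```

Copying the environment keeps variables such as PATH available to the child process, which a bare env={...} dictionary would drop.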

d355 commented Dec 7, 2020

Thank you! Sure, I'll check it out and report the results here.

chenz97 commented Jun 28, 2021

Hi, I tried the modification suggested above (passing env={'CUDA_VISIBLE_DEVICES': display[1:]} to Popen), but it didn't work. I have found another workaround.
In the source of coolgpus, just replace

buses = gpu_buses()

with the bus ID of the specific GPU you would like coolgpus to act on, e.g.

buses = ['00000000:65:00.0']

The bus ID can be read from the output of nvidia-smi. Hope this helps.
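
The same workaround can be sketched without hardcoding the whole list, by filtering the detected buses against the ones you want. This is only an illustration: select_buses is a hypothetical helper, not part of coolgpus, and the second bus ID in the usage comment is made up for the example.

```python
def select_buses(all_buses, wanted):
    """Keep only the detected bus IDs that appear in `wanted`.

    Comparison is case-insensitive because bus IDs contain hex digits.
    Raises ValueError if a wanted bus ID was not detected, so a typo
    does not silently leave a GPU unmanaged.
    """
    wanted_lower = {w.lower() for w in wanted}
    selected = [b for b in all_buses if b.lower() in wanted_lower]
    missing = wanted_lower - {b.lower() for b in selected}
    if missing:
        raise ValueError('bus IDs not found: ' + ', '.join(sorted(missing)))
    return selected


# Hypothetical usage in place of the hardcoded list:
#     buses = select_buses(gpu_buses(), ['00000000:65:00.0'])
if __name__ == '__main__':
    detected = ['00000000:65:00.0', '00000000:B3:00.0']  # second ID is invented
    print(select_buses(detected, ['00000000:65:00.0']))
```

Raising on an unknown bus ID is a design choice for this sketch; returning an empty list would also be reasonable if you prefer the tool to start with no GPUs selected.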
