Not iterating through all GPUs in system (3) #7
There's some troubleshooting advice on the main page; mind stepping through it? If you're uncomfortable with
I didn't understand what pdb should do; I saw no changes. I added a print, and this is what it shows:
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 01:40:49 2020
Try running
in a terminal. Adding
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The first 2 GPUs have no physical displays connected; the 3rd one has a physical display connected. However, even without a physical display, the problem is the same.
I'd like to mention that I was able to set all fans in some random attempts; I don't even remember what I was doing. I rebooted a couple of times, reinstalled the driver, replugged the physical display, and the script passed a few times, but after closing it and running it again, it hangs.
I don't get this...
coolgpus will not work on any system with a display or any system that's expecting a display. You'll need to remove the display, restart, SSH in, and toy around until it works for you.
Take a look at the pdb docs. It's no longer useful for this particular problem, but overall it's one of the most useful tools in Python programming. Especially the
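For reference, a bare-bones way to use it looks something like this; nothing here is coolgpus-specific, and `suspect_call` is just a made-up stand-in for whatever call you want to inspect:

```python
import pdb

def suspect_call(x):
    # Hypothetical stand-in for whichever call you think is hanging.
    return x * 2

pdb.set_trace()   # execution pauses here; `n` steps, `p x` prints a variable, `c` continues
result = suspect_call(21)
print(result)
```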
I did not have a display connected initially; it makes no difference. I connected one on the last attempt to see what changes. Well, nothing.
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :3
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :4
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The system is a plain CentOS 7 box with the NVIDIA driver installed for headless transcoding; there is nothing custom on it to debug.
"Unable to find display on any available system" |
Not to discourage you too much, but it feels like you're hoping I have more knowledge about this than I do. Your system absolutely has something to debug, as you can tell by the fact that a thing you want to do isn't working as you'd expect it to. If you want to push forward with this, a general loop should be:
It's hard! This might take hours or days! You might have to learn huge amounts about subjects that are totally irrelevant, just to check one possible fix! It probably won't be worth it! But, frankly, the only other choice is to give up and decide you don't care that much about the functionality coolgpus provides.
"Unable to find display on any available system" is literally what it says: no displays attached, either physical or virtual. Doesn't your script create virtual displays to set fan speeds? Apparently it does, because I'm able to run "nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2" (although no output comes back) while your script is open in another SSH window. So I don't understand what exactly about the "Unable to find display on any available system" message has to be fixed, or why. It is expected and not related to the issue described.
OK - in that case, use pdb, use print statements - figure out where the script is actually hanging, then make sure you can replicate it yourself, then Google around to figure out what's causing that hanging. You may need to replicate the xserver setup the script is doing in order to replicate the hanging. You might want to tear the xserver setup bit of coolgpus out into your own script, then run that and leave it running in the background while you experiment. There are lots of ways forward; it just requires a bit of ingenuity!
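If it helps, a minimal sketch of that kind of standalone experiment might look like the following. It assumes Xorg and nvidia-settings are on your PATH, you run it as root, and an xorg.conf with cool-bits already exists at the path shown; all of that is a guess at the setup, not something coolgpus itself guarantees:

```python
# Sketch: launch a bare X server on display :0, then try the fan-control
# command against it with a timeout instead of letting it hang forever.
import subprocess
import time

xorg = subprocess.Popen(['Xorg', ':0', '-config', '/etc/X11/xorg.conf'])
time.sleep(5)  # crude wait for the server to come up
try:
    result = subprocess.run(
        ['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0'],
        capture_output=True, text=True, timeout=30)
    print(result.stdout)
    print(result.stderr)
except subprocess.TimeoutExpired:
    print('nvidia-settings hung for 30s - this is the behaviour to dig into')
finally:
    xorg.terminate()
```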
I did print already and posted it earlier; it hangs on ['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']. I always have to kill Xorg like this, because it never exits. Reinstalling the xorg server does not help.
OK, let's make it easier. I need to adjust only the last GPU in the list; how do I select just one GPU with this script and skip the others?
Right! That's the spirit. The answer is: there's no built-in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs. More generally, you know where the script hangs, but you haven't isolated the aberrant behaviour. You want to be able to enter a series of commands into the terminal and get the same hang. Then you can experiment freely with that series of commands, try different versions, add
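As a purely illustrative sketch of that kind of conditional, something like this could cut the list down to one entry before the fan-control loop runs; the list literal is a stand-in, not the script's real data structure:

```python
# Illustrative only: keep just the GPU you care about out of whatever
# list of identifiers the script builds.
WANTED_INDEX = 2  # the last of your three GPUs

gpus = ['[gpu:0]', '[gpu:1]', '[gpu:2]']  # stand-in for whatever coolgpus discovers
gpus = [g for i, g in enumerate(gpus) if i == WANTED_INDEX]
print(gpus)  # -> ['[gpu:2]']
```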
I think I get the same error when running
At the same time, I am able to run
name: coolgpus
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py38_0
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- ncurses=6.2=he6710b0_0
- openssl=1.1.1d=h7b6447c_4
- pip=20.0.2=py38_1
- python=3.8.1=h0371630_1
- readline=7.0=h7b6447c_5
- setuptools=45.2.0=py38_0
- sqlite=3.31.1=h7b6447c_0
- tk=8.6.8=hbc83047_0
- wheel=0.34.2=py38_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
- pip:
- coolgpus==0.17
Yep, I think @Neolo is right about the ERROR being a symptom of the missing xserver env. I still expect that command to be the source of the hang since it was the last command printed; it's just gonna need more work to get a manual reproduction. I'll be surprised if the env is causing the hang, but it's a good idea since it's an easy thing to check.
So weird. I just removed the contents of /etc/X11/ and ran:
nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=28 --separate-x-screens --enable-all-gpus --use-display-device=none
Using X configuration file: "/etc/X11/xorg.conf".
Right after that, for one time only, I was able to pass fan speeds to the 1st AND 2nd GPUs only; the 3rd hung.
I'm not into Python; tell me what to run, I didn't get it.
Will try that environment tomorrow.
For now, I just used a dirty trick to select the last GPU from the list, which is "burning" right now at 72 °C,
and it sets the speed fine, no hangs.
Installed Miniconda, activated the env, and running "$(which coolgpus) --temp 60 60" just doesn't do anything, not even setting the first GPU at all.
[root@nvidia-2 ~]# conda env create -f env.yml
X.Org X Server 1.19.3
X.Org X Server 1.19.3
X.Org X Server 1.19.3
I am having the same issue and nobody has been able to solve it. What I do is create a custom xorg file that works for me; then, in the NVIDIA app on Ubuntu, I can change the settings of each fan for each GPU via PowerMizer. This is not useful when working over SSH on a headless box, but unfortunately there is no easy-to-use nvidia-settings solution anywhere. So basically this is what worked for me.

How to change GPU fan speeds in Ubuntu:
1. In the applications, open NVIDIA X Server Settings.
2. Select the GPU currently used for display output (should be the GPU in the first PCIe slot).
3. Take note of the Bus ID.
4. Run the following commands:
   sudo nvidia-xconfig --enable-all-gpus
   sudo nvidia-xconfig --cool-bits=28
   sudo reboot
5. After the computer reboots, plug the monitor into the last GPU.
6. Open NVIDIA X Server Settings again.
7. Select the GPU currently used for display output.
8. Take note of the Bus ID.
9. Run sudo nano /etc/X11/xorg.conf. The GPUs will be listed in "Device" sections with formatting similar to this:
   Section "Device"
       Identifier "name"
       Driver "driver"
       ...entries...
   EndSection
10. Identify the GPUs with the Bus IDs that were previously noted.
11. Swap the Bus IDs of the two GPUs (a small script illustrating the swap is sketched after this list).
12. Press Ctrl+X to close "xorg.conf".
13. Press Y to save the file.
14. Press "Enter" without changing the file name.
15. Reboot.

Fan speeds can now be changed from NVIDIA X Server Settings by selecting the Thermal Settings for each GPU and checking the option to "Enable GPU Fan Settings".
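For step 11, a hypothetical helper along these lines could do the swap; the path and PCI bus IDs below are placeholders you'd replace with the ones noted in steps 3 and 8, and it's worth backing the file up first:

```python
# Hypothetical helper: swap the BusID values of two "Device" sections in
# xorg.conf. Example values only; edit to match your own system.
path = '/etc/X11/xorg.conf'
bus_a, bus_b = 'PCI:1:0:0', 'PCI:4:0:0'

with open(path) as f:
    text = f.read()

# Three-step replace via a placeholder so the two IDs end up exchanged.
placeholder = '__SWAP_PLACEHOLDER__'
text = (text.replace(bus_a, placeholder)
            .replace(bus_b, bus_a)
            .replace(placeholder, bus_b))

with open(path, 'w') as f:
    f.write(text)
```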
Newer version, and it works even worse.
(==) Log file: "/var/log/Xorg.2.log", Time: Mon Jan 25 22:02:05 2021
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1
Released fan speed control for GPU at :0
Terminating xserver for display :0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 266, in
**For God's sake... ridiculous. `import subprocess`, and in the function kill_xservers(), on top of it:
ditch the rest of this function. Solved.**
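For anyone reading later: the exact replacement code isn't shown above, but a guess at the kind of change being described might look like this. It is not the commenter's actual code, and it assumes pkill is installed:

```python
# Guess at the described workaround: instead of coolgpus's own X-server
# cleanup, kill_xservers() just blasts any running Xorg processes and
# ignores the exit status (pkill returns non-zero when nothing matched).
import subprocess

def kill_xservers():
    subprocess.call(['pkill', '-f', 'Xorg'])
```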
For anyone still experiencing this issue, I have slapped together a bash script which at least allows setting a fixed fan speed for all GPUs in the system, regardless of whether a monitor is attached. It supports amdgpu too: https://github.com/lavanoid/Linux_GPU_Fan_Control
Not iterating through all GPUs in the system (3 of them); it gets stuck at the 1st in the list and hangs forever. GPUs: 1070, 750 Ti, 950.