nomad agent dev fails to start with 'undefined symbol: nvmlDeviceGetPciInfo_v3' #8303

Closed
shantanugadgil opened this issue Jun 28, 2020 · 20 comments · Fixed by #8353

@shantanugadgil
Contributor

Nomad version

Nomad v0.12.0-beta2 (5b80d4e)
Same with Nomad 0.11.3 GA

Operating system and Environment details

Elementary Linux 5.x (based off Ubuntu 18.04)

uname -a
Linux mynodename 5.3.0-61-generic #55~18.04.1-Ubuntu SMP Mon Jun 22 16:40:20 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: elementary
Description:    elementary OS 5.1.5 Hera
Release:        5.1.5
Codename:       hera

Issue

Nomad agent fails to start with the following error:

nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
nomad: symbol lookup error: nomad: undefined symbol: nvmlDeviceGetPciInfo_v3

Reproduction steps

run "nomad agent -dev"

Job file (if appropriate)

n/a

Nomad Client logs (if appropriate)

Additional information that might be useful:

lspci | grep -i vga
02:00.0 VGA compatible controller: NVIDIA Corporation G98 [Quadro NVS 295] (rev a1)

The drivers were installed with:

ubuntu-drivers autoinstall

Output of nvidia-smi

nvidia-smi

+------------------------------------------------------+
| NVIDIA-SMI 340.108    Driver Version: 340.108        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro NVS 295      Off  | 0000:02:00.0     N/A |                  N/A |
| N/A   63C   P12    N/A /  N/A |     56MiB /   255MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+
@notnoop
Contributor

notnoop commented Jun 30, 2020

Sorry for the slow response. It looks like Nomad requires a more recent driver than the ones bundled with the Linux kernel. Can you try upgrading your driver and let us know if that works?

It seems that Linux is bundling legacy drivers by default. Nomad currently requires more recent versions, like the ones bundled with CUDA 9 or 10.
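A rough way to check whether the installed NVML library exports the symbol named in the error message. This is only a sketch: the library name `libnvidia-ml.so.1` and the helper names `exports_symbol` and `nvml_has_symbol` are assumptions for illustration, not something taken from Nomad's code.

```python
import ctypes


def exports_symbol(lib, name):
    """Return True if the loaded library object `lib` exposes `name`.

    ctypes raises AttributeError when a symbol cannot be resolved.
    """
    try:
        getattr(lib, name)
        return True
    except AttributeError:
        return False


def nvml_has_symbol(name="nvmlDeviceGetPciInfo_v3",
                    soname="libnvidia-ml.so.1"):  # soname is an assumption
    """Return True/False for the symbol, or None if the library won't load."""
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        return None  # NVML is not installed or not on the loader path
    return exports_symbol(lib, name)


if __name__ == "__main__":
    print(nvml_has_symbol())
```

If this prints `False` on a machine with the 340.108 driver, that would be consistent with the `undefined symbol` error above; `None` just means the library could not be loaded at all.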

I've tested nomad against driver 384.111 (bundled with CUDA 9) and that worked:

ubuntu@ip-172-31-26-165:~$ nvidia-smi
Tue Jun 30 17:32:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 00000000:00:03.0 Off |                  N/A |
| N/A   33C    P8    18W / 125W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@ip-172-31-26-165:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
ubuntu@ip-172-31-26-165:~$ ./nomad --version
Nomad v0.12.0-beta2 (5b80d4e638f1a27eee3ca245f8babb115e4c098d)
ubuntu@ip-172-31-26-165:~$ ./nomad agent -dev 2>&1 | head -n5
==> No configuration files loaded
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 127.0.0.1:4646; RPC: 127.0.0.1:4647; Serf: 127.0.0.1:4648
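
For comparing machines, the driver version can be pulled out of the `nvidia-smi` banner. A minimal sketch follows; `TESTED_WORKING` is only the version reported as working in this thread, not a documented minimum, and the function names are made up for illustration.

```python
import re

# Version reported as working in this thread; NOT an official minimum.
TESTED_WORKING = (384, 111)


def driver_version(smi_output):
    """Extract the driver version from nvidia-smi text as a tuple of ints.

    Returns None when no "Driver Version:" field is present.
    """
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", smi_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def at_least_tested(smi_output, tested=TESTED_WORKING):
    """True if the parsed version is >= the version tested in this thread."""
    version = driver_version(smi_output)
    return version is not None and version >= tested


if __name__ == "__main__":
    legacy = "| NVIDIA-SMI 340.108    Driver Version: 340.108        |"
    print(driver_version(legacy), at_least_tested(legacy))
```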

@shantanugadgil
Contributor Author

The way I installed the drivers was ubuntu-drivers autoinstall.
Hopefully installing the latest drivers won't bork my system!!! 🤞

@shantanugadgil
Contributor Author

FWIW, these drivers are not that old:

When I search the official website, I get this:

https://www.nvidia.com/object/product_quadro_nvs_295_us.html


Version: | 340.108
-- | --
Release Date: | 2019.12.23
Operating System: | Linux 64-bit
Language: | English (US)
File Size: | 66.92 MB

I will try to follow the wizard here: https://developer.nvidia.com/cuda-downloads to get the latest

@shantanugadgil
Contributor Author

shantanugadgil commented Jun 30, 2020

Update: looks like this won't happen anytime soon ... way too much to download, ~2 GiB ... 😢

EDIT: I cancelled this operation and tried installing from PPA

@notnoop
Contributor

notnoop commented Jun 30, 2020

It's strange - when I looked for 340.108, I noticed it was marked legacy even though it was released in 2019 - e.g. https://forums.developer.nvidia.com/t/linux-solaris-and-freebsd-driver-340-108-legacy-for-geforce-8-and-9-series/109520#5414137 .

Stepping back a bit, let me clarify the use case. Are you actually planning to use this GPU with Nomad for machine learning/CUDA-like workloads? Or are you trying to start Nomad on a server that just happens to have a GPU that isn't critical to the Nomad use case?

Also, would you mind trying the nomad agent found in https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip ?

@shantanugadgil
Contributor Author

This is not a critical system. I just happened to have an old display card and decided to set it up on an old desktop (which was already running Elementary Linux)

I wouldn't really be using this for any real-world CUDA workloads, though it would be good to have, as I could run trivial CUDA things on my desktop.

That said, if this doesn't fit on the roadmap due to its "non real" use case, I am fine with that. (1)

Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup?
Point (1) is fine, but Nomad not starting at all is super sad. (I should check up on disabling drivers using the client blocklist.)

@shantanugadgil
Contributor Author

Update: I tried adding the NVIDIA PPA and manually installing the "latest" available driver. This broke the NVIDIA driver altogether; I am down to VESA mode, but the agent starts now 🙄.

add-apt-repository ppa:graphics-drivers/ppa
apt update
apt install nvidia-384

@notnoop
Contributor

notnoop commented Jun 30, 2020

I'm very sorry that I got your system borked :(. Also, I fully agree that the nomad agent should function with legacy nvidia drivers - the agent should start normally but without nvidia support. We'll follow up.

One odd thing: in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):

ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages

@shantanugadgil
Contributor Author

> Though, in that case, what would be the proper way to disable nvidia detection altogether during Nomad startup?
> Point (1) is fine, but Nomad not starting at all is super sad. (I should check up on disabling drivers using the client blocklist.)

This brings up a new question in my mind
Q: I am currently unable to disable the device detection altogether for device nvidia-gpu.
I know about disabling drivers via blacklist, but there doesn't seem to be anything equivalent for device plugins, right?

@shantanugadgil
Contributor Author

> I'm very sorry that I got your system borked :(. Also, I fully agree that the nomad agent should function with legacy nvidia drivers - the agent should start normally but without nvidia support. We'll follow up.
>
> One odd thing: in my testing, I noticed that Ubuntu 18.04 offers nvidia-driver-440 (and other versions as well):
>
> ubuntu@ip-172-31-19-213:~$ apt-cache madison nvidia-driver-440
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://us-east-1.ec2.archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages
> nvidia-driver-440 | 440.100-0ubuntu0.18.04.1 | http://security.ubuntu.com/ubuntu bionic-security/restricted amd64 Packages

That's OK, always ready for trial-and-error to get Nomad working! 😄 🛩️

BTW, did you try with the NVIDIA NVS 295 display card? That is my display card. (maybe that matters, I dunno)

For me this is not "Ubuntu" Ubuntu; it is Elementary Linux (a desktop-oriented distro), hence I would prefer to have the display driver working, with the higher resolution, etc.

OK, after refining my apt search ...

apt search nvidia | grep "^nvidia\-driver\-"
nvidia-driver-390/bionic-updates,bionic-security 390.138-0ubuntu0.18.04.1 amd64
nvidia-driver-410/unknown 410.129-0ubuntu1 amd64
nvidia-driver-415/bionic 415.27-0ubuntu0~gpu18.04.2 amd64
nvidia-driver-418/bionic 430.64-0ubuntu0~gpu18.04.1 amd64
nvidia-driver-430/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-435/bionic-updates,bionic 435.21-0ubuntu0.18.04.2 amd64
nvidia-driver-440/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64
nvidia-driver-450/unknown 450.36.06-0ubuntu1 amd64

I will try with 440 now.
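
In the listing above, two package names don't match the version they install (`nvidia-driver-418` shows 430.64 and `nvidia-driver-430` shows 440.100). A small sketch for spotting that in `apt search` output; it assumes the line layout shown above, and the function name is made up for illustration. Why the names and versions differ is not covered here.

```python
import re

# Matches lines like:
#   nvidia-driver-418/bionic 430.64-0ubuntu0~gpu18.04.1 amd64
# capturing the branch in the package name and the major version installed.
LINE = re.compile(r"^nvidia-driver-(\d+)/\S+\s+(\d+)")


def mismatched_packages(lines):
    """Return (name_branch, installed_major) pairs where the two differ."""
    result = []
    for line in lines:
        match = LINE.match(line)
        if match is None:
            continue  # not a driver package line
        name_branch = int(match.group(1))
        installed_major = int(match.group(2))
        if name_branch != installed_major:
            result.append((name_branch, installed_major))
    return result


if __name__ == "__main__":
    sample = [
        "nvidia-driver-418/bionic 430.64-0ubuntu0~gpu18.04.1 amd64",
        "nvidia-driver-440/bionic-updates,bionic-security,bionic 440.100-0ubuntu0.18.04.1 amd64",
    ]
    print(mismatched_packages(sample))
```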

@shantanugadgil
Contributor Author

> Also, would you mind trying the nomad agent found in https://79969-36653430-gh.circle-artifacts.com/0/builds/nomad_linux_amd64.zip ?

This doesn't work with my correctly working display driver, v340.

I will switch to v440 and try again.

$ nvidia-smi
Wed Jul  1 01:08:09 2020
+------------------------------------------------------+
| NVIDIA-SMI 340.108    Driver Version: 340.108        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro NVS 295      Off  | 0000:02:00.0     N/A |                  N/A |
| N/A   69C   P12    N/A /  N/A |     56MiB /   255MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

$ ./nomad --version
Nomad v0.12.0-dev (9f070e16db5c1aa1d28960a209740d584ab4abc0)

$ ./nomad agent -dev
==> No configuration files loaded
==> Starting Nomad agent...
./nomad: symbol lookup error: ./nomad: undefined symbol: nvmlDeviceGetPciInfo_v3

@shantanugadgil
Contributor Author

Newer drivers are not working.
I have reinstalled the supported drivers using ubuntu-drivers autoinstall.
I am back to the higher resolution, etc.
I'll let this be for now, since I need the higher resolution on the desktop.

Though, I wish there was a clean fix for this! :)

@notnoop
Contributor

notnoop commented Jul 21, 2020

@shantanugadgil We just merged an option for disabling the nvidia driver and it should be out in 0.12.1. Thanks for raising the issue.

@shantanugadgil
Contributor Author

: waiting eagerly for 0.12.1 to test on my machine : 😁

@notnoop
Contributor

notnoop commented Jul 21, 2020

For basic testing, you can try the binaries found in https://app.circleci.com/pipelines/github/hashicorp/nomad/10642/workflows/1ff98cc1-e847-434f-aff4-05acfbb6f993/jobs/84842/artifacts along with the config from the PR:

plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}

Please try it and let me know how it goes!
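
One way to try the config above is to save it to a file and pass it to the agent with `-config`. The sketch below only writes the file and builds the command line; the file name `nvidia-disabled.hcl`, the `./nomad` binary path, and the helper names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

# The plugin stanza from this thread, verbatim.
NVIDIA_DISABLED_HCL = '''plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}
'''


def write_config(directory, filename="nvidia-disabled.hcl"):
    """Write the stanza into `directory` and return the file path."""
    path = Path(directory) / filename
    path.write_text(NVIDIA_DISABLED_HCL)
    return path


def agent_command(config_path, nomad_binary="./nomad"):
    """Build the argv for a dev agent that loads the given config file."""
    return [nomad_binary, "agent", "-dev", f"-config={config_path}"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_config(tmp)
        # Printed only; running it requires a build that has this option.
        print(" ".join(agent_command(cfg)))
```

The command is printed rather than executed, since the `enabled` option is only present in builds that include the fix.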

@shantanugadgil
Contributor Author

The Nomad agent starts with the config mentioned above.

@notnoop
Contributor

notnoop commented Jul 21, 2020

Perfect - thanks for letting us know!

@shantanugadgil
Contributor Author

Any chance of getting older drivers to work with Nomad in the foreseeable future?

@notnoop
Contributor

notnoop commented Jul 23, 2020

I suspect we're unlikely to try to support older drivers without strong demand; we'd be happy to link to a community driver if one exists ;-).

@github-actions

github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022