
"Nvidia GPU Device Plugin" not working #14888

Closed
alexgornov opened this issue Oct 13, 2022 · 18 comments · Fixed by #15125

Comments

@alexgornov

alexgornov commented Oct 13, 2022

Nomad version

Nomad v1.3.6+

Operating system and Environment details

CentOS Stream 8
Plugin "nomad-device-nvidia" v 1.0.0 (https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip)
NVIDIA-SMI 515.57
Driver Version: 515.57
CUDA Version: 11.7

Issue

"Nvidia GPU Device Plugin" not working on Nomad v1.3.6+

Reproduction steps

Install the nomad-device-nvidia plugin on Nomad v1.3.6+.
Config file:

plugin_dir = "/opt/nomad/plugins"
...
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
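
After restarting the Nomad client, the agent log should confirm that the plugin binary was loaded; the detection line looks similar to this one, taken from the debug log later in this thread (timestamps and versions will differ):

2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0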

Expected Result

There is a GPU in the output of the "nomad node status" command.
1.3.5: (screenshot attached)
Log 1.3.5:
nomad1.3.5.log

Actual Result

There is no GPU in the output of the "nomad node status" command.
1.3.6: (screenshot attached)

Log 1.3.6:
nomad1.3.6.log

Thank you!

@shoeffner

shoeffner commented Oct 13, 2022

I see the same behavior on 1.4.1. You can also quickly test it with the following job (which will not be planned if the GPU is not detected properly):

job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image = "nvidia/cuda:11.0.3-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
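
A minimal way to check placement without running anything, assuming the job above is saved as gpu-test.nomad and the Nomad CLI can reach the cluster, is a dry-run plan:

nomad job plan gpu-test.nomad

If the GPU was fingerprinted, the plan proposes a placement for the smi task; if not, it reports a placement failure for the nvidia/gpu device.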

@alexgornov
Author

(quoting shoeffner's test job above, which at the time used the image name "nnvidia/cuda:11.0.3-base-ubuntu20.04")

The job is not starting (screenshot attached).

@shoeffner

shoeffner commented Oct 13, 2022

It seems I smuggled an additional "n" into the image name (will fix it now), but on 1.3.2 the job starts, while on 1.4.1 it is not placed, so it is the same behavior you observed.

edit: I just realized we may have had a misunderstanding. My idea was to add a minimal working example so the maintainers can reproduce the problem.

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 17, 2022
@Fr0stoff

I have the exact same issue with 1.4.1, but 1.3.1 works fine.

@Fr0stoff

(replying to shoeffner's comment above)

I have the same problem with Nomad 1.4.1. Did you manage to run a GPU job somehow?
Thanks.

@shoeffner

shoeffner commented Oct 26, 2022 via email

@heipei

heipei commented Oct 28, 2022

Struggling with the same issue on 1.4.2 fwiw.

@jessfraz

jessfraz commented Nov 1, 2022

Seeing this on 1.4.2 as well.

@Mileshin

Mileshin commented Nov 2, 2022

I have the same problem when upgrading from 1.3.5 to 1.4.1.

@tgross
Member

tgross commented Nov 2, 2022

Hi folks, we've seen this issue and it's on our pile to triage. Using the 👍 reaction on the top-level post is more helpful than adding a comment, unless you're seeing the problem on a different driver version than the original post.

Does anyone have debug-level logs from a client during startup for this issue? It'd help kick off our investigation to see whether the problem is in fingerprinting the device or in communicating with the driver.
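
For anyone collecting these: a minimal way to get debug-level client logs during startup, assuming a typical agent setup (example paths, adjust for your environment), is to set the log level in the agent configuration and restart the agent:

# client agent configuration (example paths)
log_level = "DEBUG"
log_file  = "/var/log/nomad/nomad.log"

Alternatively, start the agent with the -log-level flag, e.g. "nomad agent -config /etc/nomad.d -log-level=DEBUG".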

@shoeffner

shoeffner commented Nov 2, 2022 via email

@heipei

heipei commented Nov 2, 2022

I've compiled a debug log during startup, let me know if this helps: https://gist.github.com/heipei/6d71b12fa086486b907729763981f27c

@tgross
Member

tgross commented Nov 3, 2022

Thanks @heipei. I've extracted the relevant log lines here:

device plugin logs

2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins address=/tmp/plugin070418469 network=unix timestamp=2022-11-02T20:01:12.258Z
2022-11-02T20:01:12.260Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins network=unix address=/tmp/plugin547638610 timestamp=2022-11-02T20:01:13.055Z
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:13.056Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
...
2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0
...
2022-11-02T20:01:23.170Z [INFO] client.plugin: starting plugin manager: plugin-type=device
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: starting plugin: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
...
2022-11-02T20:01:23.170Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
...
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: plugin started: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia pid=964949
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: waiting for RPC address: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia
...
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr: using plugin: plugin=nvidia-gpu version=2
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr.nomad-device-nvidia: plugin address: plugin=nvidia-gpu address=/tmp/plugin379502824 network=unix timestamp=2022-11-02T20:01:23.879Z
2022-11-02T20:01:23.906Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2022-11-02T20:01:23.906Z [DEBUG] client: new devices detected: devices=1

It looks like the plugin is having trouble fingerprinting during the initial startup, but it's succeeding later (enough for the scheduler to detect that the client has done so, at least). I know it's been a minute since we did an Nvidia driver release, so I took a look at the repo and was reminded of hashicorp/nomad-device-nvidia#6. There hasn't been a release of the changes made there. @shoenig you noted in the PR that we had some fixes to do with the implementation -- do you recall what the symptoms were there? (If not, I can try to stand up a box on AWS with an Nvidia card and dig in further.)

@shoenig
Member

shoenig commented Nov 3, 2022

IIRC what #6 uncovered was that the external plugin still imported the nvidia stuff from nomad, the act of which was enough to trigger an init block causing <???> bad things to happen. We may just need to finally cut a release with all the changes on the plugin side; let me try.

@vuuihc
Contributor

vuuihc commented Nov 3, 2022

Hi @tgross, I encountered the same problem today. By researching the code, I found that Devices was not successfully updated in batchFirstFingerprints. Maybe I can help fix it, so I made a PR; can you review it when you have time?

@shoenig
Member

shoenig commented Nov 3, 2022

Ah nice find @vuuihc, indeed this looks like fallout from #14139.

@shoenig
Member

shoenig commented Nov 3, 2022

Thanks for investigating and for the PR, @vuuihc! The fix should go out in the next releases of 1.4.x, 1.3.x, and 1.2.x.

@github-actions

github-actions bot commented Mar 4, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 4, 2023