
"Nvidia GPU Device Plugin" not working #14888

Closed
alexgornov opened this issue Oct 13, 2022 · 18 comments · Fixed by #15125

Comments

@alexgornov

alexgornov commented Oct 13, 2022

Nomad version

Nomad v1.3.6+

Operating system and Environment details

CentOS Stream 8
Plugin "nomad-device-nvidia" v 1.0.0 (https://releases.hashicorp.com/nomad-device-nvidia/1.0.0/nomad-device-nvidia_1.0.0_linux_amd64.zip)
NVIDIA-SMI 515.57
Driver Version: 515.57
CUDA Version: 11.7

Issue

"Nvidia GPU Device Plugin" not working on Nomad v1.3.6+

Reproduction steps

Install the nomad-device-nvidia plugin on Nomad v1.3.6+.
Config file:

plugin_dir = "/opt/nomad/plugins"
...
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"
  }
}
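
After restarting the Nomad client, the agent log should confirm that the plugin binary was loaded; the detection line looks similar to this one, taken from the debug log later in this thread (timestamps and versions will differ):

2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0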

Expected Result

There is a GPU in the output of the "nomad node status" command.
1.3.5: (screenshot attached)
Log 1.3.5:
nomad1.3.5.log

Actual Result

There is no GPU in the output of the "nomad node status" command.
1.3.6: (screenshot attached)

Log 1.3.6:
nomad1.3.6.log

Thank you!

@shoeffner

shoeffner commented Oct 13, 2022

I see the same behavior on 1.4.1. You can also quickly test it with the following job (which will not be planned if the GPU is not detected properly):

job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image = "nvidia/cuda:11.0.3-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
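
A minimal way to check placement without running anything, assuming the job above is saved as gpu-test.nomad and the Nomad CLI can reach the cluster, is a dry-run plan:

nomad job plan gpu-test.nomad

If the GPU was fingerprinted, the plan proposes a placement for the smi task; if not, it reports a placement failure for the nvidia/gpu device.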

@alexgornov
Author

(quoting shoeffner's test job above, which at the time used the image name "nnvidia/cuda:11.0.3-base-ubuntu20.04")

The job is not starting (screenshot attached).

@shoeffner

shoeffner commented Oct 13, 2022

It seems I smuggled an additional "n" into the image name (will fix it now), but on 1.3.2 the job starts, while on 1.4.1 it is not placed, so it is the same behavior you observed.

edit: I just realized we may have had a misunderstanding. My idea was to add a minimal working example so the maintainers can reproduce the problem.

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 17, 2022
@Fr0stoff

I have the exact same issue with 1.4.1, but 1.3.1 works fine.

@Fr0stoff

(replying to shoeffner's comment above)

I have the same problem with Nomad 1.4.1. Did you manage to run a GPU job somehow?
Thanks.

@shoeffner

shoeffner commented Oct 26, 2022 via email

@heipei

heipei commented Oct 28, 2022

Struggling with the same issue on 1.4.2 fwiw.

@jessfraz

jessfraz commented Nov 1, 2022

Seeing this on 1.4.2 as well.

@Mileshin

Mileshin commented Nov 2, 2022

I have the same problem when upgrading from 1.3.5 to 1.4.1.

@tgross
Member

tgross commented Nov 2, 2022

Hi folks, we've seen this issue and it's on our pile to triage. Using the 👍 reaction on the top-level post is more helpful than adding a comment, unless you're seeing the problem on a different driver version than the original post.

Does anyone have debug-level logs from a client during startup for this issue? It'd help kick off our investigation to see whether the problem is in fingerprinting the device or in communicating with the driver.
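
For anyone collecting these: a minimal way to get debug-level client logs during startup, assuming a typical agent setup (example paths, adjust for your environment), is to set the log level in the agent configuration and restart the agent:

# client agent configuration (example paths)
log_level = "DEBUG"
log_file  = "/var/log/nomad/nomad.log"

Alternatively, start the agent with the -log-level flag, e.g. "nomad agent -config /etc/nomad.d -log-level=DEBUG".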

@shoeffner

shoeffner commented Nov 2, 2022 via email

@heipei

heipei commented Nov 2, 2022

I've compiled a debug log during startup, let me know if this helps: https://gist.github.com/heipei/6d71b12fa086486b907729763981f27c

@tgross
Member

tgross commented Nov 3, 2022

Thanks @heipei. I've extracted the relevant log lines here:

device plugin logs

2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:11.561Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:12.259Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins address=/tmp/plugin070418469 network=unix timestamp=2022-11-02T20:01:12.258Z
2022-11-02T20:01:12.260Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964909
2022-11-02T20:01:12.356Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/opt/nomad/plugins
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: starting plugin: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: plugin started: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:12.357Z [DEBUG] agent.plugin_loader: waiting for RPC address: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader.nomad-device-nvidia: plugin address: plugin_dir=/opt/nomad/plugins network=unix address=/tmp/plugin547638610 timestamp=2022-11-02T20:01:13.055Z
2022-11-02T20:01:13.055Z [DEBUG] agent.plugin_loader: using plugin: plugin_dir=/opt/nomad/plugins version=2
2022-11-02T20:01:13.056Z [DEBUG] agent.plugin_loader.stdio: received EOF, stopping recv loop: plugin_dir=/opt/nomad/plugins err="rpc error: code = Unavailable desc = error reading from server: EOF"
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin process exited: plugin_dir=/opt/nomad/plugins path=/opt/nomad/plugins/nomad-device-nvidia pid=964923
2022-11-02T20:01:13.158Z [DEBUG] agent.plugin_loader: plugin exited: plugin_dir=/opt/nomad/plugins
...
2022-11-02T20:01:13.158Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=1.0.0
...
2022-11-02T20:01:23.170Z [INFO] client.plugin: starting plugin manager: plugin-type=device
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: starting plugin: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia args=["/opt/nomad/plugins/nomad-device-nvidia"]
...
2022-11-02T20:01:23.170Z [DEBUG] client.plugin: waiting on plugin manager initial fingerprint: plugin-type=device
...
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: plugin started: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia pid=964949
2022-11-02T20:01:23.170Z [DEBUG] client.device_mgr: waiting for RPC address: plugin=nvidia-gpu path=/opt/nomad/plugins/nomad-device-nvidia
...
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr: using plugin: plugin=nvidia-gpu version=2
2022-11-02T20:01:23.879Z [DEBUG] client.device_mgr.nomad-device-nvidia: plugin address: plugin=nvidia-gpu address=/tmp/plugin379502824 network=unix timestamp=2022-11-02T20:01:23.879Z
2022-11-02T20:01:23.906Z [DEBUG] client.plugin: finished plugin manager initial fingerprint: plugin-type=device
2022-11-02T20:01:23.906Z [DEBUG] client: new devices detected: devices=1

It looks like the plugin is having trouble fingerprinting during the initial startup, but it's succeeding later (enough for the scheduler to detect that the client has done so, at least). I know it's been a minute since we did an Nvidia driver release, so I took a look at the repo and was reminded of hashicorp/nomad-device-nvidia#6. There hasn't been a release of the changes made there. @shoenig you noted in the PR that we had some fixes to do with the implementation -- do you recall what the symptoms were there? (If not, I can try to stand up a box on AWS with an Nvidia card and dig in further.)

@shoenig
Member

shoenig commented Nov 3, 2022

IIRC what #6 uncovered was that the external plugin still imported the nvidia stuff from nomad, the act of which was enough to trigger an init block causing <???> bad things to happen. We may just need to finally cut a release with all the changes on the plugin side; let me try.

@vuuihc
Contributor

vuuihc commented Nov 3, 2022

Hi @tgross, I encountered the same problem today. By researching the code, I found that Devices was not successfully updated in batchFirstFingerprints. Maybe I can help fix it, so I made a PR; can you review it when you have time?

@shoenig
Member

shoenig commented Nov 3, 2022

Ah nice find @vuuihc, indeed this looks like fallout from #14139.

@shoenig
Member

shoenig commented Nov 3, 2022

Thanks for investigating and for the PR, @vuuihc! The fix should go out in the next releases of 1.4.x, 1.3.x, and 1.2.x.

@github-actions

github-actions bot commented Mar 4, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 4, 2023