nomad blocks rmmod of nvidia.ko #5

Open
andaag opened this issue Aug 1, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@andaag

andaag commented Aug 1, 2021

Nomad version

Nomad v1.1.3 (8c0c8140997329136971e66e4c2337dfcf932692)

Operating system and Environment details

Ubuntu 20.04.2 LTS, single cluster, nomad running as user (for testing)

Issue

nomad agent -config=full.nomad

with:

plugin "nvidia-gpu" {
  config {
    enabled = false
  }
}

Reproduction steps

rmmod nvidia - works
modprobe nvidia - works
nomad agent &
rmmod nvidia - no longer works.

According to nvidia-smi no processes are using the nvidia module, but it's definitely the nomad agent process that's blocking it - even with nvidia-gpu off.

Expected Result

With nvidia-gpu off I should be able to unload the nvidia module.

Actual Result

nomad blocks rmmod

Extra info:

I also tried running with ignored_gpu_ids instead, and then got this message:

    2021-08-01T11:39:59.532+0200 [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2021-08-01T11:40:05.698+0200 [INFO]  client.device_mgr: fingerprinting failed: plugin is not enabled: plugin=nvidia-gpu

So presumably disabling the plugin does work?

The nvidia module is in general allocated to qemu, but I haven't moved any qemu work into nomad yet. This blocks me from testing nomad, as I now can't run my other containers. I've tried turning off both qemu and nvidia-gpu, but it still locks access to the nvidia.ko module.

@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Thanks for the report @andaag 🙂

@tgross
Member

tgross commented Jan 10, 2022

Sorry that looking into this got delayed, @andaag.

As of Nomad 1.2.0 the Nvidia driver has been externalized. Either way, it shouldn't have been blocking rmmod if the plugin was disabled. I took a quick look at the code, and my suspicion is that when we instantiate the NVML client in NewNvidiaDevice, the client has some side effect that keeps the module locked.

We need to call NewNvidiaDevice before we can check the enabled flag, but I don't see any reason why we couldn't construct the NVML client lazily in SetConfig(). I don't have a good setup to actually test this theory, but it's a small change to make.
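The lazy-construction idea could look roughly like the sketch below. This is not the plugin's actual code: the `nvmlClient` interface, `fakeNvml`, the `Config` struct, and the simplified `NewNvidiaDevice`, `SetConfig`, and `Fingerprint` signatures are all assumptions made for illustration. The only point it demonstrates is that the constructor stores a factory and the NVML client is created only once the config says the plugin is enabled.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// nvmlClient is a hypothetical stand-in for the real NVML client wrapper.
type nvmlClient interface {
	DeviceCount() (int, error)
}

// fakeNvml is a stub used so the sketch runs without a GPU.
type fakeNvml struct{}

func (fakeNvml) DeviceCount() (int, error) { return 0, nil }

// Config models only the `enabled` flag from the plugin config (simplified).
type Config struct {
	Enabled bool
}

var errNotEnabled = errors.New("plugin is not enabled")

type NvidiaDevice struct {
	mu        sync.Mutex
	enabled   bool
	newClient func() (nvmlClient, error) // factory; not called in the constructor
	client    nvmlClient
}

// NewNvidiaDevice only records how to build the client; it does not touch NVML.
func NewNvidiaDevice(newClient func() (nvmlClient, error)) *NvidiaDevice {
	return &NvidiaDevice{newClient: newClient}
}

// SetConfig builds the NVML client on first use, and only when enabled.
func (d *NvidiaDevice) SetConfig(cfg Config) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.enabled = cfg.Enabled
	if !cfg.Enabled || d.client != nil {
		return nil
	}
	c, err := d.newClient()
	if err != nil {
		return fmt.Errorf("initializing NVML client: %w", err)
	}
	d.client = c
	return nil
}

// Fingerprint refuses to run when the plugin is disabled or uninitialized.
func (d *NvidiaDevice) Fingerprint() (int, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if !d.enabled {
		return 0, errNotEnabled
	}
	if d.client == nil {
		return 0, errors.New("NVML client not initialized")
	}
	return d.client.DeviceCount()
}

func main() {
	inits := 0
	dev := NewNvidiaDevice(func() (nvmlClient, error) {
		inits++
		return fakeNvml{}, nil
	})
	_ = dev.SetConfig(Config{Enabled: false})
	fmt.Println("NVML initializations with plugin disabled:", inits)
}
```

Whether this actually releases the hold on nvidia.ko depends on the untested theory above that NVML initialization is what locks the module.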

I'm going to self-assign this issue but move it to the device plugin repo.

tgross transferred this issue from hashicorp/nomad Jan 10, 2022
tgross self-assigned this Jan 10, 2022
tgross added the bug Something isn't working label Jan 10, 2022
tgross removed their assignment Feb 8, 2022
Projects
Status: Needs Roadmapping
Development

No branches or pull requests

3 participants