WSL2 - No devices found. Waiting indefinitely. #646

Closed
qingfengfenga opened this issue Apr 14, 2024 · 3 comments

@qingfengfenga


1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): WSL2 Ubuntu 22.04.3 LTS
  • Kernel Version: 5.15.146.1-microsoft-standard-WSL2
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): k3s v1.28.8-k3s1

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

The nvidia-device-plugin pod can run nvidia-smi successfully, but the plugin logs report that no GPU devices are found ("No devices found. Waiting indefinitely.").

Detailed problem description

justinthelaw/k3d-gpu-support#1

Reference

k3d-io/k3d#1108 (comment)

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host

NVIDIA-SMI-LOG.txt

  • Your docker configuration file (e.g: /etc/docker/daemon.json)
{
 "builder": {
   "gc": {
     "defaultKeepStorage": "20GB",
     "enabled": true
   }
 },
 "experimental": false,
 "registry-mirrors": [
   "https://ehc6d6n1.mirror.aliyuncs.com",
   "https://ghcr.nju.edu.cn"
 ]
}
  • The k8s-device-plugin container logs
$ kubectl logs nvidia-device-plugin-daemonset-vvpkz -n kube-system
I0414 10:00:33.522494       1 main.go:154] Starting FS watcher.
I0414 10:00:33.522555       1 main.go:161] Starting OS watcher.
I0414 10:00:33.522912       1 main.go:176] Starting Plugins.
I0414 10:00:33.522931       1 main.go:234] Loading configuration.
I0414 10:00:33.522979       1 main.go:242] Updating config with default resource matching patterns.
I0414 10:00:33.523113       1 main.go:253]
Running with config:
{
 "version": "v1",
 "flags": {
   "migStrategy": "none",
   "failOnInitError": true,
   "nvidiaDriverRoot": "/",
   "gdsEnabled": false,
   "mofedEnabled": false,
   "plugin": {
     "passDeviceSpecs": true,
     "deviceListStrategy": [
       "envvar"
     ],
     "deviceIDStrategy": "uuid",
     "cdiAnnotationPrefix": "cdi.k8s.io/",
     "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
     "containerDriverRoot": "/driver-root"
   }
 },
 "resources": {
   "gpus": [
     {
       "pattern": "*",
       "name": "nvidia.com/gpu"
     }
   ]
 },
 "sharing": {
   "timeSlicing": {}
 }
}
I0414 10:00:33.523131       1 main.go:256] Retreiving plugins.
I0414 10:00:33.524465       1 factory.go:107] Detected NVML platform: found NVML library
I0414 10:00:33.524495       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0414 10:00:33.541706       1 main.go:287] No devices found. Waiting indefinitely.
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
    Docker Desktop 4.28.0 (139021)
  • Docker command, image and tag used
    nvcr.io/nvidia/k8s-device-plugin:v0.14.5
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
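
As a sanity check, GPU visibility from Docker on WSL2 can also be confirmed directly, independent of the device plugin. A minimal sketch; the CUDA image tag here is only an illustrative choice, any CUDA base image should do:

$ # run nvidia-smi inside a throwaway container via Docker's GPU support
$ docker run --rm --gpus all nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
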
@elezar (Member) commented Apr 15, 2024

@qingfengfenga there was some work done for WSL2 in the 0.15.0 release branch. Could you test using the 0.15.0-rc.2 version instead of 0.14.5?
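
For anyone trying the same thing, one way to switch versions in place is to point the existing DaemonSet at the newer image tag. A sketch, assuming the container name nvidia-device-plugin-ctr from the standard static manifest (adjust if your deployment differs):

$ # swap the plugin image to the 0.15.0-rc.2 tag on the existing DaemonSet
$ kubectl -n kube-system set image daemonset/nvidia-device-plugin-daemonset \
    nvidia-device-plugin-ctr=nvcr.io/nvidia/k8s-device-plugin:v0.15.0-rc.2

After the pods roll, re-checking the plugin logs should show whether the nvidia.com/gpu resource is registered instead of "No devices found".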

@dbreyfogle commented

Hi @qingfengfenga, I recently submitted a PR to k3d which updated the documentation for how to run CUDA workloads: https://k3d.io/v5.6.3/usage/advanced/cuda

It also updates the NVIDIA device plugin to 0.15.0-rc.2, as mentioned by @elezar. In my testing on WSL it worked without issues. Do you mind trying the new docs and seeing whether that fixes it?
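
Following those docs, the cluster-creation step looks roughly like the sketch below; the cluster name is arbitrary and the image placeholder stands for the custom CUDA-enabled k3s image built per the guide:

$ # create a GPU-enabled k3d cluster from a custom CUDA-enabled k3s image
$ k3d cluster create gputest --image=<custom-k3s-cuda-image> --gpus=1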

@qingfengfenga (Author) commented Apr 16, 2024

@elezar @dbreyfogle After switching to 0.15.0-rc.2, k3d on WSL2 can run CUDA workloads normally. Thank you for your work, and we look forward to the official 0.15 release!
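
For anyone verifying the same setup, a minimal smoke-test pod that requests the nvidia.com/gpu resource can confirm scheduling end to end. A sketch only; the CUDA image tag is an illustrative assumption:

# gpu-smoke.yaml -- requests one GPU, prints nvidia-smi output, then exits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # illustrative tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

$ kubectl apply -f gpu-smoke.yaml
$ kubectl logs -f gpu-smoke-test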
