Failed to initialize NVML: ERROR_UNKNOWN #452

Open
wangzheyuan opened this issue Aug 22, 2024 · 4 comments

@wangzheyuan

If I install HAMi without privileged=true in daemonsetnvidia.yaml, the device-plugin pod goes into CrashLoopBackOff.
Here is the device-plugin's log:

I0821 10:02:57.139613    9897 client.go:53] BuildConfigFromFlags failed for file /root/.kube/config: stat /root/.kube/config: no such file or directory using inClusterConfig
I0821 10:02:57.150807    9897 main.go:157] Starting FS watcher.
I0821 10:02:57.150849    9897 main.go:166] Start working on node gpu-4090
I0821 10:02:57.150852    9897 main.go:167] Starting OS watcher.
I0821 10:02:57.172809    9897 main.go:182] Starting Plugins.
I0821 10:02:57.172833    9897 main.go:240] Loading configuration.
I0821 10:02:57.172943    9897 vgpucfg.go:130] flags= [
    --mig-strategy value    the desired strategy for exposing MIG devices on GPUs that support it: [none | single | mixed] (default: "none") [$MIG_STRATEGY]
    --fail-on-init-error    fail the plugin if an error is encountered during initialization, otherwise block indefinitely (default: true) [$FAIL_ON_INIT_ERROR]
    --nvidia-driver-root value    the root path for the NVIDIA driver installation (typical values are '/' or '/run/nvidia/driver') (default: "/") [$NVIDIA_DRIVER_ROOT]
    --pass-device-specs    pass the list of DeviceSpecs to the kubelet on Allocate() (default: false) [$PASS_DEVICE_SPECS]
    --device-list-strategy value [ --device-list-strategy value ]    the desired strategy for passing the device list to the underlying runtime: [envvar | volume-mounts | cdi-annotations] (default: "envvar") [$DEVICE_LIST_STRATEGY]
    --device-id-strategy value    the desired strategy for passing device IDs to the underlying runtime: [uuid | index] (default: "uuid") [$DEVICE_ID_STRATEGY]
    --gds-enabled    ensure that containers are started with NVIDIA_GDS=enabled (default: false) [$GDS_ENABLED]
    --mofed-enabled    ensure that containers are started with NVIDIA_MOFED=enabled (default: false) [$MOFED_ENABLED]
    --config-file value    the path to a config file as an alternative to command line options or environment variables [$CONFIG_FILE]
    --cdi-annotation-prefix value    the prefix to use for CDI container annotation keys (default: "cdi.k8s.io/") [$CDI_ANNOTATION_PREFIX]
    --nvidia-ctk-path value    the path to use for the nvidia-ctk in the generated CDI specification (default: "/usr/bin/nvidia-ctk") [$NVIDIA_CTK_PATH]
    --container-driver-root value    the path where the NVIDIA driver root is mounted in the container; used for generating CDI specifications (default: "/driver-root") [$CONTAINER_DRIVER_ROOT]
    --node-name value    node name (default: "evecom-4090") [$NodeName]
    --device-split-count value    the number for NVIDIA device split (default: 2) [$DEVICE_SPLIT_COUNT]
    --device-memory-scaling value    the ratio for NVIDIA device memory scaling (default: 1) [$DEVICE_MEMORY_SCALING]
    --device-cores-scaling value    the ratio for NVIDIA device cores scaling (default: 1) [$DEVICE_CORES_SCALING]
    --disable-core-limit    If set, the core utilization limit will be ignored (default: false) [$DISABLE_CORE_LIMIT]
    --resource-name value    the name of field for number GPU visible in container (default: "nvidia.com/gpu")
    --help, -h    show help
    --version, -v    print the version]
I0821 10:02:57.173052    9897 vgpucfg.go:139] DeviceMemoryScaling 1
I0821 10:02:57.173143    9897 vgpucfg.go:108] Device Plugin Configs: {[{m5-cloudinfra-online02 1.8 0 10 none}]}
I0821 10:02:57.173147    9897 main.go:255] Updating config with default resource matching patterns.
I0821 10:02:57.173269    9897 main.go:266] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "ResourceName": "nvidia.com/gpu",
  "DebugMode": null
}
I0821 10:02:57.173272    9897 main.go:269] Retrieving plugins.
I0821 10:02:57.173609    9897 factory.go:107] Detected NVML platform: found NVML library
I0821 10:02:57.173628    9897 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
config= [{* nvidia.com/gpu}]
E0821 10:02:57.199580    9897 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0821 10:02:57.199624    9897 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0821 10:02:57.199627    9897 factory.go:79] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0821 10:02:57.199630    9897 factory.go:80] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0821 10:02:57.199632    9897 factory.go:81] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0821 10:02:57.214373    9897 main.go:126] error starting plugins: error creating plugin manager: unable to create plugin manager: nvml init failed: ERROR_UNKNOWN
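
For context, the privileged=true toggle above is the securityContext of the device-plugin container in daemonsetnvidia.yaml. A minimal sketch of the relevant fragment (container name and surrounding field layout are illustrative; the chart-generated manifest may differ):

# Sketch only: the part of daemonsetnvidia.yaml being toggled.
# Field layout is illustrative and may differ from the generated manifest.
spec:
  template:
    spec:
      containers:
        - name: device-plugin
          securityContext:
            privileged: true   # without this, NVML init fails and the pod CrashLoopBackOffs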

If I install HAMi with privileged=true in daemonsetnvidia.yaml, the device plugin works well. However, containers that request a vGPU hit the following error:

[root@gpu-4090 ~]# cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod2
  namespace: emlp
spec:
  runtimeClassName: nvidia
  containers:
    - name: test
      image: nvidia/cuda:12.1.0-base-ubuntu18.04
      imagePullPolicy: IfNotPresent
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          nvidia.com/gpu: 1

root@gpu-pod2:/# nvidia-smi
Failed to initialize NVML: ERROR_UNKNOWN
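
The pod above relies on runtimeClassName: nvidia, which only resolves if a RuntimeClass exists whose handler matches the "nvidia" runtime registered in containerd (config.toml is shown further below). A minimal sketch of such an object, assuming the handler name is "nvidia":

# Sketch: RuntimeClass mapping runtimeClassName: nvidia to the containerd
# runtime registered as "nvidia" in config.toml.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia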

Here is the vgpu-scheduler-extender's log:

I0822 02:42:53.118255       1 route.go:131] Start to handle webhook request on /webhook
I0822 02:42:53.118693       1 webhook.go:63] Processing admission hook for pod emlp/gpu-pod2, UID: 14ea5878-ce25-4f32-bb1e-cf6d4b42c398
I0822 02:42:53.154054       1 route.go:44] Into Predicate Route inner func
I0822 02:42:53.154209       1 scheduler.go:435] "begin schedule filter" pod="gpu-pod2" uuid="3876900e-cf59-49f0-b2f0-65fb84c8cdb9" namespaces="emlp"
I0822 02:42:53.154220       1 device.go:241] Counting mlu devices
I0822 02:42:53.154226       1 device.go:175] Counting dcu devices
I0822 02:42:53.154229       1 device.go:166] Counting iluvatar devices
I0822 02:42:53.154234       1 device.go:195] Counting ascend 910B devices
I0822 02:42:53.154238       1 ascend310p.go:209] Counting Ascend310P devices
I0822 02:42:53.154249       1 pod.go:40] "collect requestreqs" counts=[{"NVIDIA":{"Nums":1,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}}]
I0822 02:42:53.154272       1 score.go:32] devices status
I0822 02:42:53.154285       1 score.go:34] "device status" device id="GPU-f4a6984d-1947-3b2c-03fe-40586909cbad" device detail={"Device":{"ID":"GPU-f4a6984d-1947-3b2c-03fe-40586909cbad","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":24564,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA GeForce RTX 4090","Health":true},"Score":0}
I0822 02:42:53.154294       1 score.go:34] "device status" device id="GPU-4048ae23-1753-4d20-96d0-16be28f65017" device detail={"Device":{"ID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Index":0,"Used":0,"Count":10,"Usedmem":0,"Totalmem":24564,"Totalcore":100,"Usedcores":0,"Numa":0,"Type":"NVIDIA-NVIDIA GeForce RTX 4090","Health":true},"Score":0}
I0822 02:42:53.154301       1 node_policy.go:61] node gpu-4090 used 0, usedCore 0, usedMem 0,
I0822 02:42:53.154306       1 node_policy.go:73] node gpu-4090 computer score is 0.000000
I0822 02:42:53.154314       1 gpu_policy.go:70] device GPU-f4a6984d-1947-3b2c-03fe-40586909cbad user 0, userCore 0, userMem 0,
I0822 02:42:53.154317       1 gpu_policy.go:76] device GPU-f4a6984d-1947-3b2c-03fe-40586909cbad computer score is 11.000000
I0822 02:42:53.154319       1 gpu_policy.go:70] device GPU-4048ae23-1753-4d20-96d0-16be28f65017 user 0, userCore 0, userMem 0,
I0822 02:42:53.154321       1 gpu_policy.go:76] device GPU-4048ae23-1753-4d20-96d0-16be28f65017 computer score is 11.000000
I0822 02:42:53.154329       1 score.go:68] "Allocating device for container request" pod="emlp/gpu-pod2" card request={"Nums":1,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0822 02:42:53.154345       1 score.go:72] "scoring pod" pod="emlp/gpu-pod2" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1 device="GPU-4048ae23-1753-4d20-96d0-16be28f65017"
I0822 02:42:53.154352       1 score.go:60] checkUUID result is true for NVIDIA type
I0822 02:42:53.154358       1 score.go:124] "first fitted" pod="emlp/gpu-pod2" device="GPU-4048ae23-1753-4d20-96d0-16be28f65017"
I0822 02:42:53.154366       1 score.go:135] "device allocate success" pod="emlp/gpu-pod2" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]}
I0822 02:42:53.154370       1 scheduler.go:470] nodeScores_len= 1
I0822 02:42:53.154373       1 scheduler.go:473] schedule emlp/gpu-pod2 to evegpucom-4090 map[NVIDIA:[[{0 GPU-4048ae23-1753-4d20-96d0-16be28f65017 NVIDIA 24564 0}]]]
I0822 02:42:53.154388       1 util.go:146] Encoded container Devices: GPU-4048ae23-1753-4d20-96d0-16be28f65017,NVIDIA,24564,0:
I0822 02:42:53.154390       1 util.go:169] Encoded pod single devices GPU-4048ae23-1753-4d20-96d0-16be28f65017,NVIDIA,24564,0:;
I0822 02:42:53.154395       1 pods.go:63] Pod added: Name: gpu-pod2, UID: 3876900e-cf59-49f0-b2f0-65fb84c8cdb9, Namespace: emlp, NodeID: gpu-4090
I0822 02:42:53.162102       1 scheduler.go:368] "Bind" pod="gpu-pod2" namespace="emlp" podUID="3876900e-cf59-49f0-b2f0-65fb84c8cdb9" node="gpu-4090"
I0822 02:42:53.162380       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.169721       1 device.go:241] Counting mlu devices
I0822 02:42:53.193546       1 nodelock.go:62] "Node lock set" node="gpu-4090"
I0822 02:42:53.203870       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.207552       1 scheduler.go:421] After Binding Process
I0822 02:42:53.208761       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.254593       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.267133       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.312288       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:53.691029       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:54.525899       1 util.go:237] "Decoded pod annos" poddevices={"NVIDIA":[[{"Idx":0,"UUID":"GPU-4048ae23-1753-4d20-96d0-16be28f65017","Type":"NVIDIA","Usedmem":24564,"Usedcores":0}]]}
I0822 02:42:57.451349       1 scheduler.go:195] "New timestamp" hami.io/node-handshake="Requesting_2024.08.22 02:42:57" nodeName="gpu-4090"
I0822 02:42:57.473008       1 util.go:137] Encoded node Devices: GPU-f4a6984d-1947-3b2c-03fe-40586909cbad,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:GPU-4048ae23-1753-4d20-96d0-16be28f65017,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:
I0822 02:43:27.568147       1 scheduler.go:195] "New timestamp" hami.io/node-handshake="Requesting_2024.08.22 02:43:27" nodeName="gpu-4090"
I0822 02:43:27.597171       1 util.go:137] Encoded node Devices: GPU-f4a6984d-1947-3b2c-03fe-40586909cbad,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:GPU-4048ae23-1753-4d20-96d0-16be28f65017,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:

Ubuntu: 22.04.4
Kubernetes: RKE2 1.28.12
Containerd: v1.7.17-k3s1
NVIDIA Container Toolkit: 1.15.0

root@gpu-4090:~# nvidia-smi
Tue Aug 20 16:49:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   39C    P8             34W /  450W |      20MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:04:00.0 Off |                  Off |
|  0%   34C    P8             23W /  450W |      20MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                              9MiB |
|    0   N/A  N/A      2153      G   /usr/bin/gnome-shell                           10MiB |
|    1   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

root@gpu-4090:~# cat /var/lib/rancher/rke2/agent/etc/containerd/config.toml
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/rke2/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "index.docker.io/rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/rke2/agent/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true

root@gpu-4090:~# cat hami/values.yaml
scheduler:
  kubeScheduler:
    image: registry.k8s.io/kube-scheduler
    imageTag: v1.28.12
  nodeSelector:
    kubernetes.io/hostname: gpu-4090
devicePlugin:
  runtimeClassName: nvidia
@archlitchi
Collaborator

It seems your NVIDIA driver may not be installed correctly. You can try installing nvidia-device-plugin v0.14 and see if that launches correctly.
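
For that test, a minimal Helm values sketch (the runtimeClassName key is an assumption about the upstream nvidia-device-plugin chart; verify it against the chart's values.yaml before using it):

# Sketch: values for the upstream nvidia-device-plugin v0.14.x chart, run
# under the nvidia runtime so NVML is reachable inside the plugin pod.
# runtimeClassName is assumed to be a supported chart value.
runtimeClassName: nvidia
nodeSelector:
  kubernetes.io/hostname: gpu-4090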

@wangzheyuan
Author

wangzheyuan commented Aug 22, 2024

The NVIDIA GPU Operator works fine, but nvidia-device-plugin v0.14.5 fails with the same error:

I0822 08:51:42.921468       1 main.go:154] Starting FS watcher.
I0822 08:51:42.921503       1 main.go:161] Starting OS watcher.
I0822 08:51:42.921566       1 main.go:176] Starting Plugins.
I0822 08:51:42.921574       1 main.go:234] Loading configuration.
I0822 08:51:42.921623       1 main.go:242] Updating config with default resource matching patterns.
I0822 08:51:42.921704       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0822 08:51:42.921708       1 main.go:256] Retreiving plugins.
I0822 08:51:42.921955       1 factory.go:107] Detected NVML platform: found NVML library
I0822 08:51:42.921968       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0822 08:51:42.925620       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0822 08:51:42.925629       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
E0822 08:51:42.925630       1 factory.go:79] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0822 08:51:42.925632       1 factory.go:80] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0822 08:51:42.925634       1 factory.go:81] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0822 08:51:42.925723       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: nvml init failed: ERROR_UNKNOWN
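
A runtime-only check can help isolate this from HAMi and the device plugin entirely: a pod that requests no GPU resource and relies only on runtimeClassName: nvidia (the CUDA base image already sets NVIDIA_VISIBLE_DEVICES=all, so the nvidia runtime should inject the driver on its own). This is a sketch, assuming a RuntimeClass named nvidia is registered; if nvidia-smi also fails here, the problem sits in the runtime/driver layer rather than in either plugin:

# Sketch: bypasses the device plugin; only exercises nvidia-container-runtime.
apiVersion: v1
kind: Pod
metadata:
  name: runtime-check
  namespace: emlp
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
    - name: smi
      image: nvidia/cuda:12.1.0-base-ubuntu18.04
      command: ["nvidia-smi"]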

@lengrongfu
Contributor

You can look at the toolkit pod's log.

@wangzheyuan
Author

You can look at the toolkit pod's log.

Do you mean the NVIDIA Container Toolkit?
