This repository has been archived by the owner on May 27, 2024. It is now read-only.

GFD returns 'no labels generated from any source' #36

Closed
MichaelJendryke opened this issue Mar 22, 2023 · 6 comments

@MichaelJendryke

Dear all,

I have a setup of k3s and rancher on three nodes. One node has two Tesla T4 GPUs.

Running nvidia-smi on the node directly returns

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:41:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:A1:00.0 Off |                    0 |
| N/A   34C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This tells me that the driver is installed correctly and that I can proceed with the k3s guide.
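
As an additional, hedged sanity check, the NVIDIA container library itself can be asked to enumerate the GPUs on the host. This relies on nvidia-container-cli from the libnvidia-container-tools package, which appears in the package list further down:

# Optional host-side check: the container library should report the same
# two Tesla T4 devices that nvidia-smi shows.
nvidia-container-cli info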

The content of /var/lib/rancher/k3s/agent/etc/containerd/config.toml is:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"


[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true



[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

Here I added the line default_runtime_name = "nvidia".
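
Whether containerd actually picked up this runtime entry can be checked through the CRI interface; a small sketch (k3s bundles crictl as a subcommand, and grep merely pulls the relevant lines out of the JSON dump):

# Look for the nvidia runtime entry and the default runtime name in the
# CRI plugin's view of the containerd config.
k3s crictl info | grep -i nvidia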

I continue with

  1. Node Feature Discovery (NFD)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/nfd.yaml

and also

kubectl apply -k https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.12.1

which shows the following result in the logs of nfd

REQUEST Node: geo-node1
NFD-version: v0.6.0
Labels: map[
cpu-cpuid.ADX:true
cpu-cpuid.AESNI:true
cpu-cpuid.AVX:true
cpu-cpuid.AVX2:true
cpu-cpuid.FMA3:true
cpu-cpuid.SHA:true
cpu-cpuid.SSE4A:true
cpu-hardware_multithreading:true
cpu-rdt.RDTCMT:true
cpu-rdt.RDTL3CA:true
cpu-rdt.RDTMBM:true
cpu-rdt.RDTMON:true
iommu-enabled:true
kernel-config.NO_HZ:true
kernel-config.NO_HZ_IDLE:true
kernel-version.full:5.15.0-67-generic
kernel-version.major:5
kernel-version.minor:15
kernel-version.revision:0
memory-numa:true
nvidia.com/gfd.timestamp:1679476204
pci-102b.present:true
pci-10de.present:true
pci-10de.sriov.capable:true
storage-nonrotationaldisk:true
system-os_release.ID:ubuntu
system-os_release.VERSION_ID:22.04
system-os_release.VERSION_ID.major:22
system-os_release.VERSION_ID.minor:04
]

This mentions nvidia and pci-10de, suggesting that discovery was successful, as I do not get these entries on my non-GPU nodes.
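
To confirm that these labels actually reached the Kubernetes node object, and not only the nfd-worker log, something like the following can be used (node name taken from this thread):

# The feature.node.kubernetes.io/* labels from NFD should appear here.
kubectl get node geo-node1 --show-labels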

  2. NVIDIA GPU Feature Discovery (GFD)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/gpu-feature-discovery-daemonset.yaml

After applying the above GFD daemonset and checking the logs, I see:

2023/03/22 09:10:04 Starting OS watcher.
2023/03/22 09:10:04 Loading configuration.
2023/03/22 09:10:04 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/03/22 09:10:04 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2023/03/22 09:10:04 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/03/22 09:10:04 WARNING: No valid resources detected; using empty manager.
2023/03/22 09:10:04 Start running
2023/03/22 09:10:04 Warning: no labels generated from any source
2023/03/22 09:10:04 Writing labels to output file
2023/03/22 09:10:04 Sleeping for 60000000000
2023/03/22 09:11:04 Warning: no labels generated from any source
2023/03/22 09:11:04 Writing labels to output file
2023/03/22 09:11:04 Sleeping for 60000000000

It says that no labels were generated. Is this because of the warning 'WARNING: No valid resources detected; using empty manager.'?

As NFD seems to work but GFD does not, I exec'd into the GFD DaemonSet pod and ran gpu-feature-discovery from the command line, but had no luck getting any different output.
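
The 'could not load NVML: libnvidia-ml.so.1' line suggests the driver libraries were never injected into the container. A hedged way to check this from inside the pod (the pod name is a placeholder, the library path assumes an Ubuntu x86_64 driver install, and the image is assumed to ship a shell):

# If the NVIDIA container runtime had been used for this pod, the driver's
# NVML library would be visible inside the container.
kubectl exec -it <gpu-feature-discovery-pod> -- \
  sh -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 || echo "NVML library not mounted"'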

Notes
I tried this with nvidia-container-toolkit 1.12.1 and 1.13.0-rc.2

The installed NVIDIA-related packages are:

+++-==================================-==========================-============-=========================================================
un  libgldispatch0-nvidia              <none>                     <none>       (no description available)
ii  libnvidia-cfg1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                 <none>                     <none>       (no description available)
un  libnvidia-common                   <none>                     <none>       (no description available)
ii  libnvidia-common-530               530.30.02-0ubuntu1         all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                  <none>                     <none>       (no description available)
rc  libnvidia-compute-515-server:amd64 515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA libcompute package
ii  libnvidia-compute-530:amd64        530.30.02-0ubuntu1         amd64        NVIDIA libcompute package
ii  libnvidia-container-tools          1.13.0~rc.2-1              amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64         1.13.0~rc.2-1              amd64        NVIDIA container runtime library
un  libnvidia-decode                   <none>                     <none>       (no description available)
ii  libnvidia-decode-530:amd64         530.30.02-0ubuntu1         amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                   <none>                     <none>       (no description available)
ii  libnvidia-encode-530:amd64         530.30.02-0ubuntu1         amd64        NVENC Video Encoding runtime library
un  libnvidia-extra                    <none>                     <none>       (no description available)
ii  libnvidia-extra-530:amd64          530.30.02-0ubuntu1         amd64        Extra libraries for the NVIDIA driver
un  libnvidia-fbc1                     <none>                     <none>       (no description available)
ii  libnvidia-fbc1-530:amd64           530.30.02-0ubuntu1         amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                       <none>                     <none>       (no description available)
ii  libnvidia-gl-530:amd64             530.30.02-0ubuntu1         amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ml1                      <none>                     <none>       (no description available)
un  nvidia-384                         <none>                     <none>       (no description available)
un  nvidia-390                         <none>                     <none>       (no description available)
un  nvidia-common                      <none>                     <none>       (no description available)
un  nvidia-compute-utils               <none>                     <none>       (no description available)
rc  nvidia-compute-utils-515-server    515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-530           530.30.02-0ubuntu1         amd64        NVIDIA compute utilities
un  nvidia-container-runtime           <none>                     <none>       (no description available)
un  nvidia-container-runtime-hook      <none>                     <none>       (no description available)
ii  nvidia-container-toolkit           1.13.0~rc.2-1              amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base      1.13.0~rc.2-1              amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-515-server             515.86.01-0ubuntu0.22.04.2 amd64        NVIDIA DKMS package
ii  nvidia-dkms-530                    530.30.02-0ubuntu1         amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                 <none>                     <none>       (no description available)
ii  nvidia-driver-530                  530.30.02-0ubuntu1         amd64        NVIDIA driver metapackage
un  nvidia-driver-binary               <none>                     <none>       (no description available)
un  nvidia-fabricmanager               <none>                     <none>       (no description available)
ii  nvidia-fabricmanager-515           515.86.01-0ubuntu0.22.04.2 amd64        Fabric Manager for NVSwitch based systems.
un  nvidia-kernel-common               <none>                     <none>       (no description available)
rc  nvidia-kernel-common-515-server    515.86.01-0ubuntu0.22.04.2 amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-530           530.30.02-0ubuntu1         amd64        Shared files used with the kernel module
un  nvidia-kernel-open                 <none>                     <none>       (no description available)
un  nvidia-kernel-open-530             <none>                     <none>       (no description available)
un  nvidia-kernel-source               <none>                     <none>       (no description available)
un  nvidia-kernel-source-515-server    <none>                     <none>       (no description available)
ii  nvidia-kernel-source-530           530.30.02-0ubuntu1         amd64        NVIDIA kernel source package
ii  nvidia-modprobe                    530.30.02-0ubuntu1         amd64        Load the NVIDIA kernel driver and create device files
un  nvidia-opencl-icd                  <none>                     <none>       (no description available)
un  nvidia-persistenced                <none>                     <none>       (no description available)
ii  nvidia-prime                       0.8.17.1                   all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                    530.30.02-0ubuntu1         amd64        Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary             <none>                     <none>       (no description available)
un  nvidia-smi                         <none>                     <none>       (no description available)
un  nvidia-utils                       <none>                     <none>       (no description available)
ii  nvidia-utils-530                   530.30.02-0ubuntu1         amd64        NVIDIA driver support binaries
ii  xserver-xorg-video-nvidia-530      530.30.02-0ubuntu1         amd64        NVIDIA binary Xorg driver
@klueska
Collaborator

klueska commented Mar 22, 2023

On k3s you need to update /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl not /var/lib/rancher/k3s/agent/etc/containerd/config.toml, otherwise your config will get overwritten.
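
A minimal sketch of that approach, assuming the paths from this thread (the systemd unit to restart is k3s-agent on an agent-only node, or k3s on a server node):

# Base the template on the config k3s generated, so only the nvidia-specific
# edits differ from the default.
cd /var/lib/rancher/k3s/agent/etc/containerd/
cp config.toml config.toml.tmpl
# Edit config.toml.tmpl: add default_runtime_name = "nvidia" and the
# runtimes."nvidia" sections shown above, then restart so the real
# config.toml is regenerated from the template.
systemctl restart k3s-agent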

@MichaelJendryke
Author

Thanks for the answer @klueska

I found the following containerd config files.

/etc/containerd/config.toml:

#   Copyright 2018-2022 Docker Inc.

#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at

#       http://www.apache.org/licenses/LICENSE-2.0

#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

disabled_plugins = ["cri"]

#root = "/var/lib/containerd"
#state = "/run/containerd"
#subreaper = true
#oom_score = 0

#[grpc]
#  address = "/run/containerd/containerd.sock"
#  uid = 0
#  gid = 0

#[debug]
#  address = "/run/containerd/debug.sock"
#  uid = 0
#  gid = 0
#  level = "info"

/var/lib/rancher/k3s/agent/etc/containerd/config.toml:

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true


[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"


[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true









[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

But as I did not have /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl, as described here, I took the template from this blog post.

Restarting containerd overwrites /var/lib/rancher/k3s/agent/etc/containerd/config.toml with:

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins.cri.cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"


[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runc.v2' for GPU support
  runtime_type = "io.containerd.runtime.v1.linux"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"

But unfortunately the GFD DaemonSet does not start after that:

Warning  FailedCreatePodSandBox  14s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: cgroups: cgroup mountpoint does not exist: unknown

I guess the tmpl I found is outdated. Could you please point me to the docs to create this file?

@klueska
Collaborator

klueska commented Mar 22, 2023

@MichaelJendryke
Author

MichaelJendryke commented Mar 23, 2023

I have tried to follow tutorials that do not set the default runtime to nvidia (e.g. this). Instead I am now trying to follow this. I have modified the NFD, GFD and NVIDIA device plugin YAML files to add runtimeClassName: nvidia, which results in the following output of kubectl describe node geo-node1 after GFD started:

Name:               geo-node1
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=k3s
                    beta.kubernetes.io/os=linux
                    egress.k3s.io/cluster=true
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/iommu-enabled=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.0-67-generic
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=15
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/memory-numa=true
                    feature.node.kubernetes.io/pci-102b.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    has_gpu=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=geo-node1
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=k3s
                    nvidia.com/cuda.driver.major=530
                    nvidia.com/cuda.driver.minor=30
                    nvidia.com/cuda.driver.rev=02
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=1
                    nvidia.com/gfd.timestamp=1679506166
                    nvidia.com/gpu.compute.major=7
                    nvidia.com/gpu.compute.minor=5
                    nvidia.com/gpu.count=2
                    nvidia.com/gpu.family=turing
                    nvidia.com/gpu.machine=PowerEdge-R7525
                    nvidia.com/gpu.memory=15360
                    nvidia.com/gpu.product=Tesla-T4
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
Annotations:        flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"8a:ec:37:1f:e5:1a"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: XXX.XXX.XXX.XXX
                    k3s.io/hostname: geo-node1
                    k3s.io/internal-ip: XXX.XXX.XXX.XXX
                    k3s.io/node-args: ["agent"]
                    k3s.io/node-config-hash: FZCHZFCL5KBSRTBWGCGIBHGDW6FDW2LIRBXCGNWFODJI3CKLHCOQ====
                    k3s.io/node-env:
                      {"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1","K3S_TOKEN":"********","K3S_U...
                    management.cattle.io/pod-limits: {}
                    management.cattle.io/pod-requests: {"pods":"4"}
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.FMA3,cpu-cpuid.SHA,cpu-cpuid.SSE4A,cpu-hardware_multithreading,cpu-rd...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 13 Mar 2023 08:47:09 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  geo-node1
  AcquireTime:     <unset>
  RenewTime:       Thu, 23 Mar 2023 08:32:08 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 23 Mar 2023 08:30:37 +0100   Wed, 22 Mar 2023 16:18:39 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  XXX.XXX.XXX.XXX
  Hostname:    geo-node1
Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  pods:               110
System Info:
  Machine ID:                 f7b72f135bcc4a0195cd924d62fd6437
  System UUID:                4c4c4544-0056-5710-8030-c4c04f4a5433
  Boot ID:                    c6549fac-8176-4c3e-95f3-8e369f793af8
  Kernel Version:             5.15.0-67-generic
  OS Image:                   Ubuntu 22.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.15-k3s1
  Kubelet Version:            v1.25.6+k3s1
  Kube-Proxy Version:         v1.25.6+k3s1
PodCIDR:                      10.42.1.0/24
PodCIDRs:                     10.42.1.0/24
ProviderID:                   k3s://geo-node1
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
  default                     gpu-feature-discovery-d5txw             0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 nvidia-device-plugin-daemonset-gwwtd    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 svclb-traefik-a0d27a00-wjvwp            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d17h
  node-feature-discovery      nfd-spvb2                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         14h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
Events:              <none>

The labels are set, but Capacity and Allocatable do not mention GPUs; I assume there should be additional entries there?
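
Since the nvidia.com/gpu capacity is advertised by the device plugin rather than by GFD, a reasonable next step is to check the device plugin's logs on this node (pod name taken from the output above):

# A device plugin that cannot load NVML will log an error here instead of
# registering the nvidia.com/gpu resource with the kubelet.
kubectl -n kube-system logs nvidia-device-plugin-daemonset-gwwtd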

I found these issues helpful:

@MichaelJendryke
Author

After some tinkering I can report that I got it to work just fine. I had forgotten runtimeClassName: nvidia in my NVIDIA device plugin manifest; after adding it, everything went well.
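
Note that runtimeClassName: nvidia in the manifests below only resolves if a RuntimeClass named nvidia exists; a minimal sketch, with the handler name assumed to match the "nvidia" runtime entry in the k3s containerd config shown earlier:

# Create the RuntimeClass that the DaemonSets below reference.
kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF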

For reference and in order:

  1. Get NFD to work with
# This template contains an example of running nfd-master and nfd-worker in the
# same pod.
#
apiVersion: v1
kind: Namespace
metadata:
  name: node-feature-discovery # NFD namespace
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nfd-master
  namespace: node-feature-discovery
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nfd-master
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nfd-master
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nfd-master
subjects:
- kind: ServiceAccount
  name: nfd-master
  namespace: node-feature-discovery
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: nfd
  name: nfd
  namespace: node-feature-discovery
spec:
  selector:
    matchLabels:
      app: nfd
  template:
    metadata:
      labels:
        app: nfd
    spec:
      serviceAccount: nfd-master
      runtimeClassName: nvidia
      containers:
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
          name: nfd-master
          command:
            - "nfd-master"
          args:
            - "--extra-label-ns=nvidia.com"
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          image: quay.io/kubernetes_incubator/node-feature-discovery:v0.6.0
          name: nfd-worker
          command:
            - "nfd-worker"
          args:
            - "--sleep-interval=60s"
            - "--options={\"sources\": {\"pci\": { \"deviceLabelFields\": [\"vendor\"] }}}"
          volumeMounts:
            - name: host-boot
              mountPath: "/host-boot"
              readOnly: true
            - name: host-os-release
              mountPath: "/host-etc/os-release"
              readOnly: true
            - name: host-sys
              mountPath: "/host-sys"
            - name: source-d
              mountPath: "/etc/kubernetes/node-feature-discovery/source.d/"
            - name: features-d
              mountPath: "/etc/kubernetes/node-feature-discovery/features.d/"
      volumes:
        - name: host-boot
          hostPath:
            path: "/boot"
        - name: host-os-release
          hostPath:
            path: "/etc/os-release"
        - name: host-sys
          hostPath:
            path: "/sys"
        - name: source-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/source.d/"
        - name: features-d
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d/"
  2. Get GFD running with
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-feature-discovery
  labels:
    app.kubernetes.io/name: gpu-feature-discovery
    app.kubernetes.io/version: 0.7.0
    app.kubernetes.io/part-of: nvidia-gpu
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gpu-feature-discovery
      app.kubernetes.io/part-of: nvidia-gpu
  template:
    metadata:
      labels:
        app.kubernetes.io/name: gpu-feature-discovery
        app.kubernetes.io/version: 0.7.0
        app.kubernetes.io/part-of: nvidia-gpu
    spec:
      runtimeClassName: nvidia
      containers:
        - image: nvcr.io/nvidia/gpu-feature-discovery:v0.7.0
          name: gpu-feature-discovery
          volumeMounts:
            - name: output-dir
              mountPath: "/etc/kubernetes/node-feature-discovery/features.d"
            - name: host-sys
              mountPath: "/sys"
          securityContext:
            privileged: true
          env:
            - name: MIG_STRATEGY
              value: none
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
              - key: feature.node.kubernetes.io/pci-10de.present
                operator: In
                values:
                - "true"
            - matchExpressions:
              # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
              - key: feature.node.kubernetes.io/cpu-model.vendor_id
                operator: In
                values:
                - "NVIDIA"
            - matchExpressions:
              # We allow a GFD deployment to be forced by setting the following label to "true"
              - key: "nvidia.com/gpu.present"
                operator: In
                values:
                - "true"
      volumes:
        - name: output-dir
          hostPath:
            path: "/etc/kubernetes/node-feature-discovery/features.d"
        - name: host-sys
          hostPath:
            path: "/sys"

If this is running, you should see labels being applied to your node.
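
For example (node name taken from this thread):

# The GFD-generated nvidia.com/* labels should show up alongside the NFD ones.
kubectl describe node geo-node1 | grep 'nvidia.com/'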

  3. Get the NVIDIA device plugin to work with
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Only once the device plugin has come up and registered with the kubelet will the kubectl describe node command show:

Capacity:
  cpu:                128
  ephemeral-storage:  14625108Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                128
  ephemeral-storage:  14227305052
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056410708Ki
  nvidia.com/gpu:     2
  pods:               110

After that you can run a GPU pod, as documented in the k3s guide.
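
A hedged example of such a test pod (the CUDA image tag is an assumption; the nvidia.com/gpu limit and runtimeClassName: nvidia are what make the scheduler and containerd put it on a GPU):

# Minimal GPU smoke test: the pod should print the same nvidia-smi table as
# the host, but from inside the container.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-smi
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
# Once the container has run to completion:
kubectl logs gpu-smoke-test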

@elezar
Member

elezar commented Feb 12, 2024

I'm closing this issue. The use of a runtime class allowed the labels to be generated.

@elezar elezar closed this as completed Feb 12, 2024