Incorrect deviceClassWhitelist configuration is provided #957

Open
fprzewozny opened this issue May 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments


fprzewozny commented May 27, 2024

Hey,
While going through a live system's configuration, I noticed that the network-operator-node-feature-discovery-worker-conf ConfigMap contains an incorrect device class whitelist:

apiVersion: v1
data:
  nfd-worker.conf: |-
    sources:
      pci:
        deviceClassWhitelist:
        - "0300"
        - "0302"
        deviceLabelFields:
        - vendor
kind: ConfigMap

According to the PCI-SIG specifications, base class 03 is Display controller; subclass 00 of class 03 is VGA-compatible controller, and subclass 02 of class 03 is 3D controller. So the configuration provided with the operator translates to:

    deviceClassWhitelist:
    - "0300"  # VGA-compatible controller
    - "0302"  # 3D controller

With these filters, network-operator-node-feature-discovery appears to be configured to gather GPU data (which should be done by, for example, https://github.com/NVIDIA/gpu-feature-discovery, which has a similar configuration issue; I will link it here once it's created). In my opinion, deviceClassWhitelist should contain only entries from base class 02 (Network controller).

In the code repo it can be found here:

and

In my opinion, deviceClassWhitelist for network-operator should contain only the 0200 and 0207 entries.
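
A minimal sketch of what that proposal would look like in nfd-worker.conf, assuming the rest of the ConfigMap stays unchanged. This is only an illustration of the suggestion, not the configuration the operator ships; the class names in the comments follow the PCI class code list (base class 02 is Network controller).

```yaml
# Sketch of the proposed fragment (assumption: only the whitelist entries change).
sources:
  pci:
    deviceClassWhitelist:
      - "0200"  # Network controller: Ethernet controller
      - "0207"  # Network controller: InfiniBand controller
    deviceLabelFields:
      - vendor
```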

Thank you,
Franciszek


fprzewozny commented May 27, 2024

Created a bug against gpu-feature-discovery as well: NVIDIA/k8s-device-plugin#729

adrianchiris (Collaborator) commented

Hi @fprzewozny, we use the NFD (Node Feature Discovery) NodeFeature API [1] and deploy a NodeFeatureRule [2][3] object that triggers NFD to label the node with the labels required by network-operator (feature.node.kubernetes.io/pci-15b3.present).

We keep the GPU classes in deviceClassWhitelist so that NFD exposes its default GPU-related labels, which is needed when using the NVIDIA GPU Operator.
The reason is that we expect only one instance of NFD to be deployed in the cluster.
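
For readers unfamiliar with NodeFeatureRule, a rough sketch of a rule that would produce the label mentioned above might look like the following. This is not the rule the operator actually deploys; the object name and rule name are hypothetical, and the match on PCI vendor 15b3 is an assumption based on the label name.

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: example-pci-15b3-rule  # hypothetical name, for illustration only
spec:
  rules:
    - name: "PCI device with vendor 15b3 present"  # hypothetical rule name
      labels:
        # Label referenced in the comment above.
        "feature.node.kubernetes.io/pci-15b3.present": "true"
      matchFeatures:
        # Assumption: match any discovered PCI device whose vendor ID is 15b3.
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["15b3"]}
```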

[1] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeature-custom-resource
[2] https://kubernetes-sigs.github.io/node-feature-discovery/v0.16/usage/customization-guide.html#nodefeaturerule-custom-resource
[3]
