
Sync NFD Helm Charts and config with the GPU-Operator #557

Merged: 3 commits into Mellanox:master on Jul 4, 2023

Conversation

@ArangoGutierrez (Contributor)

This patch:

  • Enhances the README's NFD documentation
  • Updates the NFD chart to v0.13.2
  • Syncs the nfd values in values.yaml with those in the GPU-Operator

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@rollandf (Member) left a comment

Thanks @ArangoGutierrez !!

Regarding values.yaml: it is generated by make release-build.

So the changes should be done in the template here:
https://github.com/Mellanox/network-operator/blob/master/hack/templates/values/values.template#L19
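
For context, a sketch of the nfd block in that template, reconstructed from the diff hunk reviewed later in this thread; anything beyond these three lines is not shown, and the surrounding layout is an assumption:

nfd:
  enabled: true
  nodeFeatureRules: false   # the key added by this PR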

@abdallahyas (Contributor)

/retest-nic_operator_kind

1 similar comment
@abdallahyas (Contributor)

/retest-nic_operator_kind

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_kind

@ArangoGutierrez (Contributor, Author)

@adrianchiris @rollandf PTAL

@rollandf (Member) left a comment

LGTM

@adrianchiris (Collaborator)

can be merged once CI is green

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

4 similar comments
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@adrianchiris (Collaborator)

Looking at the CI logs (at the moment they are not available for query from GitHub due to an internal issue), I see that when deploying the network-operator via Helm, the NicClusterPolicy CR fails to become ready.

The CI deploys it with a mock MOFED container (to avoid long compile times) as well as with the rdma-shared-device-plugin. Both of these states are not ready; they end up in this state if their pods either fail to start or are not scheduled on any node.

The latter may happen if NFD did not label the nodes with feature.node.kubernetes.io/pci-15b3.present: "true"

15:35:03 TASK [projects-common/wait-k8s-resource : Wait Resource nicclusterpolicies status field to become available] ***
15:35:03 task path: /jenkins/workspace/nic_operator_helm-ci/ansible/roles/projects-common/wait-k8s-resource/tasks/main.yml:15
15:35:03 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:03 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:04 FAILED - RETRYING: Wait Resource nicclusterpolicies status field to become available (60 retries left).
15:35:14 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:14 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:16 ok: [localhost] => {"api_found": true, "attempts": 2, "changed": false, "resources": [{"apiVersion": "mellanox.com/v1alpha1", "kind": "NicClusterPolicy", "metadata": {"annotations": {"meta.helm.sh/release-name": "network-operator-helm-ci", "meta.helm.sh/release-namespace": "sriov-network-operator"}, "creationTimestamp": "2023-06-27T12:34:49Z", "generation": 1, "labels": {"app.kubernetes.io/managed-by": "Helm"}, "managedFields": [{"apiVersion": "mellanox.com/v1alpha1", "fieldsType": "FieldsV1", "fieldsV1": {"f:metadata": {"f:annotations": {".": {}, "f:meta.helm.sh/release-name": {}, "f:meta.helm.sh/release-namespace": {}}, "f:labels": {".": {}, "f:app.kubernetes.io/managed-by": {}}}, "f:spec": {".": {}, "f:ofedDriver": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:livenessProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:readinessProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:repository": {}, "f:startupProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:terminationGracePeriodSeconds": {}, "f:upgradePolicy": {".": {}, "f:autoUpgrade": {}, "f:drain": {".": {}, "f:deleteEmptyDir": {}, "f:enable": {}, "f:force": {}, "f:podSelector": {}, "f:timeoutSeconds": {}}, "f:maxParallelUpgrades": {}}, "f:version": {}}, "f:psp": {".": {}, "f:enabled": {}}, "f:rdmaSharedDevicePlugin": {".": {}, "f:config": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:secondaryNetwork": {".": {}, "f:cniPlugins": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:ipamPlugin": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:multus": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}}}}, "manager": "Go-http-client", "operation": "Update", "time": "2023-06-27T12:34:49Z"}, {"apiVersion": "mellanox.com/v1alpha1", "fieldsType": "FieldsV1", "fieldsV1": {"f:status": {".": {}, "f:appliedStates": {}, "f:state": {}}}, "manager": "manager", "operation": "Update", "time": "2023-06-27T12:35:10Z"}], "name": "nic-cluster-policy", "resourceVersion": "2694", "uid": "27cdd5ac-0c5e-4333-9a02-3e172cd6a0cd"}, "spec": {"ofedDriver": {"image": "modified-mofed", "imagePullSecrets": [], "livenessProbe": {"initialDelaySeconds": 30, "periodSeconds": 30}, "readinessProbe": {"initialDelaySeconds": 10, "periodSeconds": 30}, "repository": "docker.io/mellanox", "startupProbe": {"initialDelaySeconds": 10, "periodSeconds": 20}, "terminationGracePeriodSeconds": 300, "upgradePolicy": {"autoUpgrade": false, "drain": {"deleteEmptyDir": false, "enable": true, "force": false, "podSelector": "", "timeoutSeconds": 300}, "maxParallelUpgrades": 1}, "version": "23.04-0.5.3.3.1"}, "psp": {"enabled": false}, "rdmaSharedDevicePlugin": {"config": "{\n  \"configList\": [\n    {\n      \"resourceName\": \"rdma_shared_devices_a\",\n      \"rdmaHcaMax\": 1000,\n      \"selectors\": {\n        \"vendors\": [],\n        \"deviceIDs\": [],\n        \"drivers\": [],\n        \"ifNames\": [\"enp4s0f0\",\"enp4s0f1\"],\n        \"linkTypes\": []\n      }\n    }\n  ]\n}\n", "image": "k8s-rdma-shared-dev-plugin", "imagePullSecrets": [], "repository": "nvcr.io/nvidia/cloud-native", "version": "v1.3.2"}, "secondaryNetwork": {"cniPlugins": {"image": "plugins", "imagePullSecrets": [], "repository": "ghcr.io/k8snetworkplumbingwg", "version": "v1.2.0-amd64"}, "ipamPlugin": {"image": "whereabouts", "imagePullSecrets": [], "repository": 
"ghcr.io/k8snetworkplumbingwg", "version": "v0.6.1-amd64"}, "multus": {"image": "multus-cni", "imagePullSecrets": [], "repository": "ghcr.io/k8snetworkplumbingwg", "version": "v3.9.3"}}}, "status": {"appliedStates": [{"name": "state-pod-security-policy", "state": "ignore"}, {"name": "state-multus-cni", "state": "notReady"}, {"name": "state-container-networking-plugins", "state": "notReady"}, {"name": "state-ipoib-cni", "state": "ignore"}, {"name": "state-whereabouts-cni", "state": "notReady"}, {"name": "state-OFED", "state": "notReady"}, {"name": "state-SRIOV-device-plugin", "state": "ignore"}, {"name": "state-RDMA-device-plugin", "state": "notReady"}, {"name": "state-NV-Peer", "state": "ignore"}, {"name": "state-ib-kubernetes", "state": "ignore"}, {"name": "state-nv-ipam-cni", "state": "ignore"}], "state": "notReady"}}]}
15:35:16 
15:35:16 TASK [projects-common/wait-k8s-resource : Wait Resource nicclusterpolicies to be ready] ***
15:35:16 task path: /jenkins/workspace/nic_operator_helm-ci/ansible/roles/projects-common/wait-k8s-resource/tasks/main.yml:29
15:35:16 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:16 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:17 FAILED - RETRYING: Wait Resource nicclusterpolicies to be ready (60 retries left).
15:35:27 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:27 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:28 FAILED - RETRYING: Wait Resource nicclusterpolicies to be ready (59 retries left).

@adrianchiris (Collaborator)

Comparing a successful run of another PR, I can see that when getting node information the feature.node.kubernetes.io/pci-15b3.present: "true" label is present, but going through the logs of this PR's CI job the label is missing.
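
For reference, this is roughly how that label shows up on a node object once NFD detects a Mellanox (PCI vendor 0x15b3) device; a minimal illustrative sketch (node name is hypothetical), not taken from the CI logs:

apiVersion: v1
kind: Node
metadata:
  name: worker-0        # hypothetical node name
  labels:
    # the label the operator's workloads are scheduled against, per the comments above
    feature.node.kubernetes.io/pci-15b3.present: "true"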

@@ -18,6 +18,7 @@
 nfd:
   enabled: true
+  nodeFeatureRules: false

Collaborator

question: where is this being used?

Contributor (Author)

GFD (GPU Feature Discovery) will start using it soon. For this release we have it as optional, but for the next one we plan to change it to enabled by default.

Collaborator

This section is not propagated to the node-feature-discovery chart, and currently we don't deploy anything related to GFD via Helm.

Should this be under node-feature-discovery? (Although I didn't find this parameter in the node-feature-discovery chart values.)

Collaborator

We have enableNodeFeatureApi under the node-feature-discovery section, so I guess the parameter here is not needed?
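
For orientation, a minimal sketch of the two values sections being discussed, using only the key names that appear in this thread (nfd.nodeFeatureRules from the diff above, node-feature-discovery.enableNodeFeatureApi from this comment); the values shown are illustrative, not the chart defaults:

nfd:
  enabled: true
  nodeFeatureRules: false          # added by this PR; per the comment above, not yet consumed by any template

node-feature-discovery:
  enableNodeFeatureApi: true       # NFD subchart flag referenced in this comment (value shown for illustration)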

@adrianchiris (Collaborator)

adrianchiris commented Jul 2, 2023

I have taken a look at the CI issue. If we enable the node feature rules feature in NFD, we also need to enable the node feature controller or crd-controller, as that is the entity which labels the nodes.

Currently both are defaulted to null, and since we use the instance namespace it will not be enabled by default.

https://github.com/kubernetes-sigs/node-feature-discovery/blob/10bbc8f253a3ebf4cb77bb5e7c2c910040065cf3/deployment/helm/node-feature-discovery/templates/master.yaml#L108C30-L108C40

Also, I saw there is a typo in the chart WRT the crd-controller. See: https://github.com/kubernetes-sigs/node-feature-discovery/blob/10bbc8f253a3ebf4cb77bb5e7c2c910040065cf3/deployment/helm/node-feature-discovery/templates/master.yaml#L105

I'll submit a PR to fix it.
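
To make that concrete, a hedged sketch of the kind of NFD subchart override this comment implies; enableNodeFeatureApi is the flag named earlier in this thread, while the master.crdController key name is an assumption about the upstream chart and may differ between versions (the comment above notes the relevant value currently defaults to null):

node-feature-discovery:
  enableNodeFeatureApi: true    # expose NodeFeature objects (flag named earlier in this thread)
  master:
    crdController: true         # assumed key name: enable the controller that actually labels the nodes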

@ArangoGutierrez (Contributor, Author)

CI is green now! Thanks for your help understanding the logs, @adrianchiris.
cc @rollandf

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_kind

@adrianchiris (Collaborator) left a comment

LGTM! Let's wait for CI to pass, then it can be merged.

@ArangoGutierrez (Contributor, Author)

All green!!!

@adrianchiris adrianchiris merged commit b83f735 into Mellanox:master Jul 4, 2023
9 checks passed

5 participants