
Sync NFD Helm Charts and config with the GPU-Operator #557

Merged: 3 commits into Mellanox:master on Jul 4, 2023

Conversation

@ArangoGutierrez (Contributor)

This patch:

  • Enhances the README's NFD documentation
  • Updates the NFD chart to v0.13.2
  • Syncs the nfd values in values.yaml with those in the GPU-Operator

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@rollandf (Member) left a comment

Thanks @ArangoGutierrez !!

Regarding values.yaml: it is generated by make release-build.

So the changes should be done in the template here:
https://github.com/Mellanox/network-operator/blob/master/hack/templates/values/values.template#L19
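
For context, a sketch of the nfd block in that template, reconstructed from the diff hunk reviewed later in this thread; anything beyond these three lines is not shown, and the surrounding layout is an assumption:

nfd:
  enabled: true
  nodeFeatureRules: false   # the key added by this PR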

@abdallahyas (Contributor)

/retest-nic_operator_kind

1 similar comment
@abdallahyas (Contributor)

/retest-nic_operator_kind

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_kind

@ArangoGutierrez (Contributor, Author)

@adrianchiris @rollandf PTAL

@rollandf (Member) left a comment

LGTM

@adrianchiris (Collaborator)

can be merged once CI is green

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

4 similar comments
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_helm

@adrianchiris (Collaborator)

Looking at the CI logs (at the moment they are not available for query from GitHub due to an internal issue), I see that when deploying the network-operator via Helm, the NicClusterPolicy CR fails to become ready.

The CI deploys it with a mock MOFED container (to avoid long compile times) as well as with the rdma-shared-device-plugin. Both of these states are not ready; they end up in this state if their pods either fail to start or are not scheduled on any node.

The latter may happen if NFD did not label the nodes with feature.node.kubernetes.io/pci-15b3.present: "true"

15:35:03 TASK [projects-common/wait-k8s-resource : Wait Resource nicclusterpolicies status field to become available] ***
15:35:03 task path: /jenkins/workspace/nic_operator_helm-ci/ansible/roles/projects-common/wait-k8s-resource/tasks/main.yml:15
15:35:03 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:03 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:04 FAILED - RETRYING: Wait Resource nicclusterpolicies status field to become available (60 retries left).
15:35:14 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:14 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:16 ok: [localhost] => {"api_found": true, "attempts": 2, "changed": false, "resources": [{"apiVersion": "mellanox.com/v1alpha1", "kind": "NicClusterPolicy", "metadata": {"annotations": {"meta.helm.sh/release-name": "network-operator-helm-ci", "meta.helm.sh/release-namespace": "sriov-network-operator"}, "creationTimestamp": "2023-06-27T12:34:49Z", "generation": 1, "labels": {"app.kubernetes.io/managed-by": "Helm"}, "managedFields": [{"apiVersion": "mellanox.com/v1alpha1", "fieldsType": "FieldsV1", "fieldsV1": {"f:metadata": {"f:annotations": {".": {}, "f:meta.helm.sh/release-name": {}, "f:meta.helm.sh/release-namespace": {}}, "f:labels": {".": {}, "f:app.kubernetes.io/managed-by": {}}}, "f:spec": {".": {}, "f:ofedDriver": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:livenessProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:readinessProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:repository": {}, "f:startupProbe": {".": {}, "f:initialDelaySeconds": {}, "f:periodSeconds": {}}, "f:terminationGracePeriodSeconds": {}, "f:upgradePolicy": {".": {}, "f:autoUpgrade": {}, "f:drain": {".": {}, "f:deleteEmptyDir": {}, "f:enable": {}, "f:force": {}, "f:podSelector": {}, "f:timeoutSeconds": {}}, "f:maxParallelUpgrades": {}}, "f:version": {}}, "f:psp": {".": {}, "f:enabled": {}}, "f:rdmaSharedDevicePlugin": {".": {}, "f:config": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:secondaryNetwork": {".": {}, "f:cniPlugins": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:ipamPlugin": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}, "f:multus": {".": {}, "f:image": {}, "f:imagePullSecrets": {}, "f:repository": {}, "f:version": {}}}}}, "manager": "Go-http-client", "operation": "Update", "time": "2023-06-27T12:34:49Z"}, {"apiVersion": "mellanox.com/v1alpha1", "fieldsType": "FieldsV1", "fieldsV1": {"f:status": {".": {}, "f:appliedStates": {}, "f:state": {}}}, "manager": "manager", "operation": "Update", "time": "2023-06-27T12:35:10Z"}], "name": "nic-cluster-policy", "resourceVersion": "2694", "uid": "27cdd5ac-0c5e-4333-9a02-3e172cd6a0cd"}, "spec": {"ofedDriver": {"image": "modified-mofed", "imagePullSecrets": [], "livenessProbe": {"initialDelaySeconds": 30, "periodSeconds": 30}, "readinessProbe": {"initialDelaySeconds": 10, "periodSeconds": 30}, "repository": "docker.io/mellanox", "startupProbe": {"initialDelaySeconds": 10, "periodSeconds": 20}, "terminationGracePeriodSeconds": 300, "upgradePolicy": {"autoUpgrade": false, "drain": {"deleteEmptyDir": false, "enable": true, "force": false, "podSelector": "", "timeoutSeconds": 300}, "maxParallelUpgrades": 1}, "version": "23.04-0.5.3.3.1"}, "psp": {"enabled": false}, "rdmaSharedDevicePlugin": {"config": "{\n  \"configList\": [\n    {\n      \"resourceName\": \"rdma_shared_devices_a\",\n      \"rdmaHcaMax\": 1000,\n      \"selectors\": {\n        \"vendors\": [],\n        \"deviceIDs\": [],\n        \"drivers\": [],\n        \"ifNames\": [\"enp4s0f0\",\"enp4s0f1\"],\n        \"linkTypes\": []\n      }\n    }\n  ]\n}\n", "image": "k8s-rdma-shared-dev-plugin", "imagePullSecrets": [], "repository": "nvcr.io/nvidia/cloud-native", "version": "v1.3.2"}, "secondaryNetwork": {"cniPlugins": {"image": "plugins", "imagePullSecrets": [], "repository": "ghcr.io/k8snetworkplumbingwg", "version": "v1.2.0-amd64"}, "ipamPlugin": {"image": "whereabouts", "imagePullSecrets": [], "repository": 
"ghcr.io/k8snetworkplumbingwg", "version": "v0.6.1-amd64"}, "multus": {"image": "multus-cni", "imagePullSecrets": [], "repository": "ghcr.io/k8snetworkplumbingwg", "version": "v3.9.3"}}}, "status": {"appliedStates": [{"name": "state-pod-security-policy", "state": "ignore"}, {"name": "state-multus-cni", "state": "notReady"}, {"name": "state-container-networking-plugins", "state": "notReady"}, {"name": "state-ipoib-cni", "state": "ignore"}, {"name": "state-whereabouts-cni", "state": "notReady"}, {"name": "state-OFED", "state": "notReady"}, {"name": "state-SRIOV-device-plugin", "state": "ignore"}, {"name": "state-RDMA-device-plugin", "state": "notReady"}, {"name": "state-NV-Peer", "state": "ignore"}, {"name": "state-ib-kubernetes", "state": "ignore"}, {"name": "state-nv-ipam-cni", "state": "ignore"}], "state": "notReady"}}]}
15:35:16 
15:35:16 TASK [projects-common/wait-k8s-resource : Wait Resource nicclusterpolicies to be ready] ***
15:35:16 task path: /jenkins/workspace/nic_operator_helm-ci/ansible/roles/projects-common/wait-k8s-resource/tasks/main.yml:29
15:35:16 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:16 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:17 FAILED - RETRYING: Wait Resource nicclusterpolicies to be ready (60 retries left).
15:35:27 redirecting (type: modules) ansible.builtin.k8s_info to community.kubernetes.k8s_info
15:35:27 redirecting (type: modules) community.kubernetes.k8s_info to kubernetes.core.k8s_info
15:35:28 FAILED - RETRYING: Wait Resource nicclusterpolicies to be ready (59 retries left).

@adrianchiris (Collaborator)

Comparing a successful run of another PR, I can see that when getting node information the feature.node.kubernetes.io/pci-15b3.present: "true" label is present, but going through the logs of this PR's CI job the label is missing.
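
For reference, this is roughly how that label shows up on a node object once NFD detects a Mellanox (PCI vendor 0x15b3) device; a minimal illustrative sketch (node name is hypothetical), not taken from the CI logs:

apiVersion: v1
kind: Node
metadata:
  name: worker-0        # hypothetical node name
  labels:
    # the label the operator's workloads are scheduled against, per the comments above
    feature.node.kubernetes.io/pci-15b3.present: "true"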

@@ -18,6 +18,7 @@
 nfd:
   enabled: true
+  nodeFeatureRules: false

Collaborator

question: where is this being used?

Contributor (Author)

GFD (GPU Feature Discovery) will start using it soon. For this release we have it as optional, but for the next one we plan to change it to enabled by default.

Collaborator

This section is not propagated to the node-feature-discovery chart, and currently we don't deploy anything related to GFD via Helm.

Should this be under node-feature-discovery? (Although I didn't find this parameter in the node-feature-discovery chart values.)

Collaborator

We have enableNodeFeatureApi under the node-feature-discovery section, so I guess the parameter here is not needed?
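
For orientation, a minimal sketch of the two values sections being discussed, using only the key names that appear in this thread (nfd.nodeFeatureRules from the diff above, node-feature-discovery.enableNodeFeatureApi from this comment); the values shown are illustrative, not the chart defaults:

nfd:
  enabled: true
  nodeFeatureRules: false          # added by this PR; per the comment above, not yet consumed by any template

node-feature-discovery:
  enableNodeFeatureApi: true       # NFD subchart flag referenced in this comment (value shown for illustration)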

@adrianchiris (Collaborator)

adrianchiris commented Jul 2, 2023

I have taken a look at the CI issue. If we enable the node feature rules feature in NFD, we also need to enable the node feature controller or crd-controller, as that is the entity which labels the nodes.

Currently both are defaulted to null, and since we use the instance namespace it will not be enabled by default.

https://github.com/kubernetes-sigs/node-feature-discovery/blob/10bbc8f253a3ebf4cb77bb5e7c2c910040065cf3/deployment/helm/node-feature-discovery/templates/master.yaml#L108C30-L108C40

Also, I saw there is a typo in the chart WRT the crd-controller. See: https://github.com/kubernetes-sigs/node-feature-discovery/blob/10bbc8f253a3ebf4cb77bb5e7c2c910040065cf3/deployment/helm/node-feature-discovery/templates/master.yaml#L105

I'll submit a PR to fix it.
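
To make that concrete, a hedged sketch of the kind of NFD subchart override this comment implies; enableNodeFeatureApi is the flag named earlier in this thread, while the master.crdController key name is an assumption about the upstream chart and may differ between versions (the comment above notes the relevant value currently defaults to null):

node-feature-discovery:
  enableNodeFeatureApi: true    # expose NodeFeature objects (flag named earlier in this thread)
  master:
    crdController: true         # assumed key name: enable the controller that actually labels the nodes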

@ArangoGutierrez (Contributor, Author)

CI is green now! Thanks for your help understanding the logs, @adrianchiris.
cc @rollandf

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez (Contributor, Author)

/retest-nic_operator_kind

@adrianchiris (Collaborator) left a comment

LGTM! Let's wait for CI to pass, then it can be merged.

@ArangoGutierrez (Contributor, Author)

All green!!!

@adrianchiris adrianchiris merged commit b83f735 into Mellanox:master Jul 4, 2023
9 checks passed

5 participants