MicroShift 4.15+ support with helm charts #745

arthur-r-oliveira · 2024-05-27T15:12:04Z

As a follow-up to the previous PR #702: testing the MicroShift 4.15+ integration through Helm charts, instead of the legacy static manifests, fails due to a lack of Pod Security Standards configuration:

[root@lenovo-p620-01 k8s-device-plugin]# microshift version
MicroShift Version: 4.15.13
Base OCP Version: 4.15.13
[root@lenovo-p620-01 k8s-device-plugin]# oc get nodes
NAME                                        STATUS   ROLES                         AGE   VERSION
lenovo-p620-01.khw.eng.lab.local   Ready    control-plane,master,worker   10d   v1.28.9
[root@lenovo-p620-01 k8s-device-plugin]# 

[root@lenovo-p620-01 k8s-device-plugin]# git branch -a
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/feature/microshift_timeslicing
  remotes/origin/main



[root@lenovo-p620-01 k8s-device-plugin]# oc label node lenovo-p620-01.khw.eng.lab.local --overwrite nvidia.com/gpu.present=true
node/lenovo-p620-01.khw.eng.lab.local labeled
[root@lenovo-p620-01 k8s-device-plugin]# oc get nodes --show-labels
NAME                                        STATUS   ROLES                         AGE   VERSION   LABELS
lenovo-p620-01.khw.eng.lab.local   Ready    control-plane,master,worker   10d   v1.28.9   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=lenovo-p620-01.khw.eng.lab.local,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node-role.kubernetes.io/worker=,node.openshift.io/os_id=rhel,nvidia.com/gpu.present=true,topology.topolvm.io/node=lenovo-p620-01.khw.eng.lab.local

[root@lenovo-p620-01 k8s-device-plugin]# cat << EOF > /tmp/dp-example-config0.yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF

[root@lenovo-p620-01 k8s-device-plugin]# cat /tmp/dp-example-config0.yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10


[root@lenovo-p620-01 k8s-device-plugin]#  helm upgrade -i nvdp deployments/helm/nvidia-device-plugin/     --version=0.15.0     --namespace nvidia-device-plugin     --create-namespace     --set-file config.map.config=/tmp/dp-example-config0.yaml
Release "nvdp" does not exist. Installing it now.
W0527 10:11:17.712709  924228 warnings.go:70] would violate PodSecurity "restricted:v1.24": privileged (containers "mps-control-daemon-mounts", "mps-control-daemon-ctr" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (containers "mps-control-daemon-mounts", "mps-control-daemon-init", "mps-control-daemon-sidecar", "mps-control-daemon-ctr" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "mps-control-daemon-mounts", "mps-control-daemon-init", "mps-control-daemon-sidecar", "mps-control-daemon-ctr" must set securityContext.capabilities.drop=["ALL"]), restricted volume types (volumes "mps-root", "mps-shm" use restricted volume type "hostPath"), runAsNonRoot != true (pod or containers "mps-control-daemon-mounts", "mps-control-daemon-init", "mps-control-daemon-sidecar", "mps-control-daemon-ctr" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "mps-control-daemon-mounts", "mps-control-daemon-init", "mps-control-daemon-sidecar", "mps-control-daemon-ctr" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
W0527 10:11:17.715342  924228 warnings.go:70] would violate PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (containers "nvidia-device-plugin-init", "nvidia-device-plugin-sidecar", "nvidia-device-plugin-ctr" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers "nvidia-device-plugin-init", "nvidia-device-plugin-sidecar", "nvidia-device-plugin-ctr" must set securityContext.capabilities.drop=["ALL"]; containers "nvidia-device-plugin-sidecar", "nvidia-device-plugin-ctr" must not include "SYS_ADMIN" in securityContext.capabilities.add), restricted volume types (volumes "device-plugin", "mps-root", "mps-shm", "cdi-root" use restricted volume type "hostPath"), runAsNonRoot != true (pod or containers "nvidia-device-plugin-init", "nvidia-device-plugin-sidecar", "nvidia-device-plugin-ctr" must set securityContext.runAsNonRoot=true), seccompProfile (pod or containers "nvidia-device-plugin-init", "nvidia-device-plugin-sidecar", "nvidia-device-plugin-ctr" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
NAME: nvdp
LAST DEPLOYED: Mon May 27 10:11:17 2024
NAMESPACE: nvidia-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None
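
The warnings above come from Pod Security Admission evaluating the chart's pods against the `restricted` profile. As a point of comparison only, here is a minimal sketch of how a namespace can declare a more permissive profile through the standard `pod-security.kubernetes.io/*` labels. This manifest is an assumption for illustration and is not part of this PR, which instead grants use of the `privileged` SCC through RBAC.

```yaml
# Hypothetical manifest, applied before installing the chart.
# The namespace name matches the --namespace value used above.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-device-plugin
  labels:
    # Pod Security Admission modes; "privileged" places no restrictions.
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```

Setting the `warn` label is what would silence the `would violate PodSecurity` messages; whether the pods are actually admitted on MicroShift still depends on the SCC in use.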

arthur-r-oliveira · 2024-05-27T15:14:21Z

@elezar I'm testing the Helm chart with small changes for MicroShift 4.15+ support.
Does this PR make sense to you as well?

[root@lenovo-p620-01 k8s-device-plugin]# git status
On branch feature/microshift_timeslicing
Your branch is up to date with 'origin/feature/microshift_timeslicing'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   deployments/helm/nvidia-device-plugin/values.yaml

no changes added to commit (use "git add" and/or "git commit -a")
[root@lenovo-p620-01 k8s-device-plugin]# tail -2 deployments/helm/nvidia-device-plugin/values.yaml
# microshift: "enabled"
microshift: "enabled"
[root@lenovo-p620-01 k8s-device-plugin]# microshift version
MicroShift Version: 4.15.13
Base OCP Version: 4.15.13
[root@lenovo-p620-01 k8s-device-plugin]# oc get nodes
NAME                                        STATUS   ROLES                         AGE   VERSION
lenovo-p620-01.khw.eng.lab.local   Ready    control-plane,master,worker   10d   v1.28.9
[root@lenovo-p620-01 k8s-device-plugin]# oc label node lenovo-p620-01.khw.eng.lab.local --overwrite nvidia.com/gpu.present=true
node/lenovo-p620-01.khw.eng.lab.local not labeled
[root@lenovo-p620-01 k8s-device-plugin]#  helm upgrade -i nvdp deployments/helm/nvidia-device-plugin/     --version=0.15.0     --namespace nvidia-device-plugin     --create-namespace     --set-file config.map.config=/tmp/dp-example-config0.yaml
Release "nvdp" has been upgraded. Happy Helming!
NAME: nvdp
LAST DEPLOYED: Mon May 27 11:12:58 2024
NAMESPACE: nvidia-device-plugin
STATUS: deployed
REVISION: 2
TEST SUITE: None

[root@lenovo-p620-01 k8s-device-plugin]# oc get pods
NAME                              READY   STATUS    RESTARTS   AGE
nvdp-nvidia-device-plugin-rnpkt   2/2     Running   0          3m56s
[root@lenovo-p620-01 k8s-device-plugin]# 
[root@lenovo-p620-01 k8s-device-plugin]# oc get node -o json | jq -r '.items[0].status.capacity'
{
  "cpu": "24",
  "ephemeral-storage": "225245Mi",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "65404256Ki",
  "nvidia.com/gpu": "10",
  "pods": "250"
}

@fabiendupont FYI

ArangoGutierrez

Is there a major difference between MicroShift and OpenShift? We don't have Helm values for OpenShift. Are you not deploying the device plugin via the operator?

arthur-r-oliveira · 2024-05-27T16:15:09Z

@ArangoGutierrez thanks for reviewing this issue.

In addition to standard Kubernetes APIs, MicroShift includes just a small subset of the APIs supported by OpenShift Container Platform:

Route / route.openshift.io/v1
SecurityContextConstraints / security.openshift.io/v1

See more at:
https://access.redhat.com/documentation/en-us/red_hat_build_of_microshift/4.15/html/getting_started/microshift-architecture#microshift-differences-oke_microshift-architecture
https://docs.redhat.com/en/documentation/openshift_container_platform/4.15/html/architecture/nvidia-gpu-architecture-overview#nvidia-gpu-enablement_nvidia-gpu-architecture-overview

For OpenShift, we've been working with the NVIDIA Operator so far, due to the full enterprise nature of OpenShift (cluster-operator based, with native OLM).

For MicroShift, although it seems to work fine with NVIDIA's Operator as well, we are targeting resource-constrained environments and also want to support RPM-based systems (not only ostree ones). Because of that, the focus is to stay compatible with as small a footprint as possible, such as running just nvidia-device-plugin.

With this PR, I'm basically extending this existing doc https://docs.nvidia.com/datacenter/cloud-native/edge/latest/nvidia-gpu-with-device-edge.html# with Helm charts while also testing time-slicing. Although these changes can also benefit OpenShift deployments, the NVIDIA Operator already covers that use case well, and the focus here stays on MicroShift only.

arthur-r-oliveira · 2024-05-31T15:16:54Z

Is there a major difference between microshift and openshift? we don't have Helm values for Openshift. Are you not deploying the device plugin via the operator?

See [1] for the major differences between MicroShift and OpenShift, and [2] about simplifying the PR. With the latest changes, as recommended by @elezar, we don't need extra vars, which makes it transparent for users.

elezar · 2024-05-31T15:20:57Z

deployments/helm/nvidia-device-plugin/values.yaml

@@ -149,4 +149,4 @@ mps:
  # be created. This includes a daemon-specific /dev/shm and pipe and log
  # directories.
  # Pipe directories will be created at {{ mps.root }}/{{ .ResourceName }}
-  root: "/run/nvidia/mps"
+  root: "/run/nvidia/mps"


This seems like an oversight due to the removal of the explicit value.

Sorry for this typo. In the end, it is better to remove this file from the PR. What do you think?

elezar

Thanks @arthur-r-oliveira. I think it looks a lot better now.

I have some minor comments / questions.

I would also like @cdesiniotis to just have a look before going ahead with merging this though.

elezar · 2024-05-31T15:21:50Z

deployments/helm/nvidia-device-plugin/templates/role.yml

+  {{- if .Capabilities.APIVersions.Has "security.openshift.io/v1/SecurityContextConstraints" }}
+  - apiGroups:
+      - security.openshift.io
+    resourceNames:
+      - privileged
+    resources:
+      - securitycontextconstraints
+    verbs:
+      - use
+  {{- end }}


Does it make sense to shift this to a named template to use here and below? Not a blocker though.
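
One possible shape for such a named template is sketched here. The helper name `nvidia-device-plugin.sccUseRule` and its placement in `templates/_helpers.tpl` are assumptions for illustration, not something the PR defines; the rule body is copied from the diff above.

```yaml
{{- /* templates/_helpers.tpl: hypothetical helper, name chosen for illustration */ -}}
{{- define "nvidia-device-plugin.sccUseRule" -}}
{{- if .Capabilities.APIVersions.Has "security.openshift.io/v1/SecurityContextConstraints" -}}
- apiGroups:
    - security.openshift.io
  resourceNames:
    - privileged
  resources:
    - securitycontextconstraints
  verbs:
    - use
{{- end -}}
{{- end -}}

{{- /* templates/role.yml: the same include could serve both the ClusterRole and the Role */ -}}
rules:
  {{- /* existing rules stay as they are */}}
  {{- include "nvidia-device-plugin.sccUseRule" . | nindent 2 }}
```

When the SCC API is absent the helper renders nothing, so the include only leaves whitespace behind and the existing rules are unaffected.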

elezar · 2024-05-31T15:25:09Z

deployments/helm/nvidia-device-plugin/templates/role.yml

+{{- if .Capabilities.APIVersions.Has "security.openshift.io/v1/SecurityContextConstraints" }}
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role


Could we create a list consisting of ClusterRole and Role and loop over these to construct both? (Note that for the default case we would only construct a ClusterRole.)

I'm happy to do these as a follow-up.
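
A rough sketch of what that loop could look like follows. It assumes the chart's `nvidia-device-plugin.fullname` helper is available; the `-role` name suffix and the placement of the pre-existing rules are hypothetical and would need to match the real `role.yml`.

```yaml
{{- /* Sketch only: resource names and rule placement are assumptions. */ -}}
{{- $hasSCC := .Capabilities.APIVersions.Has "security.openshift.io/v1/SecurityContextConstraints" }}
{{- /* Default case: only a ClusterRole. A namespaced Role is added when the SCC API exists. */}}
{{- $kinds := list "ClusterRole" }}
{{- if $hasSCC }}
{{- $kinds = append $kinds "Role" }}
{{- end }}
{{- range $kind := $kinds }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: {{ $kind }}
metadata:
  name: {{ include "nvidia-device-plugin.fullname" $ }}-role
  {{- if eq $kind "Role" }}
  namespace: {{ $.Release.Namespace }}
  {{- end }}
rules:
  # Rules already defined in role.yml for this kind would go here.
  {{- if $hasSCC }}
  - apiGroups:
      - security.openshift.io
    resourceNames:
      - privileged
    resources:
      - securitycontextconstraints
    verbs:
      - use
  {{- end }}
{{- end }}
```

Inside `range` the dot is rebound to the loop item, which is why the sketch reaches the root context through `$` for `Release` and the `include` call.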

arthur-r-oliveira added 3 commits May 27, 2024 10:32

85545c8 Due current Pod Security standard, it was needed to Extend ClusterRole and create new Role for allowing nvdp to run with MicroShift 4.15+ Signed-off-by: Arthur Oliveira <arolivei@redhat.com>

f1b03e4 Adding microshift flag at values.yaml and conditions with role and role-binding templates Signed-off-by: Arthur Oliveira <arolivei@redhat.com>

4c7ccce Letting microshift flag at values.yaml as empty and comment about the new var Signed-off-by: Arthur Oliveira <arolivei@redhat.com>

arthur-r-oliveira changed the title ~~Feature/microshift timeslicing~~ MicroShift 4.15+ support with helm charts May 27, 2024

ArangoGutierrez requested changes May 27, 2024

arthur-r-oliveira requested a review from ArangoGutierrez May 28, 2024 07:45

elezar reviewed May 29, 2024

deployments/helm/nvidia-device-plugin/templates/role-binding.yml (outdated, resolved)

3b32151 Using helm template option .Capabilities.APIVersions.Has instead of a new var with values.yaml Signed-off-by: Arthur Oliveira <arolivei@redhat.com>

elezar reviewed May 31, 2024
