Possible Bug: scheduling protected-b notebooks to protected-b node pool on aaw-dev #1282

Open
Collinbrown95 opened this issue Jul 26, 2022 · 1 comment
Labels: kind/bug (Something isn't working), triage/support

Comments

Collinbrown95 commented Jul 26, 2022

Description

There appears to be a bug with how labels are applied to notebook pods from the podDefault kubeflow resource.

How to reproduce

Schedule a protected-b notebook through the kubeflow UI on aaw-dev.

image

The pod fails to schedule with the following gatekeeper message:

  Warning  FailedCreate  8m (x11 over 13m)   statefulset-controller  create Pod protected-b-test-0 in StatefulSet protected-b-test failed error: admission webhook "validation.gatekeeper.sh" denied the request: [disk-data-classification] volume <workspace-protected-b-test> data classification <protected-b> conflicts with pod <protected-b-test-0> data classification <unclassified>

I tried again, this time removing the persistent volume option so that there is no volume mounted to the pod.

image

When I remove this option, the pod is scheduled successfully, but not to a protected-b node pool; it is scheduled to one of the aks-useruc nodes instead.

Taking a look at the labels of the pod protected-b-test-2-0, the label notebook.statcan.gc.ca/protected-b: "true" is present, but there is no label called data.statcan.gc.ca/classification: protected-b.

Looking at the aaw-toleration-injector source code, the data.statcan.gc.ca/classification: protected-b toleration is created by checking if a corresponding pod label exists (see aaw-toleration-injector/mutate.go#131:155).

If this pod label never gets created, then the check on line 140 fails because the label does not exist on the pod, the else block executes, and the pod gets the toleration for unclassified and is subsequently scheduled to the unclassified node pool.
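As a rough illustration of the fallback described above, here is a minimal Python sketch. It is not the actual `mutate.go` code: the function name is hypothetical, and the toleration `operator` and `effect` values are assumptions. Only the label key and the two classification values come from this issue.

```python
# Illustrative sketch only; not the actual aaw-toleration-injector code.
CLASSIFICATION_KEY = "data.statcan.gc.ca/classification"


def classification_toleration(pod_labels):
    """Return the toleration a pod would receive based on its labels."""
    if pod_labels.get(CLASSIFICATION_KEY) == "protected-b":
        value = "protected-b"
    else:
        # Label missing (or any other value): fall back to unclassified.
        value = "unclassified"
    return {
        "key": CLASSIFICATION_KEY,
        "operator": "Equal",      # assumed
        "value": value,
        "effect": "NoSchedule",   # assumed
    }


# Labels as observed on protected-b-test-2-0: the classification label is absent.
observed = {"notebook.statcan.gc.ca/protected-b": "true"}
print(classification_toleration(observed)["value"])  # prints: unclassified
```

With the labels observed on the pod, the sketch takes the fallback branch, which matches the pod landing on an unclassified node.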

It looks like this label should be applied by the kubeflowv1alpha1/poddefault resource that is created in the following lines of aaw-kubeflow-profiles-controller/notebook.go#155:177.

I'm guessing there is some kind of mutating webhook that translates instances of poddefault into modifications to the pod spec during admission control, and that this mutating webhook is currently failing (maybe admission-webhook-deployment in the kubeflow namespace?). The cause may be something else too; I haven't investigated past this point.
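To make the guess concrete, this is a minimal Python sketch of what such a webhook would do with a poddefault. It is a simplified model, not the Kubeflow implementation. The selector and the injected label in `POD_DEFAULT` are assumptions based on the labels mentioned in this issue, not values copied from `notebook.go`.

```python
# Simplified model of a poddefault being applied at admission time.
# POD_DEFAULT is hypothetical; the real resource is created by
# aaw-kubeflow-profiles-controller and may differ.
POD_DEFAULT = {
    "selector": {"notebook.statcan.gc.ca/protected-b": "true"},
    "labels": {"data.statcan.gc.ca/classification": "protected-b"},
}


def matches(selector, pod_labels):
    """True when every selector label is present on the pod with the same value."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def apply_pod_default(pod_labels, pod_default):
    """Return the pod labels after the poddefault has (or has not) been applied."""
    merged = dict(pod_labels)
    if matches(pod_default["selector"], pod_labels):
        merged.update(pod_default["labels"])
    return merged


before = {"notebook.statcan.gc.ca/protected-b": "true"}
after = apply_pod_default(before, POD_DEFAULT)
print(after)
```

If the webhook never runs, the pod keeps only the `before` labels, which is what was observed on protected-b-test-2-0.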

Note: If the above is correct, it would also explain the gatekeeper error when I tried to attach a volume. If the pod gets assigned the label data.statcan.gc.ca/classification: unclassified, there is probably a gatekeeper policy flagging that a protected-b volume should not be mounted to an unclassified notebook. Of course, the protected-b notebook shouldn't be getting the data.statcan.gc.ca/classification: unclassified label in the first place, and this gatekeeper constraint shouldn't trigger once the label is data.statcan.gc.ca/classification: protected-b, as it should be.
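The suspected constraint can be modelled in a few lines of Python. This is only a sketch of the comparison, not the actual gatekeeper policy (which would be written in Rego). It assumes that a pod without the classification label is treated as unclassified; the function name is hypothetical, and the message format is copied from the event above.

```python
# Sketch of the suspected disk-data-classification check; not the real policy.
CLASSIFICATION_KEY = "data.statcan.gc.ca/classification"


def check_volume(pod_name, pod_labels, volume_name, volume_classification):
    """Return a violation message, or None when the classifications agree."""
    # Assumption: a missing label is treated as unclassified.
    pod_classification = pod_labels.get(CLASSIFICATION_KEY, "unclassified")
    if pod_classification != volume_classification:
        return (
            f"volume <{volume_name}> data classification <{volume_classification}> "
            f"conflicts with pod <{pod_name}> data classification <{pod_classification}>"
        )
    return None


# Pod labels without the classification label, as in the failing case.
print(check_volume(
    "protected-b-test-0",
    {"notebook.statcan.gc.ca/protected-b": "true"},
    "workspace-protected-b-test",
    "protected-b",
))
```

With the classification label present and set to protected-b, the same call returns `None`, so the volume would be admitted.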

Edit: it looks like the admission-webhook-deployment's pod is experiencing a TLS handshake error (tls: bad certificate) - it's possible that the webhook is not running on admission control, and so these pods are not getting the appropriate labels added.

@Souheil-Yazji

Issue does not seem to be live on prod, moving to backlog for now.
