Description
There appears to be a bug with how labels are applied to notebook pods from the podDefault kubeflow resource.
How to reproduce
Schedule a protected-b notebook through the kubeflow UI on aaw-dev.
The pod fails to schedule with the following gatekeeper message:
Warning FailedCreate 8m (x11 over 13m) statefulset-controller create Pod protected-b-test-0 in StatefulSet protected-b-test failed error: admission webhook "validation.gatekeeper.sh" denied the request: [disk-data-classification] volume <workspace-protected-b-test> data classification <protected-b> conflicts with pod <protected-b-test-0> data classification <unclassified>
I tried again, this time removing the persistent volume option so that there is no volume mounted to the pod.
When I remove this option, the pod is scheduled successfully, but not to a protected-b node pool; it's scheduled on one of the aks-useruc nodes instead.
Taking a look at the labels of the pod protected-b-test-2-0, the label notebook.statcan.gc.ca/protected-b: "true" is present, but there is no label called data.statcan.gc.ca/classification: protected-b.
Looking at the aaw-toleration-injector source code, the data.statcan.gc.ca/classification: protected-b toleration is only added if a corresponding pod label exists (see aaw-toleration-injector/mutate.go#131:155).
If this pod label never gets created, the check on line 140 fails because the label does not exist on the pod, the else block is executed, and the pod gets the toleration for unclassified and is subsequently scheduled to the unclassified node pool.
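To make the fallback concrete, here is a minimal sketch of the behaviour described above. It is not the actual aaw-toleration-injector code: the `Toleration` struct is a simplified stand-in for the Kubernetes type, the `tolerationFor` helper is hypothetical, and the `Equal` operator and `NoSchedule` effect are assumptions.

```go
package main

import "fmt"

// classificationLabel is the pod label key named in this issue.
const classificationLabel = "data.statcan.gc.ca/classification"

// Toleration is a simplified stand-in for the Kubernetes core/v1 Toleration type.
type Toleration struct {
	Key      string
	Operator string // assumed value, not taken from the injector source
	Value    string
	Effect   string // assumed value, not taken from the injector source
}

// tolerationFor (hypothetical helper) picks the classification toleration
// from the pod's labels. A missing label falls through to "unclassified",
// which mirrors the else branch described in the issue.
func tolerationFor(labels map[string]string) Toleration {
	value := "unclassified"
	if v, ok := labels[classificationLabel]; ok && v == "protected-b" {
		value = "protected-b"
	}
	return Toleration{
		Key:      classificationLabel,
		Operator: "Equal",
		Value:    value,
		Effect:   "NoSchedule",
	}
}

func main() {
	// Labels as observed on protected-b-test-2-0: the classification label is absent.
	pod := map[string]string{"notebook.statcan.gc.ca/protected-b": "true"}
	fmt.Println(tolerationFor(pod).Value) // unclassified

	// With the expected label present, the protected-b toleration is chosen.
	pod[classificationLabel] = "protected-b"
	fmt.Println(tolerationFor(pod).Value) // protected-b
}
```

Under these assumptions, a pod carrying only notebook.statcan.gc.ca/protected-b: "true" ends up with the unclassified toleration, which matches what I observed.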
I'm guessing there is some kind of mutating webhook that translates instances of poddefault into modifications to the pod spec during admission control, and that this mutating webhook is currently failing (maybe admission-webhook-deployment in the kubeflow namespace?). The cause may be something else too; I haven't investigated past this point.
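As a rough illustration of what I would expect such a webhook to do, the sketch below merges labels from matching poddefaults into a pod's labels. This is a guess at the behaviour, not the Kubeflow implementation: the `PodDefault` struct only models a label selector and the labels to add, `applyPodDefaults` is a hypothetical helper, and using notebook.statcan.gc.ca/protected-b: "true" as the selector is an assumption.

```go
package main

import "fmt"

// PodDefault is a simplified, hypothetical stand-in for the Kubeflow resource.
// Only a label selector and the labels to add are modelled here.
type PodDefault struct {
	MatchLabels map[string]string
	Labels      map[string]string
}

// matches reports whether every selector label is present on the pod
// with the same value. An empty selector matches every pod.
func (pd PodDefault) matches(podLabels map[string]string) bool {
	for k, want := range pd.MatchLabels {
		if got, ok := podLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// applyPodDefaults (hypothetical helper) returns a copy of podLabels with
// the labels of every matching PodDefault merged in.
func applyPodDefaults(podLabels map[string]string, defaults []PodDefault) map[string]string {
	out := make(map[string]string, len(podLabels))
	for k, v := range podLabels {
		out[k] = v
	}
	for _, pd := range defaults {
		if !pd.matches(podLabels) {
			continue
		}
		for k, v := range pd.Labels {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Assumed shape of the protected-b poddefault.
	protectedB := PodDefault{
		MatchLabels: map[string]string{"notebook.statcan.gc.ca/protected-b": "true"},
		Labels:      map[string]string{"data.statcan.gc.ca/classification": "protected-b"},
	}
	pod := map[string]string{"notebook.statcan.gc.ca/protected-b": "true"}

	// If the webhook runs, the classification label is added.
	fmt.Println(applyPodDefaults(pod, []PodDefault{protectedB}))

	// If the webhook never runs, the pod keeps only its original labels.
	fmt.Println(pod)
}
```

If the webhook is skipped at admission time, the second case applies and the pod is created without the classification label.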
Note: If the above is correct, it would also explain why there was a gatekeeper error when I tried to attach a volume. If the pod gets assigned the label data.statcan.gc.ca/classification: unclassified, there is probably a gatekeeper policy flagging that a protected-b volume should not be mounted to an unclassified notebook. Of course, the protected-b notebook shouldn't be getting the data.statcan.gc.ca/classification: unclassified label in the first place, so this gatekeeper constraint shouldn't trigger once the label is data.statcan.gc.ca/classification: protected-b, as it should be.
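The kind of check I suspect the disk-data-classification constraint performs can be sketched as follows. This is only an illustration of the comparison, not the actual gatekeeper policy: the `checkVolume` and `classificationOf` helpers are hypothetical, and both the assumption that volumes carry the same label key and the default of unclassified for a missing label are mine.

```go
package main

import "fmt"

// classificationLabel is the label key named in this issue. Assuming the
// volume carries the same key is a guess.
const classificationLabel = "data.statcan.gc.ca/classification"

// classificationOf (hypothetical helper) reads the classification label,
// treating a missing label as "unclassified" (an assumption).
func classificationOf(labels map[string]string) string {
	if v, ok := labels[classificationLabel]; ok {
		return v
	}
	return "unclassified"
}

// checkVolume (hypothetical helper) rejects a pod whose classification
// differs from that of a volume it mounts, using the wording of the
// gatekeeper message quoted in this issue.
func checkVolume(podName string, podLabels map[string]string, volName string, volLabels map[string]string) error {
	podClass := classificationOf(podLabels)
	volClass := classificationOf(volLabels)
	if podClass != volClass {
		return fmt.Errorf("volume <%s> data classification <%s> conflicts with pod <%s> data classification <%s>",
			volName, volClass, podName, podClass)
	}
	return nil
}

func main() {
	vol := map[string]string{classificationLabel: "protected-b"}

	// Pod without the classification label: the check fails.
	pod := map[string]string{"notebook.statcan.gc.ca/protected-b": "true"}
	fmt.Println(checkVolume("protected-b-test-0", pod, "workspace-protected-b-test", vol))

	// Pod with the expected label: the check passes and prints <nil>.
	pod[classificationLabel] = "protected-b"
	fmt.Println(checkVolume("protected-b-test-0", pod, "workspace-protected-b-test", vol))
}
```

With the label missing, this reproduces the same conflict wording as the FailedCreate event, which is consistent with the theory that the label is never applied.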
Edit: it looks like the admission-webhook-deployment's pod is experiencing a TLS handshake error (tls: bad certificate) - it's possible that the webhook is not running on admission control, and so these pods are not getting the appropriate labels added.
It looks like the data.statcan.gc.ca/classification pod label should be applied by the kubeflowv1alpha1/poddefault resource that is created in aaw-kubeflow-profiles-controller/notebook.go#155:177.