[JobSet v0.3.0 Bug] Webhook failing to start and somehow blocking regular Jobs not owned by JobSets from creating pods #361
Comments
/kind bug
I think I see the issue. The mutating pod webhook is selecting all pods, and has no constraints on them being owned by a JobSet. We need to update it to add that constraint.
Hmm, the thing is, the pod webhook has logic that skips pods that aren't owned by a JobSet.
It's just that when defining a webhook, you have to declare the object type it applies to, which in this case is Pod. So if the JobSet controller manager can't find a node to run on, but every pod being created in the cluster still has to go through that webhook, this can happen?
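The in-handler filter described above might look like the following sketch. This is an illustrative simplification, not the actual JobSet handler: the label key `jobset.sigs.k8s.io/jobset-name` and the function name are assumptions, and real admission handlers receive a full `admission.Request` rather than a bare label map.

```go
package main

import "fmt"

// podPartOfJobSet sketches the admission-stage filter: the handler
// inspects the pod's labels and skips mutation for pods that are not
// part of a JobSet. The label key is an assumption based on the
// jobset.sigs.k8s.io API group, not confirmed against the source.
func podPartOfJobSet(labels map[string]string) bool {
	_, ok := labels["jobset.sigs.k8s.io/jobset-name"]
	return ok
}

func main() {
	// A pod carrying the JobSet label is mutated; any other pod is skipped.
	fmt.Println(podPartOfJobSet(map[string]string{"jobset.sigs.k8s.io/jobset-name": "demo"}))
	fmt.Println(podPartOfJobSet(map[string]string{"app": "web"}))
}
```

Note the key point of the bug: this check runs inside the webhook handler, so it only helps once the handler is actually reachable. If the webhook service has no ready endpoints, the API server never gets far enough to execute this code.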
I think we may be able to add filters to the webhook.
This sounds like the issues @dejanzele and I have been seeing. I think something is up with the cert-rotator we use with newer versions of kind, and I think it's related to the flakiness in the e2e tests. Sometimes we are not able to deploy the jobset deployment, and I see failures in kube-api-controller-manager around the webhooks for the pods. As you stated, the deployment is stuck and you see failures calling the pod webhook. In our e2e tests you can see errors like this:
The pod webhook is selecting all pods; the webhook definition itself has no constraint that they be owned by a JobSet. We only filter out pods not owned by a JobSet in the Go code, at the admission stage of the webhook. So if the webhook is installed but unable to run anywhere, it will block all pods in the cluster from being created.
We need to add an objectSelector to the webhook to select only pods with certain labels (indicating they are part of a JobSet), but kubebuilder annotations do not support this: https://book.kubebuilder.io/reference/markers/webhook
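For reference, the kind of constraint being discussed would live directly in the `MutatingWebhookConfiguration`. The fragment below is a hypothetical sketch, not the project's actual manifest: the webhook name, service name, path, and label key are all assumptions.

```yaml
# Hypothetical excerpt of a MutatingWebhookConfiguration with an
# objectSelector. Pods lacking the JobSet label never get sent to the
# webhook, so an unavailable webhook service no longer blocks
# unrelated pod creation.
webhooks:
  - name: mpod.kb.io
    admissionReviewVersions: ["v1"]
    sideEffects: None
    clientConfig:
      service:
        name: jobset-webhook-service   # assumed name
        namespace: jobset-system
        path: /mutate--v1-pod          # assumed path
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    objectSelector:
      matchExpressions:
        - key: jobset.sigs.k8s.io/jobset-name   # assumed label key
          operator: Exists
```

Unlike the in-handler filter, the objectSelector is evaluated by the API server itself before it calls the webhook, which is what removes the single point of failure.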
Yes, I totally forgot about the objectSelector, which we used to add manually during testing. We absolutely need to add it to avoid making the jobset operator a single point of failure for the whole cluster.
I asked in the kubebuilder Slack channel, and it sounds like there may be a way to inject the objectSelector using a kubebuilder JSON Patch. I'm looking into it now.
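Since kubebuilder markers can't express the objectSelector, one way to inject it is a patch in the kustomization that kubebuilder scaffolds. The sketch below is an assumption about how that could look, with guessed resource names and label key, not a confirmed fix:

```yaml
# Hypothetical kustomization.yaml snippet: a JSON 6902 patch adding an
# objectSelector to the first webhook in the generated configuration.
patches:
  - target:
      kind: MutatingWebhookConfiguration
      name: mutating-webhook-configuration   # assumed generated name
    patch: |-
      - op: add
        path: /webhooks/0/objectSelector
        value:
          matchExpressions:
            - key: jobset.sigs.k8s.io/jobset-name   # assumed label key
              operator: Exists
```

The patch runs at build time, so the marker-generated manifest stays untouched and the selector is layered on top by kustomize.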
This problem doesn't happen consistently; I manually tested JobSet v0.3.0 multiple times successfully. However, a user ran into a concerning error today. After installing v0.3.0, the JobSet controller manager fails to start up, and then when they submit a regular Job (not owned by a JobSet), the Job cannot create any pods due to an error calling the JobSet webhook (??).
Steps to reproduce:
JobSet controller manager not starting properly, so there are no endpoints for the webhook service:
kubectl describe rs -n jobset-system
Describing the Job after deploying it:
kubectl describe job tpu-job