Can't get MPIJob status when pod template is invalid #604
Comments
The controller currently requeues the item when there are errors during worker pod creation. Requeueing regardless of what the error is might be problematic. If there are pod spec errors, it should just fail instead of requeueing repeatedly. https://github.com/kubeflow/mpi-operator/blob/master/pkg/controller/mpi_job_controller.go#L964
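A minimal sketch of the distinction suggested above, written in Python for brevity (the operator itself is written in Go, where `apierrors.IsInvalid` from `k8s.io/apimachinery` performs a comparable check). The `ApiError` class, the `handle_pod_create_error` function, and the choice of which status codes count as permanent are assumptions for illustration, not the controller's actual code.

```python
# HTTP status codes the Kubernetes API server returns for requests that
# cannot succeed on retry unless the object itself changes.
# Treating exactly these two as permanent is an assumption of this sketch.
PERMANENT_STATUS_CODES = {400, 422}  # BadRequest, Invalid (Unprocessable Entity)


class ApiError(Exception):
    """Hypothetical stand-in for the error returned when pod creation fails."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code


def handle_pod_create_error(err: ApiError) -> str:
    """Return "fail" for permanent errors and "requeue" for all others."""
    if err.code in PERMANENT_STATUS_CODES:
        # An invalid pod spec stays invalid no matter how often it is retried,
        # so the job should be marked Failed instead of being requeued.
        return "fail"
    # Conflicts, timeouts, and server errors may resolve on their own.
    return "requeue"
```

With this split, an invalid pod template would surface as a Failed condition on the job, while transient API errors would still be retried.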
Ideally, we should have a webhook, but this was never prioritized. Alternatively, we can add a CEL validator: https://kubernetes.io/docs/reference/access-authn-authz/validating-admission-policy/#validation-expression Happy to review a PR if you are interested in working on it.
Previously, I tried to introduce CEL validation to the training-operator: kubeflow/training-operator#1708 However, I gave up on it because the podTemplate is hard to validate within the cost budget of CEL validations. kubeflow/training-operator#1708 (comment) Hence, we must introduce webhooks if we want to validate the podTemplates.
You mean that CEL was too slow or what exactly?
No, I meant that CEL validation cannot work due to the following errors:
This was caused by the cost budget.
Oh, so too many validation rules :)
I guess the budget is exceeded because replicaSpecs are defined by
Ah, we shot ourselves in the foot by using a map instead of explicit fields.
#606
I created an MPIJob with an invalid pod template, and I can never get the MPIJob status (I think the status should be Failed).
Now I can't distinguish MPIJobs that are too new to have a status from MPIJobs with an invalid pod template.
My MPIJob is shown below:
kubectl get mpijob ai62da0dbe-6406-4252-85d6-51ef87eab10d -n cpod -oyaml
The output is:
When describing the MPIJob, the output is: