Spike: Pod Failure Policy (K8s 1.25 alpha feature): can we utilize this in Armada? #1537

Closed
kannon92 opened this issue Sep 20, 2022 · 5 comments

Comments

@kannon92
Contributor

In Kubernetes 1.25, wg-batch/sig-scheduling added the ability to retry Pods/Jobs based on error codes (container exit codes and Pod conditions).

kubernetes/enhancements#3329

Docs: https://kubernetes.io/docs/tasks/job/pod-failure-policy/
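For reference, a Job that opts into this feature might look roughly like the sketch below. It builds the manifest as a plain Python dict and prints it as JSON. The field names under `podFailurePolicy` follow the `batch/v1` Job API as I understand it; the Job name, container name, image, and exit code 42 are made up for illustration, and in 1.25 the feature sits behind the alpha `JobPodFailurePolicy` feature gate.

```python
import json

# Hypothetical Job manifest: the name, image, and exit code 42 are
# illustrative placeholders, not values taken from Armada.
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "backoffLimit": 6,
        "podFailurePolicy": {
            "rules": [
                {
                    # Fail the whole Job immediately on a known-fatal exit code.
                    "action": "FailJob",
                    "onExitCodes": {
                        "containerName": "main",
                        "operator": "In",
                        "values": [42],
                    },
                },
                {
                    # Do not count disruptions (e.g. preemption) against backoffLimit.
                    "action": "Ignore",
                    "onPodConditions": [{"type": "DisruptionTarget"}],
                },
            ]
        },
        "template": {
            "spec": {
                # A pod failure policy requires restartPolicy: Never.
                "restartPolicy": "Never",
                "containers": [
                    {"name": "main", "image": "example.invalid/app:latest"}
                ],
            }
        },
    },
}

print(json.dumps(job_manifest, indent=2))
```

Note that both rule types react to a pod that has already failed, which matters for the Pending-phase discussion further down.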

We should investigate whether we can use this to clean up our logic for executor pod retries.

Currently, our executor has code that does this in a non-standard way: https://github.com/G-Research/armada/blob/master/internal/executor/configuration/podchecks/types.go

We should investigate whether some of this functionality can be mapped onto the Pod failure policy.

@dejanzele dejanzele assigned dejanzele and unassigned dejanzele Oct 6, 2022
@kannon92
Contributor Author

kannon92 commented Oct 20, 2022

I don't think this KEP will be helpful to the Armada executor as is.

The KEP lists supporting the Pending phase of pods as a non-goal. This means it does not target image pull issues or any other Pending-phase issue that could prevent the pod from being scheduled.
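To make the limitation concrete, here is a minimal sketch of exit-code rule matching. It is a simplified model written for this discussion, not the Job controller's actual code: the `Pod`, `ContainerStatus`, and `ExitCodeRule` classes and the `match_action` function are hypothetical names, and only `onExitCodes`-style rules are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContainerStatus:
    name: str
    exit_code: Optional[int] = None  # None: the container never terminated


@dataclass
class Pod:
    phase: str  # e.g. "Pending", "Running", "Failed"
    containers: List[ContainerStatus] = field(default_factory=list)


@dataclass
class ExitCodeRule:
    action: str    # e.g. "FailJob", "Ignore", "Count"
    operator: str  # "In" or "NotIn"
    values: List[int]


def match_action(pod: Pod, rules: List[ExitCodeRule]) -> Optional[str]:
    """Return the action of the first matching rule, or None if nothing matches."""
    # Assumption: rules are only consulted once the pod has failed,
    # so a pod stuck in Pending is never evaluated.
    if pod.phase != "Failed":
        return None
    for rule in rules:
        for container in pod.containers:
            # Skip containers that never terminated or that exited cleanly.
            if container.exit_code is None or container.exit_code == 0:
                continue
            in_set = container.exit_code in rule.values
            if (rule.operator == "In") == in_set:
                return rule.action
    return None


rules = [ExitCodeRule(action="FailJob", operator="In", values=[42])]
failed_pod = Pod("Failed", [ContainerStatus("main", exit_code=42)])
pending_pod = Pod("Pending", [ContainerStatus("main")])  # e.g. image pull error

print(match_action(failed_pod, rules))   # FailJob
print(match_action(pending_pod, rules))  # None
```

Under this model, a pod stuck in Pending because of an image pull error has no terminated container and no Failed phase, so no rule can ever apply to it.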

I created an issue in the Kubernetes repo to start this conversation.

kubernetes/kubernetes#113211

The only outstanding question is whether GR would be interested in retrying based on container exit codes.

@JamesMurkin any thoughts on this last point?

@kannon92
Contributor Author

kannon92 commented Nov 2, 2022

@JamesMurkin mentions that interest in retrying is usually about pods stuck in Pending. There hasn't been much demand for retrying based on exit codes.

@kannon92 kannon92 closed this as completed Nov 2, 2022