Integration with Kueue #12363

Open
terrytangyuan opened this issue Dec 15, 2023 · 5 comments
Labels: area/controller, area/spec, area/suspend-resume, type/feature

Comments

@terrytangyuan
Member

Summary

Argo Workflows needs to implement the suspend mechanism required to work with Kueue. See kubernetes-sigs/kueue#74 for more details.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@alculquicondor

Also relevant: kubernetes/kubernetes#121681

agilgur5 added the area/controller and area/spec labels on Dec 19, 2023
@agilgur5
Member

I read the above two issues and I'm not sure what the next step would be for Argo here.

The current Workflow suspend spec is very similar to Job's suspend spec, so if that suffices for an API, I'm not sure what other changes would be needed.

The readinessGate is listed as an "alternative" proposal and is still being refined upstream, so if it isn't necessary, I don't quite see what's currently missing in the Argo spec. Can someone elaborate?
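
To make the comparison concrete, here is a minimal Python sketch that models both manifests as plain dictionaries. Both the batch/v1 Job and the Argo Workflow expose a boolean `spec.suspend`; the example manifests and the `set_suspend` helper are hypothetical and only illustrate how a queueing controller could toggle that field the same way for either kind.

```python
import copy

# Trimmed, illustrative manifests; names and images are made up for this sketch.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "suspend": True,  # created suspended, waiting for admission
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "busybox"}],
                "restartPolicy": "Never",
            }
        },
    },
}

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"name": "example-workflow"},
    "spec": {
        "suspend": True,  # same field name and type as on the Job
        "entrypoint": "main",
        "templates": [{"name": "main", "container": {"image": "busybox"}}],
    },
}


def set_suspend(manifest, suspended):
    """Return a copy of the manifest with spec.suspend set (hypothetical helper)."""
    updated = copy.deepcopy(manifest)
    updated.setdefault("spec", {})["suspend"] = suspended
    return updated


# Resuming is the same one-field change for both kinds.
resumed_job = set_suspend(job, False)
resumed_workflow = set_suspend(workflow, False)
```

The sketch only covers whole-object suspension; it says nothing about how the resources of a multi-step workflow would be estimated before resuming.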

agilgur5 added the area/suspend-resume label on Dec 21, 2023
@shuangkun
Member

Maybe we can define a layer-level suspend mechanism (between the workflow and its steps) and estimate the total resources for the next layer. Once quota is reserved for that layer, we resume just that one layer.
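
As a rough illustration of the layer idea, the Python sketch below groups the tasks of a DAG into layers by dependency depth and sums the resource requests per layer. The task names, the resource figures, and the helpers `compute_layers` and `estimate_layer` are all hypothetical, and summing assumes every task in a layer runs concurrently; this is not how Argo or Kueue compute anything today.

```python
from collections import defaultdict

# Hypothetical DAG: task -> (dependencies, CPU request in millicores, memory request in MiB)
tasks = {
    "fetch": ([], 500, 256),
    "train-a": (["fetch"], 2000, 4096),
    "train-b": (["fetch"], 2000, 4096),
    "report": (["train-a", "train-b"], 250, 128),
}


def compute_layers(tasks):
    """Group tasks by dependency depth: layer 0 has no dependencies, and so on."""
    depth = {}

    def resolve(name, visiting=()):
        if name in depth:
            return depth[name]
        if name in visiting:
            raise ValueError(f"dependency cycle at {name!r}")
        deps = tasks[name][0]
        # A task sits one layer below its deepest dependency.
        depth[name] = 0 if not deps else 1 + max(
            resolve(dep, visiting + (name,)) for dep in deps
        )
        return depth[name]

    layers = defaultdict(list)
    for name in tasks:
        layers[resolve(name)].append(name)
    return [sorted(layers[i]) for i in range(len(layers))]


def estimate_layer(tasks, layer):
    """Sum the requests of a layer, assuming all of its tasks run at once."""
    return {
        "cpu_millicores": sum(tasks[name][1] for name in layer),
        "memory_mib": sum(tasks[name][2] for name in layer),
    }


layers = compute_layers(tasks)
estimates = [estimate_layer(tasks, layer) for layer in layers]
```

With per-layer estimates like these, a controller could request quota for one layer at a time and resume only that layer once the quota is reserved.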

@alculquicondor

I think the next step is to take a step back to understand the following:

  • What are the possible integration points?
  • What exactly should be suspended (the entire workflow, or a layer)?
  • If layers, should we enqueue each layer only when it is ready to run, or should all layers be enqueued at the beginning but not start until their dependencies are met? This is where Job readiness gates (kubernetes/kubernetes#121681) would help, if a layer is a k8s Job or has an equivalent representation.

Note that Kueue works best when there is a CRD that represents the unit of queueing.

@KunWuLuan

KunWuLuan commented Apr 26, 2024

> Note that Kueue works best when there is a CRD that represents the unit of queueing.

I am thinking about this too. If there is no CRD that represents the unit of queueing for every step, we may need to suspend the whole Argo workflow, and it is hard to estimate the resources a whole workflow needs. I would prefer to let users choose when to suspend the workflow by adding a suspend template, as they do now.

Maybe it is enough to add a property that indicates the podsets required to run once the workflow is resumed. That way, when the workflow is suspended, we can create a Workload from the required podsets in the workflow status and resume the workflow when the Workload is admitted.
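
A minimal Python sketch of that flow, with several assumptions: the `requiredPodSets` status field does not exist in Argo Workflows and is purely hypothetical, the helpers `build_workload` and `is_admitted` are made up for illustration, and the Workload layout (`kueue.x-k8s.io/v1beta1`, `spec.queueName`, `spec.podSets`, an `Admitted` condition) follows my understanding of the Kueue API and should be checked against its reference.

```python
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"  # label Kueue uses to pick a LocalQueue


def build_workload(workflow, pod_sets_field="requiredPodSets"):
    """Build a Kueue Workload dict from a suspended workflow (hypothetical field)."""
    meta = workflow["metadata"]
    pod_sets = workflow.get("status", {}).get(pod_sets_field, [])
    if not pod_sets:
        raise ValueError("suspended workflow declares no required podsets")
    return {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "Workload",
        "metadata": {
            "name": f"workflow-{meta['name']}",
            "namespace": meta.get("namespace", "default"),
        },
        "spec": {
            "queueName": meta.get("labels", {}).get(QUEUE_LABEL, ""),
            "podSets": [
                {"name": ps["name"], "count": ps["count"], "template": ps["template"]}
                for ps in pod_sets
            ],
        },
    }


def is_admitted(workload):
    """True once the Workload carries an Admitted condition with status True."""
    conditions = workload.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Admitted" and c.get("status") == "True" for c in conditions
    )


# Illustrative suspended workflow; every value here is made up.
suspended_workflow = {
    "metadata": {
        "name": "train",
        "namespace": "ml",
        "labels": {QUEUE_LABEL: "team-a"},
    },
    "status": {
        "requiredPodSets": [
            {
                "name": "trainers",
                "count": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "main",
                                "image": "busybox",
                                "resources": {"requests": {"cpu": "2"}},
                            }
                        ]
                    }
                },
            }
        ]
    },
}

workload = build_workload(suspended_workflow)
# A controller would resume the workflow only after this returns True.
ready_to_resume = is_admitted(workload)
```

Here `ready_to_resume` stays false because the freshly built Workload has no status yet; in a real integration Kueue would set the condition after reserving quota.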
