
[coscheduling] support match policy for podgroup #560

Closed
KunWuLuan opened this issue Mar 29, 2023 · 21 comments
@KunWuLuan
Contributor

For GPU topology, all my tasks need to be scheduled at the same time to ensure the best performance, so I hope the gang can only be satisfied when the number of pods waiting on Permit reaches min-available.
For elastic training tasks, new workers arrive at random times, so I hope the gang can be satisfied when the total number of running and waiting pods reaches min-available.
In some specific scenarios, the Pods of completed tasks may no longer exist due to node recycling, so it is impossible to count the number of completed tasks. In this case, OnceResourceSatisfied can be used.

We can use scheduling.sigs.x-k8s.io/pod-group-matchpolicy to declare which pods should be considered in Permit.
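A rough sketch of what these three behaviors could look like in the Permit check, using the policy names proposed later in this thread (all names and signatures here are hypothetical, not the plugin's actual code):

```go
// Hypothetical sketch of how a Permit-time gang check could vary with the
// proposed match policy; names and signatures are illustrative only.
package main

import "fmt"

type matchPolicy string

const (
	policyOnlyWaiting       matchPolicy = "only-waiting"       // count only pods waiting on Permit
	policyWaitingAndRunning matchPolicy = "waiting-and-running" // today's behavior
	policyOnceSatisfied     matchPolicy = "oncesatisfied"       // once satisfied, always admit
)

type groupState struct {
	waiting, running int
	satisfiedOnce    bool // set once the gang has been satisfied at least once
}

// gangSatisfied reports whether the pod group should be admitted in Permit.
func gangSatisfied(p matchPolicy, s groupState, minAvailable int) bool {
	switch p {
	case policyOnlyWaiting:
		return s.waiting >= minAvailable
	case policyOnceSatisfied:
		return s.satisfiedOnce || s.waiting+s.running >= minAvailable
	default: // policyWaitingAndRunning
		return s.waiting+s.running >= minAvailable
	}
}

func main() {
	s := groupState{waiting: 2, running: 2}
	fmt.Println(gangSatisfied(policyOnlyWaiting, s, 4))       // false: only 2 waiting
	fmt.Println(gangSatisfied(policyWaitingAndRunning, s, 4)) // true: 2 + 2 >= 4
}
```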

@KunWuLuan
Contributor Author

/assign

@Huang-Wei
Contributor

We can use scheduling.sigs.x-k8s.io/pod-group-matchpolicy to declare which pods should be considered in permit.

IMO a field in the PodGroup API is a better place?

cc @denkensk
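
For illustration, a hedged sketch of what such a spec field might look like (simplified; the MatchPolicy field and its values are hypothetical and not part of the actual API, while MinMember is the existing field):

```go
// Hypothetical sketch: an optional match-policy field on the PodGroup spec,
// as an alternative to the annotation. Illustrative only, not the real API.
package v1alpha1

type PodGroupSpec struct {
	// MinMember defines the minimal number of members/tasks to run the pod group;
	// the gang is admitted only when this many pods can run together.
	MinMember int32 `json:"minMember,omitempty"`

	// MatchPolicy (hypothetical) declares which pods count toward MinMember in
	// Permit, e.g. "only-waiting", "waiting-and-running", or "oncesatisfied".
	MatchPolicy string `json:"matchPolicy,omitempty"`
}
```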

@KunWuLuan
Contributor Author

A new property in the spec is also feasible, I think.

@denkensk
Member

So I hope the gang can be satisfied when the total number of running and waiting pods reaches min-available.

Isn't that what we have now?

scheduling.sigs.x-k8s.io/pod-group-matchpolicy

What value should we fill in? @KunWuLuan

@KunWuLuan
Contributor Author

Isn't that what we have now?

Yes, we already support this case, and I think this can be the default behavior.

What value should we fill in? @KunWuLuan

I think we can support three values for the three cases I mentioned above, like "only-waiting", "waiting-and-running", and "oncesatisfied".

Or we can let users choose the pod statuses they need, like "scheduled", "running", "succeeded", "failed", and "oncesatisfied". For example, they can do nothing or set "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"="scheduled,running" to achieve the same behavior as what we have now. @denkensk
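
A minimal sketch of this second option, parsing the annotation value into a set of states to count (the helper name is hypothetical; the default mirrors the current scheduled-plus-running behavior):

```go
// Hypothetical sketch of parsing the proposed annotation into the set of pod
// states counted in Permit; the helper name is illustrative only.
package main

import (
	"fmt"
	"strings"
)

const matchPolicyAnnotation = "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"

// parseMatchPolicy returns the set of pod states that should count toward
// min-available. Without the annotation it falls back to today's behavior:
// counting scheduled (waiting on Permit) and running pods.
func parseMatchPolicy(annotations map[string]string) map[string]bool {
	value, ok := annotations[matchPolicyAnnotation]
	if !ok || value == "" {
		return map[string]bool{"scheduled": true, "running": true}
	}
	states := make(map[string]bool)
	for _, s := range strings.Split(value, ",") {
		states[strings.TrimSpace(s)] = true
	}
	return states
}

func main() {
	fmt.Println(parseMatchPolicy(nil)) // default: map[running:true scheduled:true]
	fmt.Println(parseMatchPolicy(map[string]string{
		matchPolicyAnnotation: "scheduled,running,succeeded",
	}))
}
```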

@denkensk
Member

denkensk commented Apr 3, 2023

I think we can support three values for the three cases I mentioned above, like "only-waiting", "waiting-and-running", and "oncesatisfied".

Or we can let users choose the pod statuses they need, like "scheduled", "running", "succeeded", "failed", and "oncesatisfied". For example, they can do nothing or set "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"="scheduled,running" to achieve the same behavior as what we have now.

Can you give more details about your scenario? It will help me understand what you want clearly. As far as I know, for distributed training like PyTorch, we need running + waiting to be enough. Why do we need to include succeeded and failed? @KunWuLuan

@KunWuLuan
Contributor Author

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

@denkensk
Member

denkensk commented Apr 4, 2023

FYI @RongbinZ @Hanyu96 @Xiaoaier-Z-L Do we also have these issues in asynchronous training?

@denkensk
Member

denkensk commented Apr 4, 2023

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

Thanks for your explanation. These scenarios mainly focus on failure recovery, but I was wondering whether we need to keep coscheduling in effect during failure recovery. @Huang-Wei WDYT? Because most of the pods are already scheduled, coscheduling is less effective at saving resources.

For the asynchronous training part, you said "if some workers have already completed the training job, the new worker will be blocked in Permit", and I understand it. But I wonder whether you still need coscheduling when your training job can be launched even though the total number of workers is less than the min number?

@denkensk
Member

denkensk commented Apr 4, 2023

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted.

Do you mean there is an issue of untimely state synchronization here?

And will you delete all the workers yourself and recreate them directly in the job? Why not resubmit a new one to continue training? @KunWuLuan

@Huang-Wei
Contributor

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

This sounds reasonable, as succeeded Pods should be treated as completed and thus get deducted from minMember. But @denkensk's point is also valid: if the operator decides to spawn replacement pod(s) only, it may leave the job in an incomplete state because the running and replacement pods are no longer a gang.

In real-world cases, unknown pod failures are not uncommon. So providing an additional option to count succeeded pods is not a bad idea - at least it's not worse than the current implementation. We can treat it as a remedy for cases where the operator chooses to backfill replacement pods. @denkensk WDYT?
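
A rough sketch of that accounting (hypothetical helper, not the plugin's actual code): counting succeeded pods amounts to deducting them from minMember before the usual waiting-plus-running check.

```go
// Hypothetical sketch: succeeded pods are treated as completed gang members,
// so the effective quorum in Permit becomes minMember minus the succeeded count.
package main

import "fmt"

func gangSatisfiedWithSucceeded(waiting, running, succeeded, minMember int) bool {
	remaining := minMember - succeeded
	if remaining < 0 {
		remaining = 0
	}
	return waiting+running >= remaining
}

func main() {
	// minMember = 4, one worker already succeeded: 1 waiting + 2 running is enough.
	fmt.Println(gangSatisfiedWithSucceeded(1, 2, 1, 4)) // true
}
```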

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

In this case, is the intention to delete all workers but keep the PS pod? And are the PS and Worker pods treated as one PodGroup, or are only the Worker pods treated as a PodGroup?

@denkensk
Member

denkensk commented Apr 4, 2023

In this case, is the intention to delete all workers but keep the PS pod? And are the PS and Worker pods treated as one PodGroup, or are only the Worker pods treated as a PodGroup?

I think ps + worker both belong to the same pod group.

@KunWuLuan
Contributor Author

I think ps + worker both belong to the same pod group.

Thanks for replying 😄. Yes, currently they both belong to the same pod group.

@denkensk
Member

denkensk commented Apr 4, 2023

In real-world cases, unknown pod failures are not uncommon. So providing an additional option to count succeeded pods is not a bad idea - at least it's not worse than the current implementation. We can treat it as a remedy for cases where the operator chooses to backfill replacement pods. @denkensk WDYT?

It's reasonable. But we can analyze the problem in depth.

If you include succeeded pods in the min number, does this mean your asynchronous training job can launch and start even if the count of workers is less than the min number, because the succeeded pods cannot join the distributed training, right? @KunWuLuan

@KunWuLuan
Contributor Author

Do you mean there is an issue of untimely state synchronization here?

And will you delete all the workers yourself and recreate them directly in the job? Why not resubmit a new one to continue training? @KunWuLuan

Apologies, I didn't explain it clearly 😄. Currently, if any of the workers in a tfjob fails, the whole job will be marked as failed and we need to submit a new tfjob to restart training, as you said.
If in the future we had some automated pod reconstruction methods, they would only care about rebuilding the pods and would not need to care about how to build a non-existent pod group. WDYT?

@denkensk
Member

denkensk commented Apr 4, 2023

If in the future we had some automated pod reconstruction methods, they would only care about rebuilding the pods and would not need to care about how to build a non-existent pod group. WDYT?

Thanks for your explanation. @KunWuLuan
I am not sure I understand the reconstruction methods clearly, so please correct me if I misunderstand. Do you mean you want to reuse the existing pod group when you restart the failed training job, and don't want to recreate the related pod group again?

@KunWuLuan
Contributor Author

KunWuLuan commented Apr 12, 2023

Do you mean you want to reuse the existing pod group when you restart the failed training job, and don't want to recreate the related pod group again?

Yes. I think deleting and rebuilding a resource with the same name is a redundant action, and reusing an existing podgroup can preserve some historical information. @denkensk
If they can ensure that the previous pods have been deleted before rebuilding them, the current version of PodGroup can also meet the requirements. 🤣

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 11, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Feb 18, 2024