
[coscheduling] support match policy for podgroup #560

Closed
KunWuLuan opened this issue Mar 29, 2023 · 21 comments
@KunWuLuan
Contributor

For GPU topology, all my tasks need to be scheduled at the same time to ensure the best performance, so I hope the gang can only be satisfied when the number of pods waiting on Permit reaches min-available.
For elastic training tasks, new workers arrive at random times, so I hope the gang can be satisfied when the total number of running and waiting pods reaches min-available.
In some specific scenarios, the Pods of completed tasks may no longer exist due to node recycling, so it is impossible to count the number of completed tasks. In this case, OnceResourceSatisfied can be used.

We can use scheduling.sigs.x-k8s.io/pod-group-matchpolicy to declare which pods should be considered in Permit.
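A rough sketch of what these three behaviors could look like in the Permit check, using the policy names proposed later in this thread (all names and signatures here are hypothetical, not the plugin's actual code):

```go
// Hypothetical sketch of how a Permit-time gang check could vary with the
// proposed match policy; names and signatures are illustrative only.
package main

import "fmt"

type matchPolicy string

const (
	policyOnlyWaiting       matchPolicy = "only-waiting"       // count only pods waiting on Permit
	policyWaitingAndRunning matchPolicy = "waiting-and-running" // today's behavior
	policyOnceSatisfied     matchPolicy = "oncesatisfied"       // once satisfied, always admit
)

type groupState struct {
	waiting, running int
	satisfiedOnce    bool // set once the gang has been satisfied at least once
}

// gangSatisfied reports whether the pod group should be admitted in Permit.
func gangSatisfied(p matchPolicy, s groupState, minAvailable int) bool {
	switch p {
	case policyOnlyWaiting:
		return s.waiting >= minAvailable
	case policyOnceSatisfied:
		return s.satisfiedOnce || s.waiting+s.running >= minAvailable
	default: // policyWaitingAndRunning
		return s.waiting+s.running >= minAvailable
	}
}

func main() {
	s := groupState{waiting: 2, running: 2}
	fmt.Println(gangSatisfied(policyOnlyWaiting, s, 4))       // false: only 2 waiting
	fmt.Println(gangSatisfied(policyWaitingAndRunning, s, 4)) // true: 2 + 2 >= 4
}
```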

@KunWuLuan
Contributor Author

/assign

@Huang-Wei
Contributor

We can use scheduling.sigs.x-k8s.io/pod-group-matchpolicy to declare which pods should be considered in permit.

IMO a field in the PodGroup API is a better place?

cc @denkensk
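
For illustration, a hedged sketch of what such a spec field might look like (simplified; the MatchPolicy field and its values are hypothetical and not part of the actual API, while MinMember is the existing field):

```go
// Hypothetical sketch: an optional match-policy field on the PodGroup spec,
// as an alternative to the annotation. Illustrative only, not the real API.
package v1alpha1

type PodGroupSpec struct {
	// MinMember defines the minimal number of members/tasks to run the pod group;
	// the gang is admitted only when this many pods can run together.
	MinMember int32 `json:"minMember,omitempty"`

	// MatchPolicy (hypothetical) declares which pods count toward MinMember in
	// Permit, e.g. "only-waiting", "waiting-and-running", or "oncesatisfied".
	MatchPolicy string `json:"matchPolicy,omitempty"`
}
```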

@KunWuLuan
Contributor Author

A new property in the spec is also feasible, I think.

@denkensk
Member

So I hope the gang can be satisfied when the total number of running and waiting pods reaches min-available.

Isn't that what we have now?

scheduling.sigs.x-k8s.io/pod-group-matchpolicy

What value should we fill in? @KunWuLuan

@KunWuLuan
Contributor Author

Isn't that what we have now?

Yes, we already support this case, and I think this can be the default behavior.

What value should we fill in? @KunWuLuan

I think we can support three values for the three cases I mentioned above, like "only-waiting", "waiting-and-running", and "oncesatisfied".

Or we can let users choose the pod statuses they need, like "scheduled", "running", "succeeded", "failed", and "oncesatisfied". For example, they can do nothing or set "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"="scheduled,running" to achieve the same behavior as what we have now. @denkensk
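
A minimal sketch of this second option, parsing the annotation value into a set of states to count (the helper name is hypothetical; the default mirrors the current scheduled-plus-running behavior):

```go
// Hypothetical sketch of parsing the proposed annotation into the set of pod
// states counted in Permit; the helper name is illustrative only.
package main

import (
	"fmt"
	"strings"
)

const matchPolicyAnnotation = "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"

// parseMatchPolicy returns the set of pod states that should count toward
// min-available. Without the annotation it falls back to today's behavior:
// counting scheduled (waiting on Permit) and running pods.
func parseMatchPolicy(annotations map[string]string) map[string]bool {
	value, ok := annotations[matchPolicyAnnotation]
	if !ok || value == "" {
		return map[string]bool{"scheduled": true, "running": true}
	}
	states := make(map[string]bool)
	for _, s := range strings.Split(value, ",") {
		states[strings.TrimSpace(s)] = true
	}
	return states
}

func main() {
	fmt.Println(parseMatchPolicy(nil)) // default: map[running:true scheduled:true]
	fmt.Println(parseMatchPolicy(map[string]string{
		matchPolicyAnnotation: "scheduled,running,succeeded",
	}))
}
```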

@denkensk
Member

denkensk commented Apr 3, 2023

I think we can support three values for the three cases I mentioned above, like "only-waiting", "waiting-and-running", and "oncesatisfied".

Or we can let users choose the pod statuses they need, like "scheduled", "running", "succeeded", "failed", and "oncesatisfied". For example, they can do nothing or set "scheduling.sigs.x-k8s.io/pod-group-matchpolicy"="scheduled,running" to achieve the same behavior as what we have now.

Can you give more details about your scenario? It will help me understand what you want clearly. As far as I know, for distributed training like PyTorch, we need running + waiting to be enough. Why do we need to include succeeded and failed? @KunWuLuan

@KunWuLuan
Contributor Author

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

@denkensk
Member

denkensk commented Apr 4, 2023

FYI @RongbinZ @Hanyu96 @Xiaoaier-Z-L Do we also have these issues in asynchronous training?

@denkensk
Member

denkensk commented Apr 4, 2023

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

Thanks for your explanation. These scenarios mainly focus on failure recovery, but I was wondering whether we need to keep coscheduling in effect during failure recovery. @Huang-Wei WDYT? Because most of the pods are already scheduled, coscheduling is less effective at saving resources.

For the asynchronous training part, you said "if some workers have already completed the training job, the new worker will be blocked in Permit", and I understand it. But I wonder whether you still need coscheduling when your training job can be launched even though the total number of workers is less than the min number?

@denkensk
Member

denkensk commented Apr 4, 2023

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted.

Do you mean there is an issue of untimely state synchronization here?

And will you delete all the workers yourself and recreate them directly in the job? Why not resubmit a new one to continue training? @KunWuLuan

@Huang-Wei
Contributor

For example, in asynchronous training with DLRover, if a worker fails during training, maybe because of OOM or some other reason, the Etjob-operator will start a new worker to continue training. And if some workers have already completed the training job, the new worker will be blocked in Permit. In this case, we may need to include succeeded pods.

This sounds reasonable, as succeeded Pods should be treated as completed and thus get deducted from minMember. But @denkensk's point is also valid: if the operator decides to spawn replacement pod(s) only, it may leave the job in an incomplete state because the running and replacement pods are no longer a gang.

In real-world cases, unknown pod failures are not uncommon. So providing an additional option to count succeeded pods is not a bad idea - at least it's not worse than the current implementation. We can treat it as a remedy for cases where the operator chooses to backfill replacement pods. @denkensk WDYT?
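
A rough sketch of that accounting (hypothetical helper, not the plugin's actual code): counting succeeded pods amounts to deducting them from minMember before the usual waiting-plus-running check.

```go
// Hypothetical sketch: succeeded pods are treated as completed gang members,
// so the effective quorum in Permit becomes minMember minus the succeeded count.
package main

import "fmt"

func gangSatisfiedWithSucceeded(waiting, running, succeeded, minMember int) bool {
	remaining := minMember - succeeded
	if remaining < 0 {
		remaining = 0
	}
	return waiting+running >= remaining
}

func main() {
	// minMember = 4, one worker already succeeded: 1 waiting + 2 running is enough.
	fmt.Println(gangSatisfiedWithSucceeded(1, 2, 1, 4)) // true
}
```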

In the tfjob training process, if any of the workers fails, we must delete all workers and rebuild them to continue training, and the newly created workers may arrive before the old workers are deleted. If we use running + waiting, the new workers can run directly, while I think they should also be blocked in Permit.

In this case, is the intention to delete all workers but keep the PS pod? And are the PS and Worker pods treated as one PodGroup, or are only the Worker pods treated as a PodGroup?

@denkensk
Member

denkensk commented Apr 4, 2023

In this case, is the intention to delete all workers but keep the PS pod? And are the PS and Worker pods treated as one PodGroup, or are only the Worker pods treated as a PodGroup?

I think ps + worker both belong to the same pod group.

@KunWuLuan
Contributor Author

I think ps + worker both belong to the same pod group.

Thanks for replying 😄. Yes, currently they both belong to the same pod group.

@denkensk
Member

denkensk commented Apr 4, 2023

In real-world cases, unknown pod failures are not uncommon. So providing an additional option to count succeeded pods is not a bad idea - at least it's not worse than the current implementation. We can treat it as a remedy for cases where the operator chooses to backfill replacement pods. @denkensk WDYT?

It's reasonable. But we can analyze the problem in depth.

If you include succeeded pods in the min number, does this mean your asynchronous training job can launch and start even if the count of workers is less than the min number, because the succeeded pods cannot join the distributed training, right? @KunWuLuan

@KunWuLuan
Contributor Author

Do you mean there is an issue of untimely state synchronization here?

And will you delete all the workers yourself and recreate them directly in the job? Why not resubmit a new one to continue training? @KunWuLuan

Apologies, I didn't explain it clearly 😄. Currently, if any of the workers in a tfjob fails, the whole job will be marked as failed and we need to submit a new tfjob to restart training, as you said.
If in the future we had some automated pod reconstruction methods, they would only care about rebuilding the pods and would not need to care about how to build a non-existent pod group. WDYT?

@denkensk
Member

denkensk commented Apr 4, 2023

If in the future we had some automated pod reconstruction methods, they would only care about rebuilding the pods and would not need to care about how to build a non-existent pod group. WDYT?

Thanks for your explanation. @KunWuLuan
I am not sure I understand the reconstruction methods clearly, so please correct me if I misunderstand. Do you mean you want to reuse the existing pod group when you restart the failed training job, and don't want to recreate the related pod group again?

@KunWuLuan
Contributor Author

KunWuLuan commented Apr 12, 2023

Do you mean you want to reuse the existing pod group when you restart the failed training job, and don't want to recreate the related pod group again?

Yes. I think deleting and rebuilding a resource with the same name is a redundant action, and reusing an existing podgroup can preserve some historical information. @denkensk
If they can ensure that the previous pods have been deleted before rebuilding them, the current version of PodGroup can also meet the requirements. 🤣

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jul 11, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 19, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Feb 18, 2024