
[coscheduling] Infinite scheduling loop of distributed tasks block all other pods #429

Closed
zhiyxu opened this issue Sep 21, 2022 · 13 comments

@zhiyxu
Contributor

zhiyxu commented Sep 21, 2022

Recently, we encountered some infinite scheduling loop issues when using Coscheduling in our production environment.

Consider the following scenario: a distributed task has 3 instances, and each instance requires 4 GPUs. The free resources in the cluster are as follows:

Node     GPUs left
node1    6
node2    2
node3    2
node4    2

Obviously, there is a resource fragmentation problem in the cluster. Since the cluster's total resources can satisfy 3 instances, every instance will pass the PreFilter stage, and the scheduling process goes as follows:

  • Instance 1 is scheduled successfully and waits in the Permit stage; instances 2 and 3 are added to the activeQ
  • Instance 2 fails to schedule due to resource fragmentation, so instance 1 is rejected in the PostFilter stage
  • Instance 3 is scheduled successfully and waits in the Permit stage; instances 1 and 2 are added to the activeQ
  • Instance 1 fails to schedule, so instance 3 is rejected in the PostFilter stage
  • Instance 2 is scheduled successfully and waits in the Permit stage; instances 1 and 3 are added to the activeQ
  • ...and the scheduling loop repeats indefinitely

Worse, whenever an instance is added to the activeQ, the QueueSort extension point places it ahead of all Pods created later, so the loop above blocks every later-created Pod: those Pods stay stuck in Pending and never get a chance to be scheduled.
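For context, the coscheduling QueueSort compares pods by priority first and then by the PodGroup's creation timestamp (falling back to the pod's own enqueue time for pods without a group), which is why a member of an older PodGroup always sorts ahead of any pod created after that group. Below is a simplified, self-contained sketch of that ordering; queuedPod and its fields are illustrative stand-ins, not the plugin's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a hypothetical, stripped-down stand-in for the scheduler's
// QueuedPodInfo: just the fields the coscheduling QueueSort cares about.
type queuedPod struct {
	name         string
	priority     int32
	podGroupTime *time.Time // PodGroup CreationTimestamp; nil if the pod has no group
	enqueueTime  time.Time  // the pod's own initial attempt timestamp
}

// sortKeyTime returns the timestamp used for ordering: the PodGroup's creation
// time when the pod belongs to a group, otherwise the pod's own enqueue time.
func (p queuedPod) sortKeyTime() time.Time {
	if p.podGroupTime != nil {
		return *p.podGroupTime
	}
	return p.enqueueTime
}

// less mirrors the coscheduling ordering: higher priority first, then the older
// timestamp first. A pod from an old PodGroup therefore always sorts ahead of
// any pod created after that group.
func less(a, b queuedPod) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.sortKeyTime().Before(b.sortKeyTime())
}

func main() {
	groupCreated := time.Now().Add(-10 * time.Minute)
	groupMember := queuedPod{name: "train-worker-1", podGroupTime: &groupCreated, enqueueTime: time.Now()}
	laterPod := queuedPod{name: "web-abc", enqueueTime: time.Now().Add(-5 * time.Minute)}

	// The group member wins even though laterPod has been waiting longer,
	// because the comparison uses the PodGroup's creation timestamp.
	fmt.Println(less(groupMember, laterPod)) // true
}
```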

To mitigate the problem quickly, we removed Coscheduling's QueueSort extension point and fell back to the default PrioritySort plugin, but the scheduling loop still blocks all lower-priority tasks.

The root cause seems to be that the minResources check cannot fundamentally guarantee that all distributed instances can run, yet whenever one instance is scheduled successfully and enters the Wait phase, all of its sibling instances are added to the activeQ through PodsToActivateKey.
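A rough sketch of the activation step being described, simplified from the plugin: framework.PodsToActivateKey and framework.PodsToActivate are real kube-scheduler framework names, while getSiblingPods is a hypothetical stand-in for the PodGroup manager's sibling lookup:

```go
package coscheduling

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// getSiblingPods is hypothetical: it would return the other pods of the same
// PodGroup, e.g. via the PodGroup manager's pod lister.
func getSiblingPods(pod *v1.Pod) []*v1.Pod { return nil }

// activateSiblings stashes the pod's siblings into the cycle state under
// framework.PodsToActivateKey; at the end of the scheduling cycle the
// scheduler activates those pods, i.e. moves them into the activeQ.
func activateSiblings(pod *v1.Pod, state *framework.CycleState) {
	siblings := getSiblingPods(pod)
	if len(siblings) == 0 {
		return
	}
	if c, err := state.Read(framework.PodsToActivateKey); err == nil {
		if s, ok := c.(*framework.PodsToActivate); ok {
			s.Lock()
			for _, sibling := range siblings {
				s.Map[sibling.Namespace+"/"+sibling.Name] = sibling
			}
			s.Unlock()
		}
	}
}
```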

@Huang-Wei
Contributor

The root cause seems to be that minResource verification cannot fundamentally guarantee that all distributed instances can run normally

minResources is optional and pre-gates the group's resource request in a best-effort manner, so that is by design.

The root cause is twofold:

  1. The first lies in the current implementation of QueueSort. We need to queue PodGroup-associated pods fairly rather than using a static creationTimestamp. Unfortunately, we may not come up with a promising solution until the scheduler framework provides an enqueue mechanism; I'm working on a KEP targeting 1.26.
  2. Second, the pod activation logic somewhat bypasses the backoff timer. That works fine in most cases, since it immediately moves the sibling pods to the activeQ so that, within the timeout, they have more chances to be coscheduled. However, in the case you described it has the side effect of moving the group of pods back and forth indefinitely. A solution is to add the PodGroup to a podGroupBackOff struct after PostFilter#L171, so that in your case instance 3 would fail in PreFilter (because the PodGroup is being backed off) instead of getting permitted; a rough sketch of the idea follows below.
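A minimal sketch of such a PodGroup backoff (illustrative names only, not the eventual implementation): PostFilter records an expiry time for the rejected PodGroup, and PreFilter fails fast for that group's pods until the entry expires, which breaks the activate/reject loop:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// podGroupBackoff tracks PodGroups that were recently rejected in PostFilter.
type podGroupBackoff struct {
	mu      sync.Mutex
	expires map[string]time.Time // key: "<namespace>/<podGroupName>"
}

func newPodGroupBackoff() *podGroupBackoff {
	return &podGroupBackoff{expires: map[string]time.Time{}}
}

// add is what PostFilter would call after rejecting the waiting siblings.
func (b *podGroupBackoff) add(pgKey string, d time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.expires[pgKey] = time.Now().Add(d)
}

// backingOff is what PreFilter would check; while it returns true, every pod
// of the group is rejected up front instead of getting permitted again.
func (b *podGroupBackoff) backingOff(pgKey string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	expiry, ok := b.expires[pgKey]
	if !ok {
		return false
	}
	if time.Now().After(expiry) {
		delete(b.expires, pgKey)
		return false
	}
	return true
}

func main() {
	bo := newPodGroupBackoff()
	bo.add("default/train-job", 2*time.Second)      // PostFilter path
	fmt.Println(bo.backingOff("default/train-job")) // true: PreFilter rejects the group's pods
	time.Sleep(2 * time.Second)
	fmt.Println(bo.backingOff("default/train-job")) // false: the group can be retried
}
```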

@Huang-Wei added the kind/bug label on Sep 22, 2022
@Huang-Wei
Contributor

Note that the 2nd cause is not a regression of #408; the DeniedPG wasn't involved in PostFilter before that change either.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 21, 2022
@Huang-Wei
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 21, 2023
@Huang-Wei
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 22, 2023
@KunWuLuan
Contributor

@Huang-Wei Hi, if no one is working on this, I can help add the PodGroup backoff time.
/assign

@Huang-Wei
Contributor

@KunWuLuan sure, appreciate that. Also, please write a test that reproduces the issue and ensures the fix at least remediates it.

@KunWuLuan
Contributor

thanks, I will do that. : )

@KunWuLuan
Contributor

I have added the test for this case. 😄
You can review it when you have time, thanks. 😆

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 29, 2023
@KunWuLuan
Contributor

#559 has been merged.
Can we close this issue now?

@Huang-Wei
Contributor

Yes, thanks for the reminder.
