
[coscheduling] Infinite scheduling loop of distributed tasks block all other pods #429

Closed
zhiyxu opened this issue Sep 21, 2022 · 13 comments

@zhiyxu
Contributor

zhiyxu commented Sep 21, 2022

Recently, we encountered some infinite scheduling loop issues when using Coscheduling in our production environment.

Consider the following scenario: a distributed task has 3 instances, and each instance requires 4 GPUs. The free resources in the cluster are as follows:

Node     GPUs left
node1    6
node2    2
node3    2
node4    2

Obviously, there is a resource fragmentation problem in the cluster. Since the cluster's total resources can satisfy 3 instances, every instance will pass the PreFilter stage, and the scheduling process goes as follows:

  • Instance 1 is scheduled successfully and waits in the Permit stage; instances 2 and 3 are added to the activeQ
  • Instance 2 fails to schedule due to resource fragmentation, so instance 1 is rejected in the PostFilter stage
  • Instance 3 is scheduled successfully and waits in the Permit stage; instances 1 and 2 are added to the activeQ
  • Instance 1 fails to schedule, so instance 3 is rejected in the PostFilter stage
  • Instance 2 is scheduled successfully and waits in the Permit stage; instances 1 and 3 are added to the activeQ
  • ...and the scheduling loop repeats indefinitely

Worse, whenever an instance is added to the activeQ, the QueueSort extension point places it ahead of all Pods created later, so the loop above blocks every later-created Pod: those Pods stay stuck in Pending and never get a chance to be scheduled.
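For context, the coscheduling QueueSort compares pods by priority first and then by the PodGroup's creation timestamp (falling back to the pod's own enqueue time for pods without a group), which is why a member of an older PodGroup always sorts ahead of any pod created after that group. Below is a simplified, self-contained sketch of that ordering; queuedPod and its fields are illustrative stand-ins, not the plugin's actual types:

```go
package main

import (
	"fmt"
	"time"
)

// queuedPod is a hypothetical, stripped-down stand-in for the scheduler's
// QueuedPodInfo: just the fields the coscheduling QueueSort cares about.
type queuedPod struct {
	name         string
	priority     int32
	podGroupTime *time.Time // PodGroup CreationTimestamp; nil if the pod has no group
	enqueueTime  time.Time  // the pod's own initial attempt timestamp
}

// sortKeyTime returns the timestamp used for ordering: the PodGroup's creation
// time when the pod belongs to a group, otherwise the pod's own enqueue time.
func (p queuedPod) sortKeyTime() time.Time {
	if p.podGroupTime != nil {
		return *p.podGroupTime
	}
	return p.enqueueTime
}

// less mirrors the coscheduling ordering: higher priority first, then the older
// timestamp first. A pod from an old PodGroup therefore always sorts ahead of
// any pod created after that group.
func less(a, b queuedPod) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.sortKeyTime().Before(b.sortKeyTime())
}

func main() {
	groupCreated := time.Now().Add(-10 * time.Minute)
	groupMember := queuedPod{name: "train-worker-1", podGroupTime: &groupCreated, enqueueTime: time.Now()}
	laterPod := queuedPod{name: "web-abc", enqueueTime: time.Now().Add(-5 * time.Minute)}

	// The group member wins even though laterPod has been waiting longer,
	// because the comparison uses the PodGroup's creation timestamp.
	fmt.Println(less(groupMember, laterPod)) // true
}
```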

To mitigate the problem quickly, we removed Coscheduling's QueueSort extension point and fell back to the default PrioritySort plugin, but the scheduling loop still blocks all lower-priority tasks.

The root cause seems to be that the minResources check cannot fundamentally guarantee that all distributed instances can run, yet whenever one instance is scheduled successfully and enters the Wait phase, all of its sibling instances are added to the activeQ through PodsToActivateKey.
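A rough sketch of the activation step being described, simplified from the plugin: framework.PodsToActivateKey and framework.PodsToActivate are real kube-scheduler framework names, while getSiblingPods is a hypothetical stand-in for the PodGroup manager's sibling lookup:

```go
package coscheduling

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// getSiblingPods is hypothetical: it would return the other pods of the same
// PodGroup, e.g. via the PodGroup manager's pod lister.
func getSiblingPods(pod *v1.Pod) []*v1.Pod { return nil }

// activateSiblings stashes the pod's siblings into the cycle state under
// framework.PodsToActivateKey; at the end of the scheduling cycle the
// scheduler activates those pods, i.e. moves them into the activeQ.
func activateSiblings(pod *v1.Pod, state *framework.CycleState) {
	siblings := getSiblingPods(pod)
	if len(siblings) == 0 {
		return
	}
	if c, err := state.Read(framework.PodsToActivateKey); err == nil {
		if s, ok := c.(*framework.PodsToActivate); ok {
			s.Lock()
			for _, sibling := range siblings {
				s.Map[sibling.Namespace+"/"+sibling.Name] = sibling
			}
			s.Unlock()
		}
	}
}
```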

@Huang-Wei
Contributor

The root cause seems to be that minResource verification cannot fundamentally guarantee that all distributed instances can run normally

minResources is optional and pre-gates the group's resource request in a best-effort manner, so that is by design.

The root cause is twofold:

  1. The first lies in the current implementation of QueueSort. We need to queue PodGroup-associated pods fairly rather than using a static creationTimestamp. Unfortunately, we may not come up with a promising solution until the scheduler framework provides an enqueue mechanism; I'm working on a KEP targeting 1.26.
  2. Second, the pod activation logic somewhat bypasses the backoff timer. That works fine in most cases, since it immediately moves the sibling pods to the activeQ so that, within the timeout, they have more chances to be coscheduled. However, in the case you described it has the side effect of moving the group of pods back and forth indefinitely. A solution is to add the PodGroup to a podGroupBackOff struct after PostFilter#L171, so that in your case instance 3 would fail in PreFilter (because the PodGroup is being backed off) instead of getting permitted; a rough sketch of the idea follows below.
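A minimal sketch of such a PodGroup backoff (illustrative names only, not the eventual implementation): PostFilter records an expiry time for the rejected PodGroup, and PreFilter fails fast for that group's pods until the entry expires, which breaks the activate/reject loop:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// podGroupBackoff tracks PodGroups that were recently rejected in PostFilter.
type podGroupBackoff struct {
	mu      sync.Mutex
	expires map[string]time.Time // key: "<namespace>/<podGroupName>"
}

func newPodGroupBackoff() *podGroupBackoff {
	return &podGroupBackoff{expires: map[string]time.Time{}}
}

// add is what PostFilter would call after rejecting the waiting siblings.
func (b *podGroupBackoff) add(pgKey string, d time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.expires[pgKey] = time.Now().Add(d)
}

// backingOff is what PreFilter would check; while it returns true, every pod
// of the group is rejected up front instead of getting permitted again.
func (b *podGroupBackoff) backingOff(pgKey string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	expiry, ok := b.expires[pgKey]
	if !ok {
		return false
	}
	if time.Now().After(expiry) {
		delete(b.expires, pgKey)
		return false
	}
	return true
}

func main() {
	bo := newPodGroupBackoff()
	bo.add("default/train-job", 2*time.Second)      // PostFilter path
	fmt.Println(bo.backingOff("default/train-job")) // true: PreFilter rejects the group's pods
	time.Sleep(2 * time.Second)
	fmt.Println(bo.backingOff("default/train-job")) // false: the group can be retried
}
```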

@Huang-Wei added the kind/bug label on Sep 22, 2022
@Huang-Wei
Contributor

Note that the 2nd cause is not a regression of #408; the DeniedPG wasn't involved in PostFilter before that change either.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 21, 2022
@Huang-Wei
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 21, 2023
@Huang-Wei
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 22, 2023
@KunWuLuan
Contributor

@Huang-Wei Hi, if no one is working on this, I can help add the PodGroup backoff time.
/assign

@Huang-Wei
Contributor

@KunWuLuan sure, appreciate that. Also, please write a test that reproduces the issue and ensures the fix at least remediates it.

@KunWuLuan
Contributor

thanks, I will do that. : )

@KunWuLuan
Contributor

I have added the test for this case. 😄
You can review it when you have time, thanks. 😆

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 29, 2023
@KunWuLuan
Contributor

#559 has been merged.
Can we close this issue now?

@Huang-Wei
Contributor

Yes, thanks for the reminder.
