Cross-namespace preemption keeps repeating but low priority task keeps getting allocated before the high priority task #1855

Closed
kenoung opened this issue Nov 25, 2021 · 8 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

kenoung commented Nov 25, 2021

What happened:
A lower priority task gets preempted by a higher priority task of a different namespace, but keeps getting reallocated. It then gets preempted again, and the cycle repeats.

What you expected to happen:
Higher priority task should get allocated, and lower priority task should be pending.

How to reproduce it (as minimally and precisely as possible):
I adapted the simplified setup from this other issue.
volcano-scheduler.conf

actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  # only need gang plugin for JobStarving function if not on master branch
  - name: gang
    enablePreemptable: false
- plugins:
  - name: predicates

Create a queue, two priority classes and two namespaces.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: prod-queue
spec:
  weight: 1
  reclaimable: True
  capability:
    cpu: 4000m
    memory: 4G
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
---
kind: Namespace
apiVersion: v1
metadata:
  name: a
  labels: 
    name: a
---
kind: Namespace
apiVersion: v1
metadata:
  name: b
  labels: 
    name: b

Start a low-priority job that uses up all the CPU.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-low-pri
  namespace: a
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: low-priority
  policies:
    - event: PodEvicted
      action: RestartTask
  maxRetry: 100
  queue: prod-queue
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          priorityClassName: low-priority
          terminationGracePeriodSeconds: 10
          containers:
            - image: alpine:3
              imagePullPolicy: IfNotPresent
              name: main
              command: ['sh', '-c', "sleep 600000"]
              resources:
                requests:
                  cpu: "4"
                  memory: "256Mi"
          restartPolicy: OnFailure

Start a high-priority job.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-high-pri
  namespace: b
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  policies:
    - event: PodEvicted
      action: RestartTask
  maxRetry: 100
  queue: prod-queue
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          priorityClassName: high-priority
          terminationGracePeriodSeconds: 10
          containers:
            - image: alpine:3
              imagePullPolicy: IfNotPresent
              name: main
              command: ['sh', '-c', "sleep 600000"]
              resources:
                requests:
                  cpu: "4"
                  memory: "256Mi"
          restartPolicy: OnFailure

Observe that the low-priority pod gets terminated, then another pod is started to take its place. The new low-priority pod starts, gets terminated again, and the cycle repeats.

Anything else we need to know?:
This is my current understanding of the issue. In the preempt phase, we find lower-priority tasks to preempt, and if we succeed, we pipeline the higher-priority task. However, this pipelined status does not carry over to the next scheduling session. When a new session opens, the preempted job has already created a new pending pod to take its place. If we encounter the lower-priority task first and the node has sufficient resources, we allocate it, which brings us back to the original state.

Note that this issue occurs only when the high-pri job's namespace comes after the low-pri job's namespace. If we swap their namespaces, i.e. start the low-pri job in namespace b and the high-pri job in namespace a, the issue does not occur: the low-pri job is preempted and the high-pri job is allocated successfully.
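
To make this concrete, below is a small self-contained Go sketch (toy types only, not Volcano's actual scheduler code) that replays the cycle under the assumptions above: allocate hands the node to the first pending job in namespace order, preempt only compares task priority, and the pipelined state is lost between sessions.

// Toy replay of the ping-pong described above. All types and names are
// illustrative; this is not Volcano's code.
package main

import "fmt"

type job struct {
	namespace string
	name      string
	priority  int32
	running   bool
}

func anyRunning(jobs []*job) bool {
	for _, j := range jobs {
		if j.running {
			return true
		}
	}
	return false
}

func main() {
	lowPri := &job{namespace: "a", name: "vcjob-low-pri", priority: 1}
	highPri := &job{namespace: "b", name: "vcjob-high-pri", priority: 1000000}
	// Assumption from the analysis above: allocate walks jobs in namespace
	// order, so the slice is ordered a before b.
	jobs := []*job{lowPri, highPri}

	for session := 1; session <= 3; session++ {
		// allocate: the single node only fits one job; give it to the first
		// pending job in namespace order.
		for _, j := range jobs {
			if !anyRunning(jobs) && !j.running {
				j.running = true
				fmt.Printf("session %d: allocate %s/%s\n", session, j.namespace, j.name)
				break
			}
		}
		// preempt: the still-pending high-priority job evicts any running
		// lower-priority task, but it is only pipelined, and that state
		// does not survive into the next session.
		if !highPri.running {
			for _, victim := range jobs {
				if victim.running && victim.priority < highPri.priority {
					victim.running = false
					fmt.Printf("session %d: preempt %s/%s for %s/%s\n",
						session, victim.namespace, victim.name, highPri.namespace, highPri.name)
				}
			}
		}
	}
}

Swapping the two namespaces changes the iteration order in this toy model, so the high-priority job is allocated first and the cycle never starts, which matches the observation above.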

Environment:

  • Volcano Version: v1.4
  • Kubernetes version (use kubectl version): Client 1.19, Server 1.21
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
kenoung added the kind/bug label Nov 25, 2021
Thor-wl (Contributor) commented Nov 25, 2021

/cc @huone1

Thor-wl (Contributor) commented Dec 2, 2021

I think the analysis is reasonable. The current logic in the preempt action is not ideal: it just finds candidate tasks that have lower priority than the preemptor and then preempts their resources. In fact, it should preempt the candidate tasks with the lowest priority; otherwise it can also cause the ping-pong problem. Can you help rework the logic?
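
As a rough illustration of that idea (toy types, not Volcano's actual preempt code), victim selection could sort the candidates so the lowest-priority tasks are evicted first and stop once the preemptor's request is covered:

// Minimal sketch of preferring the lowest-priority victims. The task type
// and selectVictims function are illustrative, not Volcano's API.
package victims

import "sort"

type task struct {
	Name     string
	Priority int32
	MilliCPU int64
}

// selectVictims returns victims in ascending priority order until their
// combined CPU satisfies the preemptor's request, or nil if that is not
// possible.
func selectVictims(candidates []task, requestMilliCPU int64) []task {
	sorted := append([]task(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority // evict lowest priority first
	})

	var picked []task
	var freed int64
	for _, t := range sorted {
		if freed >= requestMilliCPU {
			break
		}
		picked = append(picked, t)
		freed += t.MilliCPU
	}
	if freed < requestMilliCPU {
		return nil // not enough preemptible resources
	}
	return picked
}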

Thor-wl added the priority/important-soon and help wanted labels Dec 2, 2021
kenoung (Author) commented Dec 2, 2021

@Thor-wl Hi! Could you clarify whether NamespaceOrder or PriorityClass takes precedence?

For example, given an empty cluster that only has sufficient resources for a single job, namespace a launches a low-priority job, and namespace b launches a high-priority job, do we expect a/low-pri-job or b/high-pri-job to run?

The current allocate action will run a/low-pri-job. If preemption is enabled, b/high-pri-job will preempt a/low-pri-job whenever the latter starts running, but the allocate action will always choose a/low-pri-job again, which results in the ping-pong problem you mentioned.

If our conclusion is that PriorityClass takes precedence, and high-priority jobs should always run ahead of lower-priority jobs regardless of namespace, then we need to completely rework the allocate action and the fair-sharing mechanism; the issue with preempt becomes tangential, and fixing it alone would not resolve this issue.

Otherwise, if NamespaceOrder takes precedence, then we should rework preempt as you've suggested so that it takes NamespaceOrder into consideration.
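
For illustration only, the two possible precedence rules can be written as two job-order comparators (hypothetical jobInfo type, not Volcano's JobInfo or its jobOrderFn API); which of the two the scheduler should follow is exactly the open question here:

// Two possible orderings for pending jobs; which one wins is the open question.
package order

type jobInfo struct {
	Namespace string // namespace order is modelled here simply by name
	Priority  int32
}

// namespaceFirst models the behaviour described in this issue: namespace
// order decides first, priority only breaks ties within a namespace.
// Under this rule a/low-pri-job sorts ahead of b/high-pri-job.
func namespaceFirst(l, r jobInfo) bool {
	if l.Namespace != r.Namespace {
		return l.Namespace < r.Namespace
	}
	return l.Priority > r.Priority
}

// priorityFirst is the alternative: PriorityClass decides first, namespace
// only breaks ties. Under this rule b/high-pri-job sorts ahead of
// a/low-pri-job, so allocate would never hand the node back to the
// low-priority pod after it is preempted.
func priorityFirst(l, r jobInfo) bool {
	if l.Priority != r.Priority {
		return l.Priority > r.Priority
	}
	return l.Namespace < r.Namespace
}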

stale bot commented Mar 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Mar 2, 2022
Thor-wl removed the lifecycle/stale label Mar 3, 2022
stale bot commented Jun 11, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Jun 11, 2022
stale bot commented Aug 10, 2022

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗

stale bot closed this as completed Aug 10, 2022
Thor-wl removed the lifecycle/stale label Aug 12, 2022
Thor-wl reopened this Aug 12, 2022
stale bot commented Nov 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Nov 12, 2022
stale bot commented Jan 22, 2023

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jan 22, 2023