Cross-namespace preemption keeps repeating but low priority task keeps getting allocated before the high priority task #1855

Closed
kenoung opened this issue Nov 25, 2021 · 8 comments
Labels
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

kenoung commented Nov 25, 2021

What happened:
A lower priority task gets preempted by a higher priority task of a different namespace, but keeps getting reallocated. It then gets preempted again, and the cycle repeats.

What you expected to happen:
Higher priority task should get allocated, and lower priority task should be pending.

How to reproduce it (as minimally and precisely as possible):
I adapted the simplified setup from this other issue.
volcano-scheduler.conf

actions: "enqueue, allocate, preempt, backfill"
tiers:
- plugins:
  - name: priority
  # only need gang plugin for JobStarving function if not on master branch
  - name: gang
    enablePreemptable: false
- plugins:
  - name: predicates

Create a queue, two priority classes and two namespaces.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: prod-queue
spec:
  weight: 1
  reclaimable: True
  capability:
    cpu: 4000m
    memory: 4G
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1
---
kind: Namespace
apiVersion: v1
metadata:
  name: a
  labels: 
    name: a
---
kind: Namespace
apiVersion: v1
metadata:
  name: b
  labels: 
    name: b

Start a low-priority job that uses up all the CPU.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-low-pri
  namespace: a
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: low-priority
  policies:
    - event: PodEvicted
      action: RestartTask
  maxRetry: 100
  queue: prod-queue
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          priorityClassName: low-priority
          terminationGracePeriodSeconds: 10
          containers:
            - image: alpine:3
              imagePullPolicy: IfNotPresent
              name: main
              command: ['sh', '-c', "sleep 600000"]
              resources:
                requests:
                  cpu: "4"
                  memory: "256Mi"
          restartPolicy: OnFailure

Start a high-priority job.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vcjob-high-pri
  namespace: b
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: high-priority
  policies:
    - event: PodEvicted
      action: RestartTask
  maxRetry: 100
  queue: prod-queue
  tasks:
    - replicas: 1
      name: "x"
      template:
        metadata:
          name: core
        spec:
          priorityClassName: high-priority
          terminationGracePeriodSeconds: 10
          containers:
            - image: alpine:3
              imagePullPolicy: IfNotPresent
              name: main
              command: ['sh', '-c', "sleep 600000"]
              resources:
                requests:
                  cpu: "4"
                  memory: "256Mi"
          restartPolicy: OnFailure

Observe that the low-priority pod gets terminated, then another pod is started to take its place. The new low-priority pod starts, gets terminated again, and the cycle repeats.

Anything else we need to know?:
This is my current understanding of the issue. In the preempt phase, we find lower-priority tasks to preempt, and if we succeed, we pipeline the higher-priority task. However, this pipelined status does not carry over to the next scheduling session. When a new session opens, the preempted job has already created a new pending pod to take its place. If we encounter the lower-priority task first and the node has sufficient resources, we allocate it, which brings us back to the original state.

Note that this issue occurs only when the high-pri job's namespace comes after the low-pri job's namespace. If we swap their namespaces, i.e. start the low-pri job in namespace b and the high-pri job in namespace a, the issue does not occur: the low-pri job is preempted and the high-pri job is allocated successfully.
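
To make this concrete, below is a small self-contained Go sketch (toy types only, not Volcano's actual scheduler code) that replays the cycle under the assumptions above: allocate hands the node to the first pending job in namespace order, preempt only compares task priority, and the pipelined state is lost between sessions.

// Toy replay of the ping-pong described above. All types and names are
// illustrative; this is not Volcano's code.
package main

import "fmt"

type job struct {
	namespace string
	name      string
	priority  int32
	running   bool
}

func anyRunning(jobs []*job) bool {
	for _, j := range jobs {
		if j.running {
			return true
		}
	}
	return false
}

func main() {
	lowPri := &job{namespace: "a", name: "vcjob-low-pri", priority: 1}
	highPri := &job{namespace: "b", name: "vcjob-high-pri", priority: 1000000}
	// Assumption from the analysis above: allocate walks jobs in namespace
	// order, so the slice is ordered a before b.
	jobs := []*job{lowPri, highPri}

	for session := 1; session <= 3; session++ {
		// allocate: the single node only fits one job; give it to the first
		// pending job in namespace order.
		for _, j := range jobs {
			if !anyRunning(jobs) && !j.running {
				j.running = true
				fmt.Printf("session %d: allocate %s/%s\n", session, j.namespace, j.name)
				break
			}
		}
		// preempt: the still-pending high-priority job evicts any running
		// lower-priority task, but it is only pipelined, and that state
		// does not survive into the next session.
		if !highPri.running {
			for _, victim := range jobs {
				if victim.running && victim.priority < highPri.priority {
					victim.running = false
					fmt.Printf("session %d: preempt %s/%s for %s/%s\n",
						session, victim.namespace, victim.name, highPri.namespace, highPri.name)
				}
			}
		}
	}
}

Swapping the two namespaces changes the iteration order in this toy model, so the high-priority job is allocated first and the cycle never starts, which matches the observation above.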

Environment:

  • Volcano Version: v1.4
  • Kubernetes version (use kubectl version): Client 1.19, Server 1.21
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
kenoung added the kind/bug label Nov 25, 2021
Thor-wl (Contributor) commented Nov 25, 2021

/cc @huone1

Thor-wl (Contributor) commented Dec 2, 2021

I think the analysis is reasonable. The current logic in the preempt action is not ideal: it just finds candidate tasks that have lower priority than the preemptor and then preempts their resources. In fact, it should preempt the candidate tasks with the lowest priority; otherwise it can also cause the ping-pong problem. Can you help rework the logic?
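
As a rough illustration of that idea (toy types, not Volcano's actual preempt code), victim selection could sort the candidates so the lowest-priority tasks are evicted first and stop once the preemptor's request is covered:

// Minimal sketch of preferring the lowest-priority victims. The task type
// and selectVictims function are illustrative, not Volcano's API.
package victims

import "sort"

type task struct {
	Name     string
	Priority int32
	MilliCPU int64
}

// selectVictims returns victims in ascending priority order until their
// combined CPU satisfies the preemptor's request, or nil if that is not
// possible.
func selectVictims(candidates []task, requestMilliCPU int64) []task {
	sorted := append([]task(nil), candidates...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Priority < sorted[j].Priority // evict lowest priority first
	})

	var picked []task
	var freed int64
	for _, t := range sorted {
		if freed >= requestMilliCPU {
			break
		}
		picked = append(picked, t)
		freed += t.MilliCPU
	}
	if freed < requestMilliCPU {
		return nil // not enough preemptible resources
	}
	return picked
}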

Thor-wl added the priority/important-soon and help wanted labels Dec 2, 2021
kenoung (Author) commented Dec 2, 2021

@Thor-wl Hi! Could you clarify whether NamespaceOrder or PriorityClass takes precedence?

For example, given an empty cluster that only has sufficient resources for a single job, namespace a launches a low-priority job, and namespace b launches a high-priority job, do we expect a/low-pri-job or b/high-pri-job to run?

The current allocate action will run a/low-pri-job. If preemption is enabled, b/high-pri-job will preempt a/low-pri-job whenever the latter starts running, but the allocate action will always choose a/low-pri-job again, which results in the ping-pong problem you mentioned.

If our conclusion is that PriorityClass takes precedence, and high-priority jobs should always run ahead of lower-priority jobs regardless of namespace, then we need to completely rework the allocate action and the fair-sharing mechanism; the issue with preempt becomes tangential, and fixing it alone would not resolve this issue.

Otherwise, if NamespaceOrder takes precedence, then we should rework preempt as you've suggested so that it takes NamespaceOrder into consideration.
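
For illustration only, the two possible precedence rules can be written as two job-order comparators (hypothetical jobInfo type, not Volcano's JobInfo or its jobOrderFn API); which of the two the scheduler should follow is exactly the open question here:

// Two possible orderings for pending jobs; which one wins is the open question.
package order

type jobInfo struct {
	Namespace string // namespace order is modelled here simply by name
	Priority  int32
}

// namespaceFirst models the behaviour described in this issue: namespace
// order decides first, priority only breaks ties within a namespace.
// Under this rule a/low-pri-job sorts ahead of b/high-pri-job.
func namespaceFirst(l, r jobInfo) bool {
	if l.Namespace != r.Namespace {
		return l.Namespace < r.Namespace
	}
	return l.Priority > r.Priority
}

// priorityFirst is the alternative: PriorityClass decides first, namespace
// only breaks ties. Under this rule b/high-pri-job sorts ahead of
// a/low-pri-job, so allocate would never hand the node back to the
// low-priority pod after it is preempted.
func priorityFirst(l, r jobInfo) bool {
	if l.Priority != r.Priority {
		return l.Priority > r.Priority
	}
	return l.Namespace < r.Namespace
}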

stale bot commented Mar 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Mar 2, 2022
Thor-wl removed the lifecycle/stale label Mar 3, 2022
stale bot commented Jun 11, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Jun 11, 2022
stale bot commented Aug 10, 2022

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗

stale bot closed this as completed Aug 10, 2022
Thor-wl removed the lifecycle/stale label Aug 12, 2022
Thor-wl reopened this Aug 12, 2022
stale bot commented Nov 12, 2022

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Nov 12, 2022
stale bot commented Jan 22, 2023

Closing for now as there was no activity for the last 60 days after it was marked as stale; let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jan 22, 2023