feat: job controller would keep trying to Execute action #1062
Conversation
Welcome @liuchintao!
Travis tests have failed. Hey @liuchintao, TravisBuddy Request Identifier: 65e12b20-fd83-11ea-a706-8f9ec0eccd66
@@ -41,6 +41,9 @@ spec:
      description: Specification of the desired behavior of the volcano job, including
        the minAvailable
      properties:
        keepSchedule:
Can you describe your usage scenario?
For example, in my namespace foo there is a resource quota of 5 CPUs.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-cpu
  namespace: foo
spec:
  hard:
    cpu: "5"
My MPI job needs exactly 5 CPUs: the mpi-launcher needs 1 CPU and each of the 2 mpi-workers needs 2 CPUs. Everything looks good.
But before I start my MPI job, I start a pod that consumes 2 CPUs.
Now only 2 MPI pods can be created, the launcher and one worker, and I have to wait for that pod to complete and release its resources.
But by then the job controller has reached 15 retries and drops the request from the worker's queue.
I can't get the last pod any more, even after the earlier pod releases its resources.
And the job does not release the resources it already holds, either.
Perhaps it seems reasonable in scenarios like the one you describe, but it may cause a deadlock. For example, suppose there are 10 CPUs in the cluster and another end user launches a job in the same situation as yours. It happens that he is also running part of his tasks and waiting for your running tasks to release resources. If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources and resolve the problem. That's the reason why we set 15 retries.
"If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources and resolve the problem."
Yeah, you are right, this would lead to a deadlock. Maybe I should approach the problem from the perspective of "how to release resources after the controller retries 15 times".
But right now, the job controller just drops the job out of the worker's queue and does nothing.
In the above case:
- CPUs are allocated, but no job is running;
- the controller drops the sync request out of the queue, so IMO the job's status can no longer be synced;
- the CPUs cannot be released until the job is deleted.
How about releasing the job's resources after 15 retries? Every xxxState should Execute(TerminateJobAction) once the retry limit is reached.
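To make the idea concrete, here is a minimal, self-contained sketch in Go. The State interface, Action type, and maxRetries constant are simplified stand-ins for Volcano's real job state machine, not the actual controller code.

package main

import "fmt"

// Simplified stand-ins for Volcano's bus Action and job state machine;
// they exist only to make this sketch self-contained.
type Action string

const (
	SyncJobAction      Action = "SyncJob"
	TerminateJobAction Action = "TerminateJob"
)

type State interface {
	Execute(action Action) error
}

const maxRetries = 15

// handleFailedAction sketches the proposal: while the retry budget lasts,
// return the error so the worker requeues the job; once the budget is
// exhausted, drive the state machine to terminate the job so that the
// pods it already created (and their CPUs) are released.
func handleFailedAction(st State, retries int, original Action, actionErr error) error {
	if retries < maxRetries {
		return fmt.Errorf("action %s failed (retry %d/%d): %w", original, retries, maxRetries, actionErr)
	}
	return st.Execute(TerminateJobAction)
}

// fakeState is a trivial implementation so the sketch compiles and runs.
type fakeState struct{}

func (fakeState) Execute(a Action) error {
	fmt.Println("executing", a)
	return nil
}

func main() {
	_ = handleFailedAction(fakeState{}, maxRetries, SyncJobAction, fmt.Errorf("insufficient quota"))
}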
Please show me your scheduler configuration. I guess you have not configured the gang plugin.
This is my scheduler configuration:
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
How did you observe the behavior above? As far as I know, if the gang plugin is configured, the job can only run when the allocatable resources satisfy at least minAvailable tasks. Namely, minAvailable tasks will be allocated resources at the same time. Are there any logs you can share to help the analysis?
@Thor-wl Sorry it took so long to reply.
We have deployed a new environment with our customized scheduler. Is it still useful if I describe the situation I met before?
The following is my node's capacity; CPUs are sufficient.
Capacity:
  cpu:                48
  ephemeral-storage:  2228605748Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263602412Ki
  pods:               500
I want to test the scheduler's behavior with a Kubernetes ResourceQuota, so I created a namespace that is only allocated 5 CPUs.
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: test-quota-cpu
    namespace: foo
  spec:
    hard:
      cpu: "5"
---
apiVersion: v1
kind: Pod
metadata:
  name: consume-2-cpu
  namespace: foo
spec:
  containers:
  - name: test
    image: busybox:latest
    command: ["/bin/sh", "-c", "tail -f /dev/null"]
    resources:
      requests:
        cpu: 2
      limits:
        cpu: 2
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: big-cpu-mpi-job
  namespace: foo
spec:
  minAvailable: 3
  schedulerName: volcano
  keepSchedule: true
  plugins:
    ssh: []
    svc: []
  tasks:
  - replicas: 1
    name: mpimaster
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
            mkdir -p /var/run/sshd; /usr/sbin/sshd;
            mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world > /home/re;
          image: example-mpi:0.0.1
          name: mpimaster
          ports:
          - containerPort: 22
            name: mpijob-port
          workingDir: /home
          resources:
            limits:
              cpu: 1
        restartPolicy: OnFailure
  - replicas: 2
    name: mpiworker
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
          image: example-mpi:0.0.1
          name: mpiworker
          ports:
          - containerPort: 22
            name: mpijob-port
          workingDir: /home
          resources:
            limits:
              cpu: 2
        restartPolicy: OnFailure
I guess the Volcano scheduler may only consider the node's resources and ignore the namespace's resource quota.
Yeah, so Volcano should consider resource quotas. It's an important point indeed.
Yeah, from this perspective, this PR is a workaround for the issue. I see that PR #1100 considers resource quotas.
@@ -338,6 +338,8 @@ func (cc *jobcontroller) processNextReq(count uint32) bool {
		cc.recordJobEvent(jobInfo.Job.Namespace, jobInfo.Job.Name, batchv1alpha1.ExecuteAction, fmt.Sprintf(
			"Job failed on action %s for retry limit reached", action))
		klog.Warningf("Dropping job<%s/%s> out of the queue: %v because max retries has reached", jobInfo.Job.Namespace, jobInfo.Job.Name, err)
		err = st.Execute(busv1alpha1.TerminateJobAction)
Please also check the error returned by this execution.
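One way to address this, sketched as a fragment of the patched block above; the logging format is an assumption, while st, busv1alpha1, klog, and jobInfo come from the surrounding controller code shown in the diff.

// Check and surface the error from the terminate attempt instead of discarding it.
if err := st.Execute(busv1alpha1.TerminateJobAction); err != nil {
	klog.Errorf("Failed to terminate job <%s/%s> after retry limit reached: %v",
		jobInfo.Job.Namespace, jobInfo.Job.Name, err)
}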
/approve
Maybe we should not merge this PR because of the risk of deadlock. We should think about it more.
/approve cancel
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: liuchintao. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/cc @william-wang
Signed-off-by: liujintao <liujintao@cambricon.com>
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
More info: #1031
The job controller only retries 15 times to execute a job's action, which is enough in most cases.
After the job controller has retried and failed 15 times, the worker just drops the request from the working queue and does nothing else.
If the worker's failure count exceeds the max retry limit while executing the syncJob action, the job stays in Pending forever, and the resources held by the job's pods are never released.
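For context, here is a minimal sketch of the common client-go worker pattern this description refers to: after too many requeues the key is forgotten, so nothing ever reconciles the job again. The workqueue calls are real client-go APIs; the handler around them is illustrative rather than Volcano's actual processNextReq.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 15

// processNextItem sketches how a typical controller worker handles failures:
// it requeues with rate limiting up to maxRetries, then forgets the key,
// which is the "drop the req and do nothing" behavior described above.
func processNextItem(queue workqueue.RateLimitingInterface, sync func(string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := sync(key.(string)); err != nil {
		if queue.NumRequeues(key) < maxRetries {
			queue.AddRateLimited(key) // retry later with backoff
			return true
		}
		// Retry limit reached: the key is dropped and never reconciled again,
		// so any resources the job's pods already hold stay allocated.
		fmt.Printf("dropping %v out of the queue: %v\n", key, err)
	}
	queue.Forget(key)
	return true
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("foo/big-cpu-mpi-job")
	processNextItem(q, func(key string) error { return fmt.Errorf("not enough quota") })
}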