feat: job controller would keep trying to Execute action #1062
Conversation
Welcome @liuchintao!
Travis tests have failed. Hey @liuchintao, TravisBuddy Request Identifier: 65e12b20-fd83-11ea-a706-8f9ec0eccd66
@@ -41,6 +41,9 @@ spec:
      description: Specification of the desired behavior of the volcano job, including
        the minAvailable
      properties:
        keepSchedule:
Can you describe your usage scenario?
For example, in my namespace foo there is a resource quota of 5 CPUs.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-cpu
  namespace: foo
spec:
  hard:
    cpu: "5"
My MPI job needs exactly 5 CPUs: the mpi-launcher needs 1 CPU and each of the 2 mpi-workers needs 2 CPUs. Everything looks good.
But before I start my MPI job, I start a pod that consumes 2 CPUs.
Now only 2 MPI pods can be created, the launcher and one worker, and I have to wait for that pod to complete and release its resources.
But by then the job controller has reached 15 retries and drops the request from the worker's queue.
I can't get the last pod any more, even after the earlier pod releases its resources.
And the job does not release the resources it already holds, either.
Perhaps it seems reasonable in scenarios like the one you describe, but it may cause a deadlock. For example, suppose there are 10 CPUs in the cluster and another end user launches a job in the same situation as yours. It happens that he is also running part of his tasks and waiting for your running tasks to release resources. If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources and resolve the problem. That's the reason why we set 15 retries.
"If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources and resolve the problem."
Yeah, you are right, this would lead to a deadlock. Maybe I should approach the problem from the perspective of "how to release resources after the controller retries 15 times".
But right now, the job controller just drops the job out of the worker's queue and does nothing.
In the above case:
- CPUs are allocated, but no job is running;
- the controller drops the sync request out of the queue, so IMO the job's status can no longer be synced;
- the CPUs cannot be released until the job is deleted.
How about releasing the job's resources after 15 retries? Every xxxState should Execute(TerminateJobAction) once the retry limit is reached.
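To make the idea concrete, here is a minimal, self-contained sketch in Go. The State interface, Action type, and maxRetries constant are simplified stand-ins for Volcano's real job state machine, not the actual controller code.

package main

import "fmt"

// Simplified stand-ins for Volcano's bus Action and job state machine;
// they exist only to make this sketch self-contained.
type Action string

const (
	SyncJobAction      Action = "SyncJob"
	TerminateJobAction Action = "TerminateJob"
)

type State interface {
	Execute(action Action) error
}

const maxRetries = 15

// handleFailedAction sketches the proposal: while the retry budget lasts,
// return the error so the worker requeues the job; once the budget is
// exhausted, drive the state machine to terminate the job so that the
// pods it already created (and their CPUs) are released.
func handleFailedAction(st State, retries int, original Action, actionErr error) error {
	if retries < maxRetries {
		return fmt.Errorf("action %s failed (retry %d/%d): %w", original, retries, maxRetries, actionErr)
	}
	return st.Execute(TerminateJobAction)
}

// fakeState is a trivial implementation so the sketch compiles and runs.
type fakeState struct{}

func (fakeState) Execute(a Action) error {
	fmt.Println("executing", a)
	return nil
}

func main() {
	_ = handleFailedAction(fakeState{}, maxRetries, SyncJobAction, fmt.Errorf("insufficient quota"))
}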
Please show me your scheduler configuration. I guess you have not configured the gang plugin.
This is my scheduler configuration:
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
How did you observe the behavior above? As far as I know, if the gang plugin is configured, the job can only run when the allocatable resources satisfy at least minAvailable tasks. Namely, minAvailable tasks will be allocated resources at the same time. Are there any logs you can share to help the analysis?
@Thor-wl Sorry it took so long to reply.
We have deployed a new environment with our customized scheduler. Is it still useful if I describe the situation I met before?
The following is my node's capacity; CPUs are sufficient.
Capacity:
  cpu:                48
  ephemeral-storage:  2228605748Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263602412Ki
  pods:               500
I want to test the scheduler's behavior with a Kubernetes ResourceQuota, so I created a namespace that is only allocated 5 CPUs.
apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: test-quota-cpu
    namespace: foo
  spec:
    hard:
      cpu: "5"
---
apiVersion: v1
kind: Pod
metadata:
  name: consume-2-cpu
  namespace: foo
spec:
  containers:
  - name: test
    image: busybox:latest
    command: ["/bin/sh", "-c", "tail -f /dev/null"]
    resources:
      requests:
        cpu: 2
      limits:
        cpu: 2
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: big-cpu-mpi-job
  namespace: foo
spec:
  minAvailable: 3
  schedulerName: volcano
  keepSchedule: true
  plugins:
    ssh: []
    svc: []
  tasks:
  - replicas: 1
    name: mpimaster
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
            mkdir -p /var/run/sshd; /usr/sbin/sshd;
            mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world > /home/re;
          image: example-mpi:0.0.1
          name: mpimaster
          ports:
          - containerPort: 22
            name: mpijob-port
          workingDir: /home
          resources:
            limits:
              cpu: 1
        restartPolicy: OnFailure
  - replicas: 2
    name: mpiworker
    template:
      spec:
        containers:
        - command:
          - /bin/sh
          - -c
          - |
            mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
          image: example-mpi:0.0.1
          name: mpiworker
          ports:
          - containerPort: 22
            name: mpijob-port
          workingDir: /home
          resources:
            limits:
              cpu: 2
        restartPolicy: OnFailure
I guess the Volcano scheduler may only consider the node's resources and ignore the namespace's resource quota.
Yeah, so Volcano should consider resource quotas. It's an important point indeed.
Yeah, from this perspective, this PR is a workaround for the issue. I see that PR #1100 considers resource quotas.
@@ -338,6 +338,8 @@ func (cc *jobcontroller) processNextReq(count uint32) bool {
		cc.recordJobEvent(jobInfo.Job.Namespace, jobInfo.Job.Name, batchv1alpha1.ExecuteAction, fmt.Sprintf(
			"Job failed on action %s for retry limit reached", action))
		klog.Warningf("Dropping job<%s/%s> out of the queue: %v because max retries has reached", jobInfo.Job.Namespace, jobInfo.Job.Name, err)
		err = st.Execute(busv1alpha1.TerminateJobAction)
Please also check the error returned by this execution.
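One way to address this, sketched as a fragment of the patched block above; the logging format is an assumption, while st, busv1alpha1, klog, and jobInfo come from the surrounding controller code shown in the diff.

// Check and surface the error from the terminate attempt instead of discarding it.
if err := st.Execute(busv1alpha1.TerminateJobAction); err != nil {
	klog.Errorf("Failed to terminate job <%s/%s> after retry limit reached: %v",
		jobInfo.Job.Namespace, jobInfo.Job.Name, err)
}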
/approve
Maybe we should not merge this PR because of the risk of deadlock. We should think about it more.
/approve cancel
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: liuchintao. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/cc @william-wang
Signed-off-by: liujintao <liujintao@cambricon.com>
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
More info: #1031
The job controller only retries 15 times to execute a job's action, which is enough in most cases.
After the job controller has retried and failed 15 times, the worker just drops the request from the working queue and does nothing else.
If the worker's failure count exceeds the max retry limit while executing the syncJob action, the job stays in Pending forever, and the resources held by the job's pods are never released.
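For context, here is a minimal sketch of the common client-go worker pattern this description refers to: after too many requeues the key is forgotten, so nothing ever reconciles the job again. The workqueue calls are real client-go APIs; the handler around them is illustrative rather than Volcano's actual processNextReq.

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 15

// processNextItem sketches how a typical controller worker handles failures:
// it requeues with rate limiting up to maxRetries, then forgets the key,
// which is the "drop the req and do nothing" behavior described above.
func processNextItem(queue workqueue.RateLimitingInterface, sync func(string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := sync(key.(string)); err != nil {
		if queue.NumRequeues(key) < maxRetries {
			queue.AddRateLimited(key) // retry later with backoff
			return true
		}
		// Retry limit reached: the key is dropped and never reconciled again,
		// so any resources the job's pods already hold stay allocated.
		fmt.Printf("dropping %v out of the queue: %v\n", key, err)
	}
	queue.Forget(key)
	return true
}

func main() {
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	q.Add("foo/big-cpu-mpi-job")
	processNextItem(q, func(key string) error { return fmt.Errorf("not enough quota") })
}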