feat: job controller would keep trying to Execute action #1062

Closed
wants to merge 1 commit into from

Conversation

liuchintao

more info: #1031

The job controller retries a job's action only 15 times, which is enough in most cases.

After the job controller has retried and failed 15 times, the worker just drops the request from the work queue and does nothing else.
If the worker's failure count exceeds the maximum retry limit while executing the syncJob action,
the job stays Pending forever, and the resources held by the job's pods are never released.
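To make the current behavior concrete, here is a small self-contained Go sketch of the retry-then-drop pattern (it mirrors the usual client-go workqueue usage; maxRetries, syncJob, and the job key are illustrative stand-ins, not the controller's actual code):

// Illustrative sketch of the retry/drop behaviour described above.
package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/util/workqueue"
)

const maxRetries = 15

// syncJob stands in for executing the job's action; it always fails here, the way
// pod creation does while the namespace quota is exhausted.
func syncJob(key string) error {
    return fmt.Errorf("pod creation for %q exceeds the namespace quota", key)
}

func main() {
    // Short backoff so the example finishes quickly; the real controller backs off longer.
    limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Millisecond, 10*time.Millisecond)
    queue := workqueue.NewRateLimitingQueue(limiter)
    queue.Add("foo/example-job")

    for {
        item, shutdown := queue.Get()
        if shutdown {
            return
        }
        key := item.(string)

        if err := syncJob(key); err != nil && queue.NumRequeues(key) < maxRetries {
            queue.AddRateLimited(key) // retry later with backoff
            queue.Done(key)
            continue
        }
        // syncJob always fails in this demo, so we only get here once the retry limit
        // is reached: the request is simply forgotten and dropped. The Job stays
        // Pending and its already-created pods keep holding the quota.
        fmt.Printf("dropping %s after %d requeues; nothing else happens\n", key, queue.NumRequeues(key))
        queue.Forget(key)
        queue.Done(key)
        queue.ShutDown() // end the demo
    }
}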

@volcano-sh-bot
Contributor

Welcome @liuchintao!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 23, 2020
@TravisBuddy

Travis tests have failed

Hey @liuchintao,
Please read the following log in order to understand the failure reason.
It'll be awesome if you fix what's wrong and commit the changes.

TravisBuddy Request Identifier: 65e12b20-fd83-11ea-a706-8f9ec0eccd66

@@ -41,6 +41,9 @@ spec:
description: Specification of the desired behavior of the volcano job, including
the minAvailable
properties:
keepSchedule:
Contributor

Can you describe your usage scenario?

Author

For example, in my namespace foo, there is a resource quota of 5 CPUs.

  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: quota-cpu
    namespace: foo
  spec:
    hard:
      cpu: "5"

My MPI job needs exactly 5 CPUs: the mpi-launcher needs 1 CPU and each of the 2 mpi-workers needs 2 CPUs. Everything looks good.
But before I start my MPI job, I start a pod that consumes 2 CPUs.

Now only 2 MPI pods come up, the launcher and one worker, and I have to wait for that pod to complete and release its resources.
But by then the job controller has reached 15 retries and dropped the request from the worker's queue.
I can't get the last pod any more, even after the earlier pod releases its resources.
And the job does not release the resources it holds either.

Contributor

Perhaps it seems reasonable in some scenarios such as the one you describe, but it may cause a deadlock. For example, suppose there are 10 CPUs in the cluster and another end user launches a job under the same conditions as yours. It happens that he is also running part of his tasks and waiting for your running tasks to release resources. If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources to resolve the problem. That's the reason why we set 15 retries.

Author

If you both wait for each other forever, neither of your jobs will ever finish. The scheduler must take some action to release resources to resolve the problem.

Yeah, you are right, this could lead to a deadlock. Maybe I should approach this problem from the perspective of "how to release resources after the controller retries 15 times".

But right now the job controller just drops the job out of the worker's queue and does nothing.
In the above case:

  1. CPUs are allocated, but no job makes progress;
  2. the controller drops the sync request out of the queue, so IMO the job's status can no longer be synced;
  3. the CPUs can only be released once the job is deleted.

How about releasing the job's resources after 15 failed retries?
Every xxxState could Execute(TerminateJobAction) once the failure limit is reached, as in the sketch below.
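
A rough, self-contained Go sketch of the idea (the State interface, pendingState, and the action names here are illustrative stand-ins, not Volcano's actual types):

// Hypothetical sketch: once the retry limit is hit, ask the job's current state to
// execute a terminate action so its pods are deleted and their quota is released.
package main

import (
    "errors"
    "fmt"
)

type Action string

const (
    SyncJobAction      Action = "SyncJob"
    TerminateJobAction Action = "TerminateJob"
)

// State mirrors the idea of per-phase state objects (pending, running, ...), each of
// which knows how to execute an action for its job.
type State interface {
    Execute(a Action) error
}

type pendingState struct{ job string }

func (s pendingState) Execute(a Action) error {
    switch a {
    case TerminateJobAction:
        fmt.Printf("terminating %s: deleting its pods and releasing their quota\n", s.job)
        return nil
    default:
        return errors.New("pod creation exceeds the namespace quota")
    }
}

func main() {
    const maxRetries = 15
    var st State = pendingState{job: "foo/big-cpu-mpi-job"}

    for retries := 1; ; retries++ {
        if err := st.Execute(SyncJobAction); err == nil {
            return
        }
        if retries >= maxRetries {
            // Instead of silently dropping the request, terminate the job.
            // (The return error is intentionally ignored here, mirroring the one-line patch.)
            _ = st.Execute(TerminateJobAction)
            return
        }
    }
}

With something like this, the CPUs in the quota example above would be freed as soon as the 15th retry fails, instead of only when the job is deleted.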

Contributor

Please show me your scheduler configuration. I guess you have not configured the gang plugin.

Author

This is my scheduler configuration:

apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system

Contributor

@Thor-wl Thor-wl Sep 29, 2020

How do you prove the behavior above? As far as I know, if the gang plugin is configured, the job can only run when the allocatable resources satisfy at least minAvailable tasks; namely, the minAvailable tasks are allocated resources at the same time. Is there any log you can share to help the analysis?

Author

@Thor-wl Sorry, it took so long to reply.

We have deployed a new environment with our customized scheduler. Is it still useful if I describe the situation I met before?

The following is my node's capacity; CPUs are sufficient.

Capacity:
 cpu:                                 48
 ephemeral-storage:                   2228605748Ki
 hugepages-1Gi:                       0
 hugepages-2Mi:                       0
 memory:                              263602412Ki
 pods:                                500

I want to test the scheduler's behavior with a Kubernetes resource quota, so I created a namespace that is only allocated 5 CPUs.

apiVersion: v1
kind: List
items:
- apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: test-quota-cpu
    namespace: foo
  spec:
    hard:
      cpu: "5"
---
apiVersion: v1
kind: Pod
metadata:
  name: consume-2-cpu
  namespace: foo
spec:
  containers:
  - name: test
    image: busybox:latest
    command: ["/bin/sh", "-c", "tail -f /dev/null"]
    resources:
      requests:
        cpu: 2
      limits:
        cpu: 2
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: big-cpu-mpi-job
  namespace: foo
spec:
  minAvailable: 3
  schedulerName: volcano
  keepSchedule: true
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 1
      name: mpimaster
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  MPI_HOST=`cat /etc/volcano/mpiworker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 mpi_hello_world > /home/re;
              image: example-mpi:0.0.1
              name: mpimaster
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
              resources:
                limits:
                  cpu: 1
          restartPolicy: OnFailure
    - replicas: 2
      name: mpiworker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: example-mpi:0.0.1
              name: mpiworker
              ports:
                - containerPort: 22
                  name: mpijob-port
              workingDir: /home
              resources:
                limits:
                  cpu: 2
          restartPolicy: OnFailure

I guess the Volcano scheduler may only consider the node's resources and ignore the namespace's resource quota.

Contributor

Yeah, so Volcano should consider resource quotas. It's an important point indeed.

Author

Yeah, so Volcano should consider resource quotas. It's an important point indeed.

Yeah, from this perspective, this PR is a workaround for the issue; I see that PR #1100 considers resource quotas.

@@ -338,6 +338,8 @@ func (cc *jobcontroller) processNextReq(count uint32) bool {
cc.recordJobEvent(jobInfo.Job.Namespace, jobInfo.Job.Name, batchv1alpha1.ExecuteAction, fmt.Sprintf(
"Job failed on action %s for retry limit reached", action))
klog.Warningf("Dropping job<%s/%s> out of the queue: %v because max retries has reached", jobInfo.Job.Namespace, jobInfo.Job.Name, err)
err = st.Execute(busv1alpha1.TerminateJobAction)
Member

Please also check the return error of this execution.
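
For example, a minimal illustration of handling that return value (execute here is a hypothetical stand-in for st.Execute, and the logging is only illustrative, not part of the patch):

// Hypothetical illustration: surface a failed TerminateJobAction instead of discarding it.
package main

import (
    "fmt"
    "log"
)

// execute stands in for st.Execute(busv1alpha1.TerminateJobAction); it fails here to show the error path.
func execute(action string) error {
    return fmt.Errorf("simulated apiserver error while executing %s", action)
}

func main() {
    if err := execute("TerminateJob"); err != nil {
        // Log the failure (or record a Job event) so operators can see that the
        // terminate-on-retry-limit path did not actually release the job's resources.
        log.Printf("failed to execute TerminateJob: %v", err)
    }
}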

@k82cn
Member

k82cn commented Oct 3, 2020

/approve

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 3, 2020
@Thor-wl
Contributor

Thor-wl commented Oct 9, 2020

/approve

Maybe we should not merge this PR because of the risk of deadlock. We should consider it further.

@k82cn
Member

k82cn commented Oct 9, 2020

/approve cancel

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liuchintao
To complete the pull request process, please assign hzxuzhonghu
You can assign the PR to them by writing /assign @hzxuzhonghu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2020
@k82cn
Member

k82cn commented Oct 9, 2020

/cc @william-wang

Signed-off-by: liujintao <liujintao@cambricon.com>
@stale

stale bot commented Dec 22, 2020

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2020
@stale stale bot closed this Dec 29, 2020