Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

controllers: fix miscalculation of RunningDuration when killing job #3719

Merged
merged 1 commit into from
Sep 25, 2024

Conversation

matbme
Copy link
Contributor

@matbme matbme commented Sep 11, 2024

When restarting a controller pod, it calls killJob on completed jobs which will erroneously change the RunningDuration value for said job, even though it had already been completed. This PR fixes the issue by using the last transition time instead of current time to determine total running duration inside killJob.

Closes #3715.

@volcano-sh-bot
Copy link
Contributor

Welcome @matbme!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Sep 11, 2024
@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 11, 2024
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Sep 11, 2024
@matbme
Copy link
Contributor Author

matbme commented Sep 11, 2024

/assign @Monokaix

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test
/lgtm

@volcano-sh-bot volcano-sh-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. lgtm Indicates that a PR is ready to be merged. labels Sep 12, 2024
@Monokaix
Copy link
Member

/lgtm

@Monokaix
Copy link
Member

Hi, please rebase master.

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 16, 2024
@matbme
Copy link
Contributor Author

matbme commented Sep 16, 2024

/assign @hwdef

@lowang-bh
Copy link
Member

Could you please paste a test result or add an ut?

@matbme
Copy link
Contributor Author

matbme commented Sep 17, 2024

Here's a manual test I did with and without this PR applied:

# ---------- Without PR ----------

$ k describe jobs.batch.volcano.sh test-workload-1 | grep -B 11 'Running Duration'
Status:
  Conditions:
    Last Transition Time:  2024-09-17T19:25:22Z
    Status:                Pending
    Last Transition Time:  2024-09-17T19:25:54Z
    Status:                Running
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completing
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completed
  Min Available:           1
  Running Duration:        6m8.810943003s
  
$ k -n volcano-system rollout restart deployment volcano-controllers
deployment.apps/volcano-controllers restarted

$ k describe jobs.batch.volcano.sh test-workload-1 | grep -B 11 'Running Duration'
Status:
  Conditions:
    Last Transition Time:  2024-09-17T19:25:22Z
    Status:                Pending
    Last Transition Time:  2024-09-17T19:25:54Z
    Status:                Running
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completing
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completed
  Min Available:           1
  Running Duration:        8m36.103534833s
  
  
# ---------- With PR ----------

$ k describe jobs.batch.volcano.sh test-workload-1 | grep -B 11 'Running Duration'
Status:
  Conditions:
    Last Transition Time:  2024-09-17T19:25:22Z
    Status:                Pending
    Last Transition Time:  2024-09-17T19:25:54Z
    Status:                Running
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completing
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completed
  Min Available:           1
  Running Duration:        51s
  
$ k -n volcano-system rollout restart deployment volcano-controllers
deployment.apps/volcano-controllers restarted

$ k describe jobs.batch.volcano.sh test-workload-1 | grep -B 11 'Running Duration'
Status:
  Conditions:
    Last Transition Time:  2024-09-17T19:25:22Z
    Status:                Pending
    Last Transition Time:  2024-09-17T19:25:54Z
    Status:                Running
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completing
    Last Transition Time:  2024-09-17T19:26:12Z
    Status:                Completed
  Min Available:           1
  Running Duration:        51s

Note that Running Duration wasn't correct even before restarting the controllers. My guess is that it was accounting for the time the job was pending, but I'm not entirely sure why my PR "fixes" this.

Regarding unit tests, I originally intended to add a test for this scenario, but had trouble replicating this behavior inside a ut. Please let me know if you have any indicators on how I could achieve this (maybe a e2e would be more fitting).

@Monokaix
Copy link
Member

Please squash to one commit: )

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 19, 2024
Uses the last transition time instead of current time to determine total
running duration.

Signed-off-by: Mateus Melchiades <mateus.melchiades@sap.com>
@matbme
Copy link
Contributor Author

matbme commented Sep 19, 2024

Done!

@hwdef
Copy link
Member

hwdef commented Sep 20, 2024

@william-wang @Thor-wl @shinytang6
Please trigger the CI

@matbme
Copy link
Contributor Author

matbme commented Sep 24, 2024

Hi! Any updates on the CI?

@lowang-bh
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 24, 2024
@Monokaix
Copy link
Member

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2024
@volcano-sh-bot volcano-sh-bot merged commit 2adb0c4 into volcano-sh:master Sep 25, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Garbage Collector recalculates job.Status.RunningDuration after restarting controller
5 participants