Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head #2061

Merged: 1 commit into kubernetes-sigs:main on Apr 26, 2024

Conversation

mimowo
Contributor

@mimowo mimowo commented Apr 25, 2024

What type of PR is this?

/kind cleanup
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #2020

Special notes for your reviewer:

In this scenario, the workload pendingAlphaWl is occasionally admitted before useAllAlphaWl
is requeued.

The scenario can be reproduced reliably by inserting a time delay (say, 500ms) between the calls to r.cache.DeleteWorkload(wl) and r.queues.AddOrUpdateWorkload(wlCopy).
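
The race window described above can be modeled deterministically. The following is a minimal sketch, not Kueue code: the `cluster` type, its methods, the priorities, and the quota values are all hypothetical, and the `gap` callback stands in for the delay during which a scheduling cycle runs.

```go
package main

import "fmt"

// workload is a hypothetical stand-in for a Kueue Workload.
type workload struct {
	name     string
	priority int
	cpu      int
}

// cluster is a toy model: admitted workloads consume quota, pending ones wait.
type cluster struct {
	quota    int
	admitted map[string]workload
	pending  []workload // kept sorted, highest priority first
}

// used sums the quota consumed by admitted workloads.
func (c *cluster) used() int {
	total := 0
	for _, w := range c.admitted {
		total += w.cpu
	}
	return total
}

// enqueue inserts w before the first strictly lower-priority entry.
func (c *cluster) enqueue(w workload) {
	i := 0
	for i < len(c.pending) && c.pending[i].priority >= w.priority {
		i++
	}
	c.pending = append(c.pending, workload{})
	copy(c.pending[i+1:], c.pending[i:])
	c.pending[i] = w
}

// scheduleOnce considers only the head of the queue and admits it if it fits.
// It returns the admitted workload's name, or "" if nothing was admitted.
func (c *cluster) scheduleOnce() string {
	if len(c.pending) == 0 {
		return ""
	}
	head := c.pending[0]
	if c.used()+head.cpu > c.quota {
		return ""
	}
	c.pending = c.pending[1:]
	c.admitted[head.name] = head
	return head.name
}

// evictNonAtomic removes the workload from the admitted set and only later
// puts it back in the queue. gap runs between the two steps, like the
// artificial delay used to reproduce the flake.
func (c *cluster) evictNonAtomic(name string, gap func()) {
	w := c.admitted[name]
	delete(c.admitted, name)
	gap()
	c.enqueue(w)
}

// raceDemo returns the name of the workload admitted during the gap.
func raceDemo() string {
	c := &cluster{quota: 2, admitted: map[string]workload{}}
	c.admitted["useAllAlphaWl"] = workload{name: "useAllAlphaWl", priority: 1, cpu: 2}
	c.enqueue(workload{name: "pendingAlphaWl", priority: 0, cpu: 1})

	var slipped string
	c.evictNonAtomic("useAllAlphaWl", func() { slipped = c.scheduleOnce() })
	return slipped
}

func main() {
	fmt.Println("admitted during the gap:", raceDemo())
}
```

Because the evicted workload is in neither the admitted set nor the queue while `gap` runs, the scheduling cycle sees free quota and a queue whose head is the lower-priority workload.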

Does this PR introduce a user-facing change?

Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head.
Previously, during priority-based preemption, a lower-priority
workload could be admitted while a higher-priority workload was being evicted.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. labels Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 25, 2024

netlify bot commented Apr 25, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 3377020 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/662b5d37cbbd2e0008cab6a3 |

@mimowo
Contributor Author

mimowo commented Apr 25, 2024

/assign @alculquicondor
For now, this is a minimal change to fix the test, and it still seems to capture the test's intention.
An alternative could be to make sure that pendingAlphaWl and preemptorBetaWl don't fit together; this would probably still capture the intention, but I'm not fully sure.

@alculquicondor
Contributor

This is a test for StrictFIFO, so we should not let other workloads slip through.

We need to fix the code instead. I think we need to atomically add the workload to the queue, when removing it from the cache.
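
One way to picture the atomicity suggested here is a single lock that covers both the removal and the requeue, so a scheduling cycle can never observe the intermediate state. This is a sketch under stated assumptions, not the change made in this PR: the `state` type, `evictAndRequeue`, `scheduleOnce`, and the quota and priority values are hypothetical, and the toy model does not account for the quota taken by the preemptor.

```go
package main

import (
	"fmt"
	"sync"
)

// workload is a hypothetical stand-in for a Kueue Workload.
type workload struct {
	name     string
	priority int
	cpu      int
}

// state guards the admitted set and the pending queue with ONE mutex, so
// moving a workload between them is a single atomic step for other goroutines.
type state struct {
	mu       sync.Mutex
	quota    int
	admitted map[string]workload
	pending  []workload // kept sorted, highest priority first
}

// enqueueLocked inserts w before the first strictly lower-priority entry.
// The caller must hold s.mu (or be the only goroutine using s).
func (s *state) enqueueLocked(w workload) {
	i := 0
	for i < len(s.pending) && s.pending[i].priority >= w.priority {
		i++
	}
	s.pending = append(s.pending, workload{})
	copy(s.pending[i+1:], s.pending[i:])
	s.pending[i] = w
}

// evictAndRequeue removes the workload from the admitted set and requeues it
// without releasing the lock in between.
func (s *state) evictAndRequeue(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	w, ok := s.admitted[name]
	if !ok {
		return
	}
	delete(s.admitted, name)
	s.enqueueLocked(w)
}

// scheduleOnce admits the head of the queue if it fits and returns its name.
func (s *state) scheduleOnce() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.pending) == 0 {
		return ""
	}
	head := s.pending[0]
	used := 0
	for _, w := range s.admitted {
		used += w.cpu
	}
	if used+head.cpu > s.quota {
		return ""
	}
	s.pending = s.pending[1:]
	s.admitted[head.name] = head
	return head.name
}

// runOnce races one eviction against one scheduling cycle.
func runOnce() string {
	s := &state{quota: 2, admitted: map[string]workload{
		"useAllAlphaWl": {name: "useAllAlphaWl", priority: 1, cpu: 2},
	}}
	s.enqueueLocked(workload{name: "pendingAlphaWl", priority: 0, cpu: 1})

	var got string
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); s.evictAndRequeue("useAllAlphaWl") }()
	go func() { defer wg.Done(); got = s.scheduleOnce() }()
	wg.Wait()
	return got
}

func main() {
	slipped := false
	for i := 0; i < 1000; i++ {
		if runOnce() == "pendingAlphaWl" {
			slipped = true
		}
	}
	fmt.Println("pendingAlphaWl slipped through:", slipped)
}
```

Whichever goroutine wins the lock, the scheduling cycle sees either the quota still fully used or the evicted workload already back at the head, so the lower-priority workload is never admitted in that cycle.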

@mimowo
Contributor Author

mimowo commented Apr 25, 2024

> This is a test for StrictFIFO, so we should not let other workloads slip through.
>
> We need to fix the code instead. I think we need to atomically add the workload to the queue, when removing it from the cache.

Sure, I will explore this option for making it atomic.

My reasoning was that StrictFIFO applies to workloads that are in the queue, but during preemption the evicted workload is temporarily out of the queue.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 25, 2024
@mimowo
Contributor Author

mimowo commented Apr 25, 2024

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 25, 2024
@mimowo
Contributor Author

mimowo commented Apr 25, 2024

> We need to fix the code instead. I think we need to atomically add the workload to the queue, when removing it from the cache.

Pushed a change. PTAL. One thing I'm not sure about is that I now requeue immediately when BestEffortFIFO is used as well,
but I didn't find a problematic scenario, and it takes less code than handling the two queue types differently.
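
For readers unfamiliar with the two queueing strategies mentioned here, the difference in how the next candidate is picked can be sketched as follows. This is an illustration only, assuming the documented behavior that StrictFIFO blocks on a head that does not fit while BestEffortFIFO lets later workloads that fit go ahead; the function names, workload names, and numbers are hypothetical.

```go
package main

import "fmt"

// workload is a hypothetical stand-in for a queued Kueue Workload.
type workload struct {
	name string
	cpu  int
}

// nextStrictFIFO considers only the head of the queue. If the head does not
// fit into the free quota, nothing is admitted.
func nextStrictFIFO(pending []workload, free int) (string, bool) {
	if len(pending) == 0 || pending[0].cpu > free {
		return "", false
	}
	return pending[0].name, true
}

// nextBestEffortFIFO walks the queue in order and returns the first workload
// that fits, so a blocked head does not block the workloads behind it.
func nextBestEffortFIFO(pending []workload, free int) (string, bool) {
	for _, w := range pending {
		if w.cpu <= free {
			return w.name, true
		}
	}
	return "", false
}

func main() {
	// The requeued (evicted) workload sits at the head but no longer fits.
	pending := []workload{{name: "evictedWl", cpu: 4}, {name: "smallWl", cpu: 1}}
	free := 2

	name, ok := nextStrictFIFO(pending, free)
	fmt.Printf("StrictFIFO: %q admitted=%v\n", name, ok)

	name, ok = nextBestEffortFIFO(pending, free)
	fmt.Printf("BestEffortFIFO: %q admitted=%v\n", name, ok)
}
```

Under this model, requeueing the evicted workload immediately matters most for StrictFIFO, where the head decides whether anything is admitted at all.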

Contributor

@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 25, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c7ab2cc316d2f07ef28e2ebf91faf151898ea287

@alculquicondor
Contributor

/hold
Put a release note

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 25, 2024
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 26, 2024
@mimowo mimowo changed the title Adjust the flaky test for preemption Fix handling of eviction in StrictFIFO to ensure the evicted workload is in the head Apr 26, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Apr 26, 2024
@mimowo
Contributor Author

mimowo commented Apr 26, 2024

> /hold
> Put a release note

/hold cancel
Done

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2024
Member

@tenzen-y tenzen-y left a comment

LGTM
Thank you!
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 92b1951571c937e740940122e17693a62e15c9fa

@tenzen-y
Member

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

In response to this:

> /cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot merged commit 62e0a81 into kubernetes-sigs:main Apr 26, 2024
15 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.7 milestone Apr 26, 2024
@k8s-infra-cherrypick-robot

@tenzen-y: #2061 failed to apply on top of branch "release-0.6":

Applying: Adjust the flaky test for preemption
Using index info to reconstruct a base tree...
M	pkg/controller/core/workload_controller.go
M	pkg/queue/manager.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/queue/manager.go
Auto-merging pkg/controller/core/workload_controller.go
CONFLICT (content): Merge conflict in pkg/controller/core/workload_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Adjust the flaky test for preemption
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

> /cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mimowo
Contributor Author

mimowo commented Apr 26, 2024

> @tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

I will prepare the branch manually

@mimowo
Contributor Author

mimowo commented Apr 26, 2024

Actually, it turns out the conflict is only with #2062, and both seem reasonable to include. WDYT?

@tenzen-y
Member

Improve logging of workload status #2062

As I mentioned here (#2062 (comment)), +1 on @mimowo

@tenzen-y
Member

/cherry-pick release-0.6

@k8s-infra-cherrypick-robot

@tenzen-y: new pull request created: #2081

In response to this:

> /cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants