Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: no early return when feasibility check fails on eligible nodes #13274

Merged
merged 1 commit into from
Jun 7, 2022

Conversation

tgross
Copy link
Member

@tgross tgross commented Jun 7, 2022

Fixes #11778

As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.


Note that the test here is potentially limited in reliability because of the random iterator, but:

$ go test -v ./scheduler -run TestServiceSched_CSITopology -count=100
...
--- PASS: TestServiceSched_CSITopology (0.00s)
PASS
ok      github.com/hashicorp/nomad/scheduler    0.424s

@tgross tgross added theme/storage backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line type/bug labels Jun 7, 2022
@tgross tgross added this to the 1.3.2 milestone Jun 7, 2022
@tgross tgross requested review from schmichael and lgfa29 June 7, 2022 15:17
@tgross tgross marked this pull request as ready for review June 7, 2022 15:17
@tgross tgross requested a review from shoenig June 7, 2022 15:17
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 line bug fix with 100 lines of testing; LGTM!

@ygersie
Copy link
Contributor

ygersie commented Jun 7, 2022

💯

@tgross tgross merged commit b747123 into main Jun 7, 2022
@tgross tgross deleted the b-csi-feasibility-check branch June 7, 2022 17:31
tbehling pushed a commit that referenced this pull request Jun 29, 2022
…13274)

As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
@github-actions
Copy link

github-actions bot commented Oct 7, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line backport/1.3.x backport to 1.3.x release line theme/storage type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CSI volume per_alloc availability zone placement
3 participants