CSI: use job status not alloc status for plugin updates from summary #12027

Merged Feb 9, 2022 (3 commits)

@tgross tgross commented Feb 8, 2022

Fixes #9810 #10073

When an allocation is updated, the job summary for the associated job is also updated. CSI uses the job summary to set the expected count for controller and node plugins. We incorrectly used the allocation's server status instead of the job status when deciding whether to update or remove the job from the plugins. This caused a node drain or other terminal state for an allocation to clear the expected count for the entire plugin.

Use the job status to guide whether to update or remove the expected count.
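
To make the failure mode concrete, here is a minimal Go sketch of the decision described above. It is not the code from this changeset: `Job`, `Alloc`, `Plugin`, `updateBuggy`, and `updateFixed` are hypothetical, stripped-down names invented for illustration, and the summary count is passed in as a plain integer rather than read from a real job summary.

```go
package main

import "fmt"

// Job, Alloc, and Plugin are hypothetical stand-ins used only for
// illustration; they are not Nomad's real structs.
type Job struct {
	ID      string
	Stopped bool // assumption: models "the job itself is in a terminal state"
}

type Alloc struct {
	ID             string
	ServerTerminal bool // assumption: models a server-side terminal status, e.g. after a node drain
}

// Plugin tracks how many plugin instances each job is expected to run.
type Plugin struct {
	expectedByJob map[string]int
}

func NewPlugin() *Plugin {
	return &Plugin{expectedByJob: map[string]int{}}
}

// Expected sums the expected instance count across all jobs.
func (p *Plugin) Expected() int {
	total := 0
	for _, n := range p.expectedByJob {
		total += n
	}
	return total
}

// updateBuggy keys the decision on the allocation: a single terminal
// allocation removes the whole job from the plugin.
func updateBuggy(p *Plugin, job Job, alloc Alloc, summaryCount int) {
	if alloc.ServerTerminal {
		delete(p.expectedByJob, job.ID)
		return
	}
	p.expectedByJob[job.ID] = summaryCount
}

// updateFixed keys the decision on the job: the job is removed only when the
// job itself is terminal, whatever happened to an individual allocation.
func updateFixed(p *Plugin, job Job, _ Alloc, summaryCount int) {
	if job.Stopped {
		delete(p.expectedByJob, job.ID)
		return
	}
	p.expectedByJob[job.ID] = summaryCount
}

func main() {
	job := Job{ID: "csi-node-plugin"}
	drained := Alloc{ID: "alloc-1", ServerTerminal: true}

	buggy := NewPlugin()
	buggy.expectedByJob[job.ID] = 3
	updateBuggy(buggy, job, drained, 3)
	fmt.Println("alloc status:", buggy.Expected()) // prints "alloc status: 0"

	fixed := NewPlugin()
	fixed.expectedByJob[job.ID] = 3
	updateFixed(fixed, job, drained, 3)
	fmt.Println("job status:", fixed.Expected()) // prints "job status: 3"
}
```

In this toy model, draining one allocation wipes the expected count when the decision follows the allocation, while the count survives when the decision follows the job, which is still running.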

Notes for reviewers:

  • The actual bug fix is this one-liner: aa583a0
  • We missed this bug because the CSI plugin lifecycle state tests incorrectly modeled the updates we received from servers vs those we received from clients, leading to test assertions that passed when they should not. And just generally, the tests were hard to read and understand. I've heavily reworked the tests to clarify each step in the lifecycle of plugin allocations with a subtest. That's all in 2ae8ad2 (also, that diff might be easier to read if you do a side-by-side).
  • There's still some odd behavior around plugin GC, but I'm leaving that as out of scope for this changeset. See "Unable to normally stop and purge system job with csi plugin" (#11758) for one example. We also intentionally delete plugins without instances even if they still have volumes, which seems like a really dubious choice now, so I want to follow up on that.

tgross commented Feb 8, 2022

The unit tests are currently failing because of #12028. I'll rebase on main once that's merged. (Done.)

The existing CSI tests for the state store incorrectly modeled the
updates we received from servers vs those we received from clients,
leading to test assertions that passed when they should not.

Rework the tests to clarify each step in the lifecycle.
@shoenig shoenig left a comment

LGTM! I really like this test refactoring

@github-actions
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 18, 2022
Labels
backport/1.1.x (backport to 1.1.x release line)
backport/1.2.x (backport to 1.2.x release line)
Development

Successfully merging this pull request may close these issues.

Draining a node confuses CSI plugin node health