Smarter node draining handling when using CSI plugins #11614

Closed
RickyGrassmuck opened this issue Dec 3, 2021 · 4 comments · Fixed by #12324
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/storage, type/enhancement

Comments

@RickyGrassmuck
Contributor

Proposal

Currently, when draining a client node that has a CSI plugin running as a system job, you have to go through the workflow below to drain the node without leaving volumes attached to it:

  1. Perform a drain on the node with the "Drain System Jobs" option turned off.
  2. Wait for all jobs to migrate off the node and be reallocated (to ensure the volumes have been completely detached from the node).
  3. Perform a second node drain with the "Drain System Jobs" option enabled to shut down the CSI plugin on that node.
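
The three steps above could be scripted roughly as follows. This is only a sketch: it assumes the `nomad node drain` CLI flag `-ignore-system` corresponds to turning the "Drain System Jobs" option off, and both `run` and `wait_for_migration` are hypothetical hooks supplied by the caller (the issue does not prescribe how to detect that volumes have detached).

```python
from typing import Callable, Sequence


def two_phase_drain(
    node_id: str,
    run: Callable[[Sequence[str]], None],
    wait_for_migration: Callable[[str], None],
) -> None:
    """Drain a node in two passes so the CSI plugin is stopped last."""
    # Pass 1: drain everything except system jobs, so the CSI plugin
    # keeps running and can still detach volumes.
    run(["nomad", "node", "drain", "-enable", "-ignore-system", "-yes", node_id])

    # Hypothetical hook: block until the migrated jobs are running elsewhere,
    # which implies their volumes have been detached from this node.
    wait_for_migration(node_id)

    # Pass 2: drain again, this time including system jobs such as the plugin.
    run(["nomad", "node", "drain", "-enable", "-yes", node_id])
```

In practice `run` might be something like `lambda cmd: subprocess.run(cmd, check=True)`, while the wait hook is left to the operator.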

Failing to perform the drain in these steps can leave volumes attached to the node, which prevents the drained jobs from being rescheduled.

There are probably multiple ways to address this scenario. Ideally, I think the node draining logic would tap into the CSI plugin state to determine whether any tasks have volumes mounted on the node being drained and, if so, wait for each of those volumes to be detached before stopping the CSI plugin jobs registered to handle them.
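
To make the proposed check concrete, here is a minimal sketch of the gating logic. It is not Nomad's implementation: `VolumeClaim` and `plugins_safe_to_stop` are hypothetical names, and the sketch assumes the drainer can list which volumes are still attached to the node and which plugin handles each one.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class VolumeClaim:
    """Hypothetical record of a volume still attached to a node."""
    volume_id: str
    plugin_id: str
    node_id: str


def plugins_safe_to_stop(
    node_id: str,
    plugin_ids: Iterable[str],
    claims: Iterable[VolumeClaim],
) -> Set[str]:
    """Return the plugins on node_id that no attached volume depends on."""
    # Plugins that still have at least one volume attached on this node.
    busy = {c.plugin_id for c in claims if c.node_id == node_id}
    # Only plugins with nothing left to detach may be stopped.
    return {p for p in plugin_ids if p not in busy}
```

A drainer following the proposal would re-evaluate this check as volumes detach and stop a plugin only once it appears in the returned set.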

Use-cases

Being able to perform a single node drain and have jobs that use CSI volumes handled properly, so that they can be rescheduled.

Attempted Solutions

As mentioned above, this can be worked around by performing the drain in multiple steps, first without draining system jobs, then draining system jobs after everything has migrated.

@DerekStrickland DerekStrickland added this to Needs Triage in Nomad - Community Issues Triage via automation Dec 3, 2021
@DerekStrickland DerekStrickland moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Dec 3, 2021
@jrasell jrasell added stage/accepted Confirmed, and intend to work on. No timeline commitment though. stage/needs-investigation theme/storage labels Dec 6, 2021
@jrasell
Member

jrasell commented Dec 6, 2021

Hi @rigrassm, thanks a lot for raising this issue. I've marked it so we can further discuss this when it is roadmapped.

@tgross
Member

tgross commented Mar 18, 2022

I've opened #12324 with a change that defers plugins until last, just as we currently do with system jobs. When combined with #11892, which will also ship in Nomad 1.3.0, we should be covered for this feature request. Thanks for opening the issue @RickyGrassmuck!
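
As a rough illustration of "defers plugins until last" (a sketch only, not the code in #12324): the `Alloc` type and `drain_order` function below are hypothetical, and the sketch assumes a plugin allocation can be identified by a flag regardless of its job type.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alloc:
    """Hypothetical, simplified view of an allocation on the draining node."""
    name: str
    job_type: str  # e.g. "service", "batch", or "system"
    is_csi_plugin: bool = False


def drain_order(allocs: Iterable[Alloc]) -> List[Alloc]:
    """Order allocations so system jobs and CSI plugins are stopped last."""

    def phase(alloc: Alloc) -> int:
        # Phase 1 is deferred: system jobs, plus plugins of any job type.
        return 1 if alloc.job_type == "system" or alloc.is_csi_plugin else 0

    # sorted() is stable, so allocations keep their relative order per phase.
    return sorted(allocs, key=phase)
```

The point of the flag is that a plugin running as a service job is deferred in the same way as one running as a system job.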

@tgross tgross added this to the 1.3.0 milestone Mar 18, 2022
Nomad - Community Issues Triage automation moved this from In Progress to Done Mar 22, 2022
@tgross
Member

tgross commented Mar 22, 2022

#12324 has been merged and will ship in Nomad 1.3.0.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022