CSI Volume Attachment/Detachment #6904

Closed

endocrimes opened this issue Jan 7, 2020 · 4 comments

Comments

@endocrimes
Contributor

This is a placeholder issue for designing how Nomad servers should manage the attachment of CSI Volumes, and how they should eventually be detached from a node.

Considerations

  • Servers are currently unaware of when a Client Allocation has been Garbage Collected. This is problematic because Nomad currently does not clean up the local filesystem until the alloc runner is destroyed during Garbage Collection. Volumes will need to be a special case here and be cleaned up when the Alloc goes terminal. This will need to be documented.
  • Regardless of Kill Allocations when client is disconnected from servers #2185, we will need to keep volumes blocked until explicitly unblocked by operators, as we cannot guarantee that the allocation will have shut down / finished writing, and cloud providers do not guarantee that no write path still exists.
  • Busy clusters may have many attachments/detachments, but RPCs to CSI Plugins may be slow/expensive, so we need some form of throttling. (Single-threaded daemon? One worker per plugin per DC? A sketch of one such approach follows this list.)
  • Nomad Plans will need to take into account free volume slots on the node for each plugin.
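To make the throttling question above concrete, here is a minimal Go sketch of one possible approach: a single worker per plugin per DC that serializes controller RPCs behind a bounded queue. All of the names here (`csiRequest`, `dispatcher`, the `ebs-plugin` ID) are hypothetical and not part of Nomad's codebase; this is an illustration of the idea, not the implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// csiRequest is a hypothetical unit of work: one controller RPC to one plugin
// in one datacenter.
type csiRequest struct {
	pluginID string
	dc       string
	call     func(ctx context.Context) error
	done     chan error
}

// dispatcher serializes CSI controller RPCs per (plugin, dc) so that a busy
// cluster cannot overwhelm a slow or expensive plugin with concurrent calls.
type dispatcher struct {
	mu     sync.Mutex
	queues map[string]chan *csiRequest
}

func newDispatcher() *dispatcher {
	return &dispatcher{queues: make(map[string]chan *csiRequest)}
}

// submit enqueues a request on the single worker goroutine for its plugin/dc
// pair, starting that worker lazily on first use, and waits for the result.
func (d *dispatcher) submit(ctx context.Context, req *csiRequest) error {
	key := req.pluginID + "/" + req.dc

	d.mu.Lock()
	q, ok := d.queues[key]
	if !ok {
		q = make(chan *csiRequest, 64) // bounded queue provides backpressure
		d.queues[key] = q
		go func() {
			for r := range q {
				r.done <- r.call(context.Background())
			}
		}()
	}
	d.mu.Unlock()

	req.done = make(chan error, 1)
	select {
	case q <- req:
	case <-ctx.Done():
		return ctx.Err()
	}
	select {
	case err := <-req.done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	d := newDispatcher()
	err := d.submit(context.Background(), &csiRequest{
		pluginID: "ebs-plugin", // hypothetical plugin ID
		dc:       "dc1",
		call: func(ctx context.Context) error {
			time.Sleep(10 * time.Millisecond) // stand-in for a slow plugin RPC
			return nil
		},
	})
	fmt.Println("publish result:", err)
}
```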
@tgross
Member

tgross commented Jan 31, 2020

Partly for my own understanding and to help sync up in a discussion with @langmartin, I've broken this down in a little more detail into something that might make a good section of the RFC at some point:

Publishing Volumes

When the client receives an allocation that includes a CSI volume, the alloc runner will block its main Run loop (a Go sketch of this sequence follows the list below).

  • The alloc runner will issue a VolumeClaim RPC to the Nomad servers. Note that this is required to track the claim even if the plugin doesn't have a Controller.
  • The Nomad server will add the claim to the state store and, if necessary, issue a ControllerPublishVolume CSI RPC to the Controller plugin. The VolumeClaim RPC response will include enough information for the client to decide whether it can retry or whether the request is invalid and the client should abort the allocation runner.
  • On success, the client will issue a NodeStageVolume CSI RPC to the Node plugin if there is no other claim on that volume on the node.
  • On success, the client will issue a NodePublishVolume CSI RPC to the Node plugin.
  • If all 3 of the above RPCs are successful, the client has a promise that the volume will eventually be available in the LocalVolumeStatus, so it begins polling that endpoint (with a VolumeReadyTimeout). If the client reaches the VolumeReadyTimeout, it marks the alloc as terminal.
  • The client mounts the volume and unblocks the alloc runner's Run loop.
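A rough, library-style Go sketch of the publish sequence above. The `serverRPC` and `nodePlugin` interfaces, the `volumeReadyTimeout` value, and the function signature are all assumptions for illustration; the real Nomad and CSI client types differ. The point is the ordering and the bounded readiness poll.

```go
package csipublish

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical interfaces standing in for the Nomad server RPC endpoint and
// the CSI Node plugin; the real signatures differ.
type serverRPC interface {
	// VolumeClaim registers the claim server-side and triggers
	// ControllerPublishVolume when the plugin has a Controller.
	VolumeClaim(ctx context.Context, volID, allocID string) (retryable bool, err error)
}

type nodePlugin interface {
	NodeStageVolume(ctx context.Context, volID string) error
	NodePublishVolume(ctx context.Context, volID, targetPath string) error
	VolumeReady(ctx context.Context, volID string) (bool, error) // stand-in for LocalVolumeStatus polling
}

const volumeReadyTimeout = 1 * time.Minute // illustrative value

// publishVolume blocks the alloc runner's Run loop until the volume is
// published and ready, or returns an error so the caller can fail the alloc.
func publishVolume(ctx context.Context, srv serverRPC, node nodePlugin,
	volID, allocID, targetPath string, firstClaimOnNode bool) error {

	// 1. Claim the volume with the servers (tracked even without a Controller).
	if retryable, err := srv.VolumeClaim(ctx, volID, allocID); err != nil {
		if retryable {
			return fmt.Errorf("volume claim failed, may retry: %w", err)
		}
		return fmt.Errorf("volume claim invalid, aborting alloc runner: %w", err)
	}

	// 2. Stage the volume on the node only if no other claim exists here.
	if firstClaimOnNode {
		if err := node.NodeStageVolume(ctx, volID); err != nil {
			return err
		}
	}

	// 3. Publish the volume for this allocation at its target path.
	if err := node.NodePublishVolume(ctx, volID, targetPath); err != nil {
		return err
	}

	// 4. Poll until the volume reports ready, bounded by VolumeReadyTimeout;
	// on timeout the caller marks the alloc terminal.
	deadline := time.Now().Add(volumeReadyTimeout)
	for time.Now().Before(deadline) {
		ready, err := node.VolumeReady(ctx, volID)
		if err == nil && ready {
			return nil // caller mounts the volume and unblocks the Run loop
		}
		time.Sleep(2 * time.Second)
	}
	return errors.New("volume did not become ready before VolumeReadyTimeout")
}
```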

Tasks:

Unpublishing Volumes

When an alloc is marked terminal, the client will issue NodeUnstageVolume and NodeUnpublishVolume CSI RPCs for its volumes and wait for the responses (these must complete before any ControllerUnpublishVolume is issued, so we need to block here). The client then syncs the terminal alloc state with the server via the Node.UpdateAlloc RPC (every 200ms).
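A minimal Go sketch of that client-side step, again with hypothetical `nodePlugin` and `serverRPC` interfaces rather than Nomad's real types. One assumption made explicit here: per the CSI spec, NodeUnpublishVolume runs before NodeUnstageVolume, and the terminal state is only reported to the server after both node RPCs complete.

```go
package csiunpublish

import (
	"context"
	"time"
)

// Hypothetical stand-ins for the Node plugin client and the server RPC
// endpoint; the real Nomad types and signatures differ.
type nodePlugin interface {
	NodeUnpublishVolume(ctx context.Context, volID, targetPath string) error
	NodeUnstageVolume(ctx context.Context, volID string) error
}

type serverRPC interface {
	UpdateAlloc(ctx context.Context, allocID string, terminal bool) error
}

// unpublishOnTerminal blocks until the node-side RPCs have completed, since
// ControllerUnpublishVolume must not run before they finish, and only then
// syncs the terminal alloc state with the servers.
func unpublishOnTerminal(ctx context.Context, node nodePlugin, srv serverRPC,
	allocID, volID, targetPath string) error {

	// Node-side cleanup first: unmount for this alloc, then unstage the volume.
	if err := node.NodeUnpublishVolume(ctx, volID, targetPath); err != nil {
		return err
	}
	if err := node.NodeUnstageVolume(ctx, volID); err != nil {
		return err
	}

	// Report the terminal alloc on a ~200ms cadence; the server can now
	// safely move on to ControllerUnpublishVolume.
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := srv.UpdateAlloc(ctx, allocID, true); err == nil {
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```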

The Node.UpdateAlloc RPC handler at the server will emit a volume claim eval for allocs that are terminal and have a CSI volume. This eval is handled by the core job scheduler. It issues a ControllerUnpublishVolume RPC for allocs that have a controller plugin. Once this returns (or is skipped), the server will release the volume claim for that allocation.
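And a corresponding sketch of what the server-side volume claim eval would do for one terminal allocation: unpublish at the controller (if the plugin has one), then release the claim. The `controllerPlugin` and `stateStore` interfaces are invented for illustration, not Nomad's actual state-store or plugin client APIs.

```go
package csiclaimgc

import "context"

// Hypothetical controller-plugin and state-store interfaces.
type controllerPlugin interface {
	ControllerUnpublishVolume(ctx context.Context, volID, nodeID string) error
}

type stateStore interface {
	ReleaseVolumeClaim(volID, allocID string) error
}

// volumeClaimGC is the work a core-scheduler volume claim eval would do for
// one terminal allocation.
func volumeClaimGC(ctx context.Context, ctrl controllerPlugin, state stateStore,
	volID, allocID, nodeID string, hasController bool) error {

	if hasController {
		// Safe only after the node plugin has unpublished/unstaged the volume.
		if err := ctrl.ControllerUnpublishVolume(ctx, volID, nodeID); err != nil {
			return err // the eval can be retried later
		}
	}
	// With the controller step done (or skipped), drop the claim so the
	// volume can be claimed elsewhere.
	return state.ReleaseVolumeClaim(volID, allocID)
}
```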

Tasks:

Node Failure

The unpublishing workflow above assumes the cooperation of a healthy client that can report that its allocs are terminal. We'll reconcile the node-failure case in two ways:

  1. A client that is out of contact with the server for too long will mark its allocs as terminal, which will call NodeUnstageVolume and NodeUnpublishVolume.

  2. The server will also mark the allocs as terminal. The server garbage-collects jobs in the periodic job-gc core job every 5 minutes. If the job's allocations have CSI volume claims, we'll call the same volume claim garbage collection we do on normal Node.UpdateAlloc calls for terminal allocs. That'll result in the server issuing a ControllerUnpublishVolume for any remaining claims and releasing the volume claim.
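A hedged sketch of what that periodic pass might look like, reusing the same per-claim GC as the Node.UpdateAlloc path. The `stateStore` query and the `claim` shape are assumptions for illustration; Nomad's real core-job plumbing and schema differ.

```go
package csijobgc

import (
	"context"
	"time"
)

// Hypothetical view of the state store used by the periodic core job.
type claim struct {
	VolID, AllocID, NodeID string
	HasController          bool
}

type stateStore interface {
	// ClaimsForTerminalAllocs returns CSI claims still held by allocs the
	// server considers terminal (including allocs on down nodes).
	ClaimsForTerminalAllocs(ctx context.Context) ([]claim, error)
}

// gcFunc is the same per-claim work performed on the Node.UpdateAlloc path
// (ControllerUnpublishVolume if needed, then release the claim).
type gcFunc func(ctx context.Context, c claim) error

// runJobGC mirrors the 5-minute job-gc pass described above.
func runJobGC(ctx context.Context, state stateStore, gc gcFunc) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			claims, err := state.ClaimsForTerminalAllocs(ctx)
			if err != nil {
				continue // retry on the next pass
			}
			for _, c := range claims {
				_ = gc(ctx, c) // failures are retried on the next pass
			}
		case <-ctx.Done():
			return
		}
	}
}
```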

Tasks:

@endocrimes
Contributor Author

@tgross This sgtm

@tgross
Member

tgross commented Feb 14, 2020

Going to close this in favor of the broken-out issues.

@tgross tgross closed this as completed Feb 14, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022