CSI Volume Attachment/Detachment #6904

Closed

endocrimes opened this issue Jan 7, 2020 · 4 comments

Comments

@endocrimes
Contributor

This is a placeholder issue for designing how Nomad servers should manage the attachment of CSI Volumes, and how they should eventually be detached from a node.

Considerations

  • Servers are currently unaware of when a Client Allocation has been Garbage Collected. This is problematic because Nomad currently does not clean up the local filesystem until the alloc runner is destroyed during Garbage Collection. Volumes will need to be a special case here and be cleaned up when the Alloc goes terminal. This will need to be documented.
  • Regardless of Kill Allocations when client is disconnected from servers #2185, we will need to keep volumes blocked until explicitly unblocked by operators, as we cannot guarantee that the allocation will have shut down / finished writing, and cloud providers do not guarantee that no write path still exists.
  • Busy clusters may have many attachments/detachments, but RPCs to CSI Plugins may be slow/expensive, so we need some form of throttling. (Single-threaded daemon? One worker per plugin per DC? A sketch of one such approach follows this list.)
  • Nomad Plans will need to take into account free volume slots on the node for each plugin.
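To make the throttling question above concrete, here is a minimal Go sketch of one possible approach: a single worker per plugin per DC that serializes controller RPCs behind a bounded queue. All of the names here (`csiRequest`, `dispatcher`, the `ebs-plugin` ID) are hypothetical and not part of Nomad's codebase; this is an illustration of the idea, not the implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// csiRequest is a hypothetical unit of work: one controller RPC to one plugin
// in one datacenter.
type csiRequest struct {
	pluginID string
	dc       string
	call     func(ctx context.Context) error
	done     chan error
}

// dispatcher serializes CSI controller RPCs per (plugin, dc) so that a busy
// cluster cannot overwhelm a slow or expensive plugin with concurrent calls.
type dispatcher struct {
	mu     sync.Mutex
	queues map[string]chan *csiRequest
}

func newDispatcher() *dispatcher {
	return &dispatcher{queues: make(map[string]chan *csiRequest)}
}

// submit enqueues a request on the single worker goroutine for its plugin/dc
// pair, starting that worker lazily on first use, and waits for the result.
func (d *dispatcher) submit(ctx context.Context, req *csiRequest) error {
	key := req.pluginID + "/" + req.dc

	d.mu.Lock()
	q, ok := d.queues[key]
	if !ok {
		q = make(chan *csiRequest, 64) // bounded queue provides backpressure
		d.queues[key] = q
		go func() {
			for r := range q {
				r.done <- r.call(context.Background())
			}
		}()
	}
	d.mu.Unlock()

	req.done = make(chan error, 1)
	select {
	case q <- req:
	case <-ctx.Done():
		return ctx.Err()
	}
	select {
	case err := <-req.done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	d := newDispatcher()
	err := d.submit(context.Background(), &csiRequest{
		pluginID: "ebs-plugin", // hypothetical plugin ID
		dc:       "dc1",
		call: func(ctx context.Context) error {
			time.Sleep(10 * time.Millisecond) // stand-in for a slow plugin RPC
			return nil
		},
	})
	fmt.Println("publish result:", err)
}
```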
@tgross
Member

tgross commented Jan 31, 2020

Partly for my own understanding and to help sync up in a discussion with @langmartin, I've broken this down in a little more detail into something that might make a good section of the RFC at some point:

Publishing Volumes

When the client receives an allocation that includes a CSI volume, the alloc runner will block its main Run loop (a Go sketch of this sequence follows the list below).

  • The alloc runner will issue a VolumeClaim RPC to the Nomad servers. Note that this is required to track the claim even if the plugin doesn't have a Controller.
  • The Nomad server will add the claim to the state store and, if necessary, issue a ControllerPublishVolume CSI RPC to the Controller plugin. The VolumeClaim RPC response will include enough information for the client to decide whether it can retry or whether the request is invalid and the client should abort the allocation runner.
  • On success, the client will issue a NodeStageVolume CSI RPC to the Node plugin if there is no other claim on that volume on the node.
  • On success, the client will issue a NodePublishVolume CSI RPC to the Node plugin.
  • If all 3 of the above RPCs are successful, the client has a promise that the volume will eventually be available in the LocalVolumeStatus, so it begins polling that endpoint (with a VolumeReadyTimeout). If the client reaches the VolumeReadyTimeout, it marks the alloc as terminal.
  • The client mounts the volume and unblocks the alloc runner's Run loop.
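A rough, library-style Go sketch of the publish sequence above. The `serverRPC` and `nodePlugin` interfaces, the `volumeReadyTimeout` value, and the function signature are all assumptions for illustration; the real Nomad and CSI client types differ. The point is the ordering and the bounded readiness poll.

```go
package csipublish

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical interfaces standing in for the Nomad server RPC endpoint and
// the CSI Node plugin; the real signatures differ.
type serverRPC interface {
	// VolumeClaim registers the claim server-side and triggers
	// ControllerPublishVolume when the plugin has a Controller.
	VolumeClaim(ctx context.Context, volID, allocID string) (retryable bool, err error)
}

type nodePlugin interface {
	NodeStageVolume(ctx context.Context, volID string) error
	NodePublishVolume(ctx context.Context, volID, targetPath string) error
	VolumeReady(ctx context.Context, volID string) (bool, error) // stand-in for LocalVolumeStatus polling
}

const volumeReadyTimeout = 1 * time.Minute // illustrative value

// publishVolume blocks the alloc runner's Run loop until the volume is
// published and ready, or returns an error so the caller can fail the alloc.
func publishVolume(ctx context.Context, srv serverRPC, node nodePlugin,
	volID, allocID, targetPath string, firstClaimOnNode bool) error {

	// 1. Claim the volume with the servers (tracked even without a Controller).
	if retryable, err := srv.VolumeClaim(ctx, volID, allocID); err != nil {
		if retryable {
			return fmt.Errorf("volume claim failed, may retry: %w", err)
		}
		return fmt.Errorf("volume claim invalid, aborting alloc runner: %w", err)
	}

	// 2. Stage the volume on the node only if no other claim exists here.
	if firstClaimOnNode {
		if err := node.NodeStageVolume(ctx, volID); err != nil {
			return err
		}
	}

	// 3. Publish the volume for this allocation at its target path.
	if err := node.NodePublishVolume(ctx, volID, targetPath); err != nil {
		return err
	}

	// 4. Poll until the volume reports ready, bounded by VolumeReadyTimeout;
	// on timeout the caller marks the alloc terminal.
	deadline := time.Now().Add(volumeReadyTimeout)
	for time.Now().Before(deadline) {
		ready, err := node.VolumeReady(ctx, volID)
		if err == nil && ready {
			return nil // caller mounts the volume and unblocks the Run loop
		}
		time.Sleep(2 * time.Second)
	}
	return errors.New("volume did not become ready before VolumeReadyTimeout")
}
```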

Tasks:

Unpublishing Volumes

When an alloc is marked terminal, the client will issue NodeUnstageVolume and NodeUnpublishVolume CSI RPCs for its volumes and wait for the responses (these must complete before any ControllerUnpublishVolume is issued, so we need to block here). The client then syncs the terminal alloc state with the server via the Node.UpdateAlloc RPC (every 200ms).
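A minimal Go sketch of that client-side step, again with hypothetical `nodePlugin` and `serverRPC` interfaces rather than Nomad's real types. One assumption made explicit here: per the CSI spec, NodeUnpublishVolume runs before NodeUnstageVolume, and the terminal state is only reported to the server after both node RPCs complete.

```go
package csiunpublish

import (
	"context"
	"time"
)

// Hypothetical stand-ins for the Node plugin client and the server RPC
// endpoint; the real Nomad types and signatures differ.
type nodePlugin interface {
	NodeUnpublishVolume(ctx context.Context, volID, targetPath string) error
	NodeUnstageVolume(ctx context.Context, volID string) error
}

type serverRPC interface {
	UpdateAlloc(ctx context.Context, allocID string, terminal bool) error
}

// unpublishOnTerminal blocks until the node-side RPCs have completed, since
// ControllerUnpublishVolume must not run before they finish, and only then
// syncs the terminal alloc state with the servers.
func unpublishOnTerminal(ctx context.Context, node nodePlugin, srv serverRPC,
	allocID, volID, targetPath string) error {

	// Node-side cleanup first: unmount for this alloc, then unstage the volume.
	if err := node.NodeUnpublishVolume(ctx, volID, targetPath); err != nil {
		return err
	}
	if err := node.NodeUnstageVolume(ctx, volID); err != nil {
		return err
	}

	// Report the terminal alloc on a ~200ms cadence; the server can now
	// safely move on to ControllerUnpublishVolume.
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := srv.UpdateAlloc(ctx, allocID, true); err == nil {
				return nil
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```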

The Node.UpdateAlloc RPC handler at the server will emit a volume claim eval for allocs that are terminal and have a CSI volume. This eval is handled by the core job scheduler. It issues a ControllerUnpublishVolume RPC for allocs that have a controller plugin. Once this returns (or is skipped), the server will release the volume claim for that allocation.
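And a corresponding sketch of what the server-side volume claim eval would do for one terminal allocation: unpublish at the controller (if the plugin has one), then release the claim. The `controllerPlugin` and `stateStore` interfaces are invented for illustration, not Nomad's actual state-store or plugin client APIs.

```go
package csiclaimgc

import "context"

// Hypothetical controller-plugin and state-store interfaces.
type controllerPlugin interface {
	ControllerUnpublishVolume(ctx context.Context, volID, nodeID string) error
}

type stateStore interface {
	ReleaseVolumeClaim(volID, allocID string) error
}

// volumeClaimGC is the work a core-scheduler volume claim eval would do for
// one terminal allocation.
func volumeClaimGC(ctx context.Context, ctrl controllerPlugin, state stateStore,
	volID, allocID, nodeID string, hasController bool) error {

	if hasController {
		// Safe only after the node plugin has unpublished/unstaged the volume.
		if err := ctrl.ControllerUnpublishVolume(ctx, volID, nodeID); err != nil {
			return err // the eval can be retried later
		}
	}
	// With the controller step done (or skipped), drop the claim so the
	// volume can be claimed elsewhere.
	return state.ReleaseVolumeClaim(volID, allocID)
}
```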

Tasks:

Node Failure

The unpublishing workflow above assumes the cooperation of a healthy client that can report that its allocs are terminal. We'll reconcile the node-failure case in two ways:

  1. A client that is out of contact with the server for too long will mark its allocs as terminal, which will call NodeUnstageVolume and NodeUnpublishVolume.

  2. The server will also mark the allocs as terminal. The server garbage-collects jobs in the periodic job-gc core job every 5 minutes. If the job's allocations have CSI volume claims, we'll call the same volume claim garbage collection we do on normal Node.UpdateAlloc calls for terminal allocs. That'll result in the server issuing a ControllerUnpublishVolume for any remaining claims and releasing the volume claim.
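A hedged sketch of what that periodic pass might look like, reusing the same per-claim GC as the Node.UpdateAlloc path. The `stateStore` query and the `claim` shape are assumptions for illustration; Nomad's real core-job plumbing and schema differ.

```go
package csijobgc

import (
	"context"
	"time"
)

// Hypothetical view of the state store used by the periodic core job.
type claim struct {
	VolID, AllocID, NodeID string
	HasController          bool
}

type stateStore interface {
	// ClaimsForTerminalAllocs returns CSI claims still held by allocs the
	// server considers terminal (including allocs on down nodes).
	ClaimsForTerminalAllocs(ctx context.Context) ([]claim, error)
}

// gcFunc is the same per-claim work performed on the Node.UpdateAlloc path
// (ControllerUnpublishVolume if needed, then release the claim).
type gcFunc func(ctx context.Context, c claim) error

// runJobGC mirrors the 5-minute job-gc pass described above.
func runJobGC(ctx context.Context, state stateStore, gc gcFunc) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			claims, err := state.ClaimsForTerminalAllocs(ctx)
			if err != nil {
				continue // retry on the next pass
			}
			for _, c := range claims {
				_ = gc(ctx, c) // failures are retried on the next pass
			}
		case <-ctx.Done():
			return
		}
	}
}
```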

Tasks:

@endocrimes
Contributor Author

@tgross This sgtm

@tgross
Member

tgross commented Feb 14, 2020

Going to close this in favor of the broken-out issues.

@tgross tgross closed this as completed Feb 14, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022