CSI: persist previous mounts on client to restore during restart #17840

Merged on Jul 10, 2023 (1 commit)

Commits on Jul 7, 2023

  1. CSI: persist previous mounts on client to restore during restart

    When claiming a CSI volume, we need to ensure the CSI node plugin is running
    before we send any CSI RPCs. This extends even to the controller publish RPC
    because it requires the storage provider's "external node ID" for the
    client. This primarily impacts client restarts but is also a problem if the node
    plugin exits (and fingerprints) while the allocation that needs a CSI volume
    claim is being placed.
    
    Unfortunately there's no mapping of volume to plugin ID available in the
    jobspec, so we don't have enough information to wait on plugins until we either
    get the volume from the server or retrieve the plugin ID from data we've
    persisted on the client.
    
    If we always require getting the volume from the server before making the claim,
    a client restart for disconnected clients will cause all the allocations that
    need CSI volumes to fail. Even while connected, checking in with the server to
    verify the volume's plugin before trying to make a claim RPC is inherently racy,
    so we'll leave that case as-is: the claim will fail if the node plugin needed
    to support a newly placed allocation is flapping such that the node fingerprint
    keeps changing.
    
    This changeset persists a minimal subset of data about the volume and its plugin
    in the client state DB, and retrieves that data during the CSI hook's prerun to
    avoid re-claiming and remounting the volume unnecessarily.
    
    This changeset also updates the RPC handler to use the external node ID from the
    claim whenever it is available.
    
    Fixes: #13028
    tgross committed Jul 7, 2023
    Commit: 7a4f0af