CSI: improve controller RPC reliability #17996

Merged: 2 commits, Jul 20, 2023

Commits on Jul 19, 2023

  1. CSI: serialize controller RPCs per plugin

    The CSI specification says that we "SHOULD" send no more than one in-flight
    request per *volume* at a time, with an allowance for losing state
    (e.g. leadership transitions), which the plugins "SHOULD" handle gracefully.
    We mostly succeed in serializing node and controller RPCs for the same
    volume, except when Nomad clients are lost.
    (See also container-storage-interface/spec#512)
    
    These concurrency requirements in the spec fall short because Storage Provider
    APIs aren't necessarily safe to call concurrently on the same host. For example,
    concurrently attaching AWS EBS volumes to an EC2 instance causes a race for
    device names, which leads to failed attachments and confusing results when
    releasing claims. So in practice many CSI plugins rely on k8s-specific sidecars
    to serialize storage provider API calls globally. As a result, we have to be
    much more conservative about concurrency in Nomad than the spec allows.
    
    This changeset includes two major changes to fix this:
    * Add a serializer method to the CSI volume RPC handler. When the
      RPC handler makes a destructive CSI Controller RPC, we send the RPC through
      this serializer so that only one RPC is sent at a time. Any other RPCs in
      flight block until it completes.
    * Ensure that requests go to the same controller plugin instance whenever
      possible by sorting the healthy plugin instances by client ID and picking
      the lowest.
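
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this changeset: the names `serializer`, `run`, `instance`, and `pickController` are hypothetical, and the sketch assumes one mutex per plugin ID (following the commit title) and plain string comparison of client IDs.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// serializer is a hypothetical helper. It keeps one mutex per plugin ID so
// that at most one destructive controller RPC per plugin runs at a time.
type serializer struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newSerializer() *serializer {
	return &serializer{locks: map[string]*sync.Mutex{}}
}

// run blocks until no other call for the same pluginID is in flight, then
// invokes fn and returns its error.
func (s *serializer) run(pluginID string, fn func() error) error {
	// Look up (or lazily create) the lock for this plugin.
	s.mu.Lock()
	l, ok := s.locks[pluginID]
	if !ok {
		l = &sync.Mutex{}
		s.locks[pluginID] = l
	}
	s.mu.Unlock()

	l.Lock()
	defer l.Unlock()
	return fn()
}

// instance is a hypothetical view of one controller plugin instance.
type instance struct {
	ClientID string
	Healthy  bool
}

// pickController returns the healthy instance with the lowest client ID, so
// repeated calls prefer the same instance for as long as it stays healthy.
func pickController(all []instance) (instance, bool) {
	healthy := make([]instance, 0, len(all))
	for _, in := range all {
		if in.Healthy {
			healthy = append(healthy, in)
		}
	}
	if len(healthy) == 0 {
		return instance{}, false
	}
	sort.Slice(healthy, func(i, j int) bool {
		return healthy[i].ClientID < healthy[j].ClientID
	})
	return healthy[0], true
}

func main() {
	s := newSerializer()
	var wg sync.WaitGroup
	inFlight, maxInFlight := 0, 0

	// Eight concurrent "RPCs" against the same (made-up) plugin ID.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = s.run("example-plugin", func() error {
				// These counters are only touched while holding the
				// per-plugin lock, so no extra synchronization is needed.
				inFlight++
				if inFlight > maxInFlight {
					maxInFlight = inFlight
				}
				inFlight--
				return nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("max RPCs in flight:", maxInFlight)

	target, ok := pickController([]instance{
		{ClientID: "c3", Healthy: true},
		{ClientID: "a1", Healthy: false},
		{ClientID: "b2", Healthy: true},
	})
	fmt.Println("controller:", target.ClientID, ok)
}
```

Picking a stable instance matters for the serializer: if every request goes to the same healthy controller, serializing on the server side also serializes the storage provider API calls that the plugin makes on that host.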
    
    Fixes: #15415
    tgross committed Jul 19, 2023
    Commit 299d897

Commits on Jul 20, 2023

  1. Commit 2ae8711