CSI: reorder controller volume detachment #12387

Merged 2 commits on Mar 29, 2022

Commits on Mar 28, 2022

  1. fix log formatting in csi_hook

    tgross committed Mar 28, 2022
    91ead5d
  2. CSI: reorder controller volume detachment

    In #12112 and #12113 we fixed races in releasing volume claims, but
    there was a case that we missed. During a node drain with a controller
    attach/detach, we can hit a race where we call controller publish
    before the unpublish has completed. This is discouraged in the spec,
    but plugins are supposed to handle it safely. If the storage
    provider's API is slow enough and the plugin doesn't handle the case
    safely, the volume can get "locked" into a state where the provider's
    API won't detach it cleanly.
    
    Check the claim before making any external controller publish RPC
    calls so that Nomad is responsible for the canonical information about
    whether a volume is currently claimed.
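
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not Nomad's actual code: `Volume`, `Controller`, `recordingController`, `claimAndPublish`, and `ErrVolumeInUse` are all hypothetical names, and the real claim state lives in Nomad's state store rather than in a struct field.

```go
package main

import (
	"errors"
	"fmt"
)

// Volume is a hypothetical, simplified stand-in for a CSI volume record.
type Volume struct {
	ID          string
	ActiveClaim string // alloc ID holding the claim; "" when unclaimed
}

// ErrVolumeInUse is a hypothetical sentinel error for a still-claimed volume.
var ErrVolumeInUse = errors.New("volume is still claimed")

// Controller is a hypothetical wrapper around the external controller RPC.
type Controller interface {
	Publish(volID, nodeID string) error
}

// recordingController counts Publish calls so the ordering can be observed.
type recordingController struct{ calls int }

func (r *recordingController) Publish(volID, nodeID string) error {
	r.calls++
	return nil
}

// claimAndPublish consults the local claim state first and only makes the
// external controller publish call once the volume is known to be free.
func claimAndPublish(v *Volume, allocID, nodeID string, c Controller) error {
	if v.ActiveClaim != "" && v.ActiveClaim != allocID {
		// Reject before any external RPC: the previous claim has not
		// been released, so the unpublish may still be in flight.
		return fmt.Errorf("%w: held by %s", ErrVolumeInUse, v.ActiveClaim)
	}
	if err := c.Publish(v.ID, nodeID); err != nil {
		return err
	}
	v.ActiveClaim = allocID
	return nil
}

func main() {
	ctrl := &recordingController{}
	vol := &Volume{ID: "vol-1", ActiveClaim: "alloc-old"}

	// The old claim is still held, so no controller RPC is made.
	err := claimAndPublish(vol, "alloc-new", "node-b", ctrl)
	fmt.Println("first attempt:", err, "publish calls:", ctrl.calls)

	// Once the old claim is released, the publish goes through.
	vol.ActiveClaim = ""
	err = claimAndPublish(vol, "alloc-new", "node-b", ctrl)
	fmt.Println("second attempt:", err, "publish calls:", ctrl.calls)
}
```

The point of the sketch is only the ordering: the claim check is a local decision, so a slow provider API never sees a publish for a volume that is still claimed.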
    
    This has a couple of side effects that also had to be fixed here:
    
    * Changing the order means that the volume will have a past claim
      without a valid external node ID, because the claim came from the
      client. This uncovered a separate bug where we didn't assert that
      the external node ID was valid before returning it. Fall through to
      getting the ID from the plugins in the state store in this case. We
      originally avoided this because of concerns around plugins getting
      lost during node drain, but now that we've fixed that we may want
      to revisit it in future work.
    * We should make sure we're handling `FailedPrecondition` cases from
      the controller plugin the same way we handle other retryable cases.
    * Several tests had to be updated because they assumed a particular
      order of failures that no longer holds.
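
The first two side effects can be sketched as below. This is a hedged illustration under stated assumptions, not the code from this PR: `Claim`, `externalNodeID`, `Code`, and `isRetryable` are hypothetical names, the `pluginNodeIDs` map stands in for the plugin lookup in the state store, and the set of "other retryable cases" shown (`Unavailable`, `Aborted`) is assumed for the example. The `Code` type is a local stand-in for gRPC status codes so the snippet has no external dependencies.

```go
package main

import "fmt"

// Claim is a hypothetical, simplified past claim on a volume.
type Claim struct {
	NodeID         string // Nomad client node ID
	ExternalNodeID string // storage provider's node ID; may be empty
}

// externalNodeID returns the claim's external node ID when it is valid,
// and otherwise falls through to a lookup keyed by the Nomad node ID.
// pluginNodeIDs is a stand-in for the plugin data in the state store.
func externalNodeID(c Claim, pluginNodeIDs map[string]string) (string, error) {
	if c.ExternalNodeID != "" {
		return c.ExternalNodeID, nil
	}
	if id, ok := pluginNodeIDs[c.NodeID]; ok && id != "" {
		return id, nil
	}
	return "", fmt.Errorf("no external node ID known for node %q", c.NodeID)
}

// Code is a local, hypothetical stand-in for a gRPC status code.
type Code int

const (
	OK Code = iota
	Unavailable
	Aborted
	FailedPrecondition
	InvalidArgument
)

// isRetryable groups FailedPrecondition with the other codes that this
// sketch assumes are retried.
func isRetryable(code Code) bool {
	switch code {
	case Unavailable, Aborted, FailedPrecondition:
		return true
	default:
		return false
	}
}

func main() {
	ids := map[string]string{"node-a": "i-0abc"}

	// A claim written by the client carries no external node ID.
	id, err := externalNodeID(Claim{NodeID: "node-a"}, ids)
	fmt.Println(id, err)

	fmt.Println(isRetryable(FailedPrecondition), isRetryable(InvalidArgument))
}
```

Returning an error when neither source has a valid ID keeps an empty string from being passed to the controller plugin as if it were a real node ID.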
    tgross committed Mar 28, 2022
    2639770