CSI error releasing volume claims #12346
Hi @ygersie! Nomad 1.3.0 will have some improvements in this behavior around the plugin counts.
Yeah, the problem is that we need to handle the case where the node comes back, but I suspect we're being too aggressive there. In any case, you should always be able to remove the volume claim from the Nomad state store by running `nomad volume detach`.
Hey @tgross, thanks for the very quick response. I was under the impression that the count issue was fixed in 1.2.6, but good to know that this should be addressed in 1.3.0. I see the problem with detecting whether a node actually has the volume detached, but is this something Nomad should take into consideration? Wouldn't the respective storage backend take care of that? The detach won't work as the node can't be found, see:
The only thing that works is to force a reregister.
Yes, but only if the host itself has been turned down. Consider a slightly silly example where I'm running on AWS EC2 and attach an EBS volume via CSI. Then I shut off Nomad and decide to reuse the EC2 host for something else entirely rather than terminating it. If Nomad sends a controller unpublish RPC before the node unpublish RPC succeeds, it's violating the CSI spec. But more importantly, if the node hasn't actually unmounted the volume via a node unpublish RPC, those controller unpublish RPCs will just fail anyway. (In the case of AWS EBS this presents as the volume being marked as "detaching" forever in the AWS API.) That being said, the tasks for me here:
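Separately from those tasks, here's a minimal sketch of the unpublish ordering described above; the interfaces are illustrative stand-ins, not Nomad's actual plugin API:

```go
package csi

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for the two CSI plugin RPCs; not Nomad's real types.
type NodePlugin interface {
	NodeUnpublishVolume(ctx context.Context, volID string) error
}

type ControllerPlugin interface {
	ControllerUnpublishVolume(ctx context.Context, volID, nodeID string) error
}

// unpublishVolume enforces the ordering the CSI spec requires: the node
// plugin must unmount the volume before the controller plugin detaches it
// from the host.
func unpublishVolume(ctx context.Context, node NodePlugin, ctrl ControllerPlugin, volID, nodeID string) error {
	if err := node.NodeUnpublishVolume(ctx, volID); err != nil {
		// Sending the controller unpublish before the node unpublish
		// succeeds would violate the spec, and in practice the detach
		// just fails anyway (e.g. EBS stuck in "detaching").
		return fmt.Errorf("node unpublish failed, not detaching: %w", err)
	}
	return ctrl.ControllerUnpublishVolume(ctx, volID, nodeID)
}
```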
So it looks like both of these cases are the same problem, and should already be fixed. If I take a look at the relevant code:

```go
if err != nil {
	// we should only get this error if the Nomad node disconnects and
	// is garbage-collected, so at this point we don't have any reason
	// to operate as though the volume is attached to it.
	if !errors.Is(err, structs.ErrUnknownNode) {
		return fmt.Errorf("could not detach from node: %w", err)
	}
}
```

But as it turns out, this is an RPC call and not a normal function call, which destroys the error wrapping information we have during serialization. So the RPC callee is returning this error, but it's getting mangled into a flat error string by the time it comes back to us in the RPC caller! I've got a fix working locally and I'll push it up for review shortly.
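For anyone following along, here's a minimal standalone demonstration of that failure mode (not Nomad's actual code; `ErrUnknownNode` here stands in for `structs.ErrUnknownNode`):

```go
package main

import (
	"errors"
	"fmt"
)

var ErrUnknownNode = errors.New("unknown node")

func main() {
	wrapped := fmt.Errorf("could not detach from node: %w", ErrUnknownNode)

	// Within one process the wrap chain is intact, so errors.Is matches.
	fmt.Println(errors.Is(wrapped, ErrUnknownNode)) // true

	// An RPC boundary serializes only the error string; reconstructing
	// the error on the caller side loses the wrap chain entirely.
	flattened := errors.New(wrapped.Error())
	fmt.Println(errors.Is(flattened, ErrUnknownNode)) // false
}
```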
Awesome, thanks a lot for the quick action! Very much looking forward to 1.3.0 :)
Hey @tgross, it seems I'm still having issues with this post-1.3.0 upgrade. I frequently have people report allocations getting stuck in pending status, where the Nomad client constantly reports:

```
2022-05-23T19:01:03.472Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.101.123.190:4647
2022-05-23T19:01:03.472Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.101.123.190:4647
```

Interestingly, that server isn't even the Nomad leader; not sure if that's an issue. I've also noticed that `nomad volume detach` fails while the node is still registered:

```
$ nomad volume detach ygersie[1] 0a44d722
Error detaching volume: Unexpected response code: 500 (rpc error: 1 error occurred:
	* could not detach from node: No path to node
)
```

When the node has been garbage collected we get:

```
$ nomad volume detach ygersie[1] 0a44d722
Error detaching volume: Unexpected response code: 500 (rpc error: missing external node ID: node lookup failed: index error: UUID must be 36 characters)
```

In this state the rescheduled allocation will remain in pending. Once Nomad has garbage collected the node it makes matters worse, as then I can't even see that the volume had an open claim; only the Nomad client logs will inform me of that. I'm still not sure what Nomad should do to make this recovery work automatically. It seems like there's an alpha feature introduced in the CSI spec containing volume health information. Besides the requirement to be able to solve this without a volume re-registration, at this point I wonder how hard it would be to make it configurable to just release a volume claim after <N> amount of time. That would help a lot.
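As a rough illustration of that suggestion, the logic could look something like the sketch below; the types and the `release` callback are hypothetical, not Nomad's actual state store:

```go
package csi

import "time"

// Claim is a hypothetical stand-in for a volume claim record.
type Claim struct {
	VolumeID string
	NodeID   string
	LostAt   time.Time // when the claiming node was marked lost
}

// reapStaleClaims releases claims held by nodes that have been lost for
// longer than a configurable threshold, so the volume can be claimed again
// instead of waiting on a node unpublish that can never run.
func reapStaleClaims(claims []Claim, threshold time.Duration, now time.Time, release func(Claim)) {
	for _, c := range claims {
		if now.Sub(c.LostAt) > threshold {
			release(c)
		}
	}
}
```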
That's not unexpected. Nodes always talk to one particular server (but with failover if the connection is lost), and those requests are forwarded to a leader as needed. That prevents us from requiring full mesh connectivity from every node to every server. Some requests from node to server allow for stale data, so that reduces traffic by quite a bit.
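A tiny sketch of that routing rule, with hypothetical names rather than Nomad's actual RPC layer:

```go
package rpc

import "errors"

// Request and Server are hypothetical; they only model the routing rule
// described above.
type Request struct {
	AllowStale bool // some node-to-server requests tolerate stale reads
}

type Server struct {
	isLeader bool
	local    func(Request) (any, error) // answer from this server's state
	forward  func(Request) (any, error) // forward to the current leader
}

func (s *Server) handle(req Request) (any, error) {
	// Followers can serve stale-tolerant reads directly, which is what
	// avoids full mesh connectivity from every node to every server.
	if s.isLeader || req.AllowStale {
		return s.local(req)
	}
	if s.forward == nil {
		return nil, errors.New("no route to leader")
	}
	return s.forward(req)
}
```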
Correct. There's no way for the server to send a node unpublish command to the node plugin that's running on a down node. What's becoming clear here is that the CSI specification as written cannot guarantee correct behavior in the case of downed nodes, so what K8s apparently does is just give up and let you potentially corrupt your data in the multi-writer volume use case. User expectations seem to agree, so we may need to rework the design here to make it possible to discard claims for lost nodes and just assume that they're never coming back. This is why we built

I'm going to work up a new issue covering this problem in a bit more detail so that we can get the work planned. In the meantime, you can reduce the risk of this kind of thing by draining nodes before shutting them down.
Ideally the CSI plugin would expose the volume's state so the CO can decide, based on authoritative data, whether a volume is claimed or not the moment a node goes down and the scheduler loses an allocation with a volume. Right now Nomad holds a claim on a volume which isn't attached to any node, and there's no way to fix it other than forcing a re-registration and kicking the associated allocation. That's the biggest problem at the moment: I have no predictable way to fix job allocations getting stuck, and combined with the placement failure that happens sporadically, I end up with a broken job that fails to get placed.
I understand why `nomad volume detach` won't work properly when the node is down, but would it be an option to have a
Need to reopen this temporarily for tracking. The issue in #12346 (comment) will get described in a new issue, but I don't want to lose track of this until I've done that.
Closed again in favor of #13264, which tracks the current work to be done.
Nomad version
1.2.6
Issue
After a sequence of instance refreshes where Nomad clients were replaced by a new image, the CSI volumes are in a completely unusable state. For some reason `nomad plugin status` shows unhealthy controllers, but the controller allocations report that the plugins are healthy. The Nomad server is throwing errors about not being able to release volume claims because:

This is happening to all volumes in the cluster. It looks like Nomad's volumes watcher does not detect disappeared (down and removed) nodes, and now it's stuck trying to detach volumes from non-existent nodes.
It seems the only way to work around this is by stopping all jobs, force-deregistering all volumes, and re-registering them.
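For anyone needing that workaround, the sequence is roughly (with placeholder names): stop the affected job with `nomad job stop <job>`, then `nomad volume deregister -force <volume-id>`, then `nomad volume register <volume-spec.hcl>`.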
Reproduction steps
Set up some jobs with volumes. Refresh the client nodes by shutting them down without draining them first.