
CSI error releasing volume claims #12346

Closed
ygersie opened this issue Mar 22, 2022 · 12 comments · Fixed by #12350

@ygersie
Contributor

ygersie commented Mar 22, 2022

Nomad version

1.2.6

Issue

After a sequence of instance refreshes in which Nomad clients were replaced by a new image, the CSI volumes are left in a completely unusable state. For some reason the nomad plugin status output shows unhealthy controllers, while the controller allocations report that the plugins are healthy. The Nomad server is throwing errors about not being able to release volume claims:

    2022-03-22T12:28:58.895Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=mynamespace volume_id=us-west-2c-myvolume[2]
  error=
  | 1 error occurred:
  | 	* could not detach from node: Unknown node 9ef129e1-16bd-5590-85f1-2f5e5c7fd7ab
  |

This is happening to all volumes in the cluster. It looks like Nomad's volumes watcher does not detect disappeared (down and removed) nodes, and now it's stuck trying to detach volumes from non-existent nodes.

It seems the only way to work around this is to stop all jobs, force-deregister all volumes, and re-register them.

Reproduction steps

Set up some jobs with volumes. Refresh the client nodes by shutting them down without draining them first.

@tgross tgross self-assigned this Mar 22, 2022
@tgross
Member

tgross commented Mar 22, 2022

Hi @ygersie! Nomad 1.3.0 will have some improvements in this behavior around the plugin counts.

It looks like Nomad's volumes watcher does not detect disappeared (down and removed) nodes, and now it's stuck trying to detach volumes from non-existent nodes.

Yeah, the problem is that we need to handle the case where the node comes back, but I suspect we're being too aggressive there with nil nodes, which we should probably assume will not come back. The tricky bit is that just because a node is gone doesn't mean the volume is physically detached from that host. You can shut down a Nomad client and GC it without ever unmounting the volume, and that would make all future claims for the volume fail. Fortunately Nomad 1.3.0 will also come with some improvements that'll make that case safer for us to handle as well.

In any case, you should always be able to remove the volume claim from the Nomad state store by running nomad volume detach :volume_id :node_id. If that's not the case, that's a bug to fix for sure.

@ygersie
Contributor Author

ygersie commented Mar 22, 2022

Hey @tgross, thanks for the very quick response. I was under the impression that the count issue was fixed in 1.2.6, but good to know that this should be addressed in 1.3.0. I see the problem with detecting whether a node actually has the volume detached, but is this something Nomad should take into consideration? Wouldn't the respective storage backend take care of that? The detach won't work since the node can't be found, see:

$ nomad volume detach -namespace mynamespace myvolume 3e5cb67e-dcaa-a067-ad82-2e15fd0b0298
Error detaching volume: Unexpected response code: 500 (rpc error: 2 errors occurred:
	* could not detach from node: Unknown node 3e5cb67e-dcaa-a067-ad82-2e15fd0b0298
	* could not detach from node: Unknown node 3e5cb67e-dcaa-a067-ad82-2e15fd0b0298

)

The only thing that works is to force a re-register.

@tgross
Member

tgross commented Mar 22, 2022

I see the problem with detecting whether a node actually has the volume detached, but is this something Nomad should take into consideration? Wouldn't the respective storage backend take care of that?

Yes, but only if the host itself has been turned down. Consider a slightly silly example where I'm running on AWS EC2 and attach an EBS volume via CSI. Then I shut off Nomad and decide to reuse the EC2 host for something else entirely rather than terminating it. If Nomad sends a controller unpublish RPC before the node unpublish RPC succeeds, it's violating the CSI spec. But more importantly, if the node hasn't actually unmounted the volume via a node unpublish RPC, those controller unpublish RPCs will just fail anyway. (In the case of AWS EBS this presents as the volume being marked as "detaching" forever in the AWS API.)
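
To make that ordering constraint concrete, here's a minimal Go sketch. The interface and type names are hypothetical stand-ins, not the real Nomad or CSI client code; the point is only that the controller unpublish is never attempted until the node unpublish has succeeded.

	// Hypothetical sketch: enforce "node unpublish before controller unpublish".
	package main

	import (
		"errors"
		"fmt"
	)

	// nodePlugin and controllerPlugin stand in for the CSI node and controller RPCs.
	type nodePlugin interface {
		NodeUnpublish(volID, nodeID string) error
	}

	type controllerPlugin interface {
		ControllerUnpublish(volID, nodeID string) error
	}

	// releaseClaim only sends the controller unpublish once the node-side unmount
	// has succeeded; skipping that step is what leaves an EBS volume stuck in the
	// "detaching" state described above.
	func releaseClaim(n nodePlugin, c controllerPlugin, volID, nodeID string) error {
		if err := n.NodeUnpublish(volID, nodeID); err != nil {
			return fmt.Errorf("node unpublish must succeed first: %w", err)
		}
		return c.ControllerUnpublish(volID, nodeID)
	}

	// stillMountedNode simulates a host where the volume was never unmounted.
	type stillMountedNode struct{}

	func (stillMountedNode) NodeUnpublish(volID, nodeID string) error {
		return errors.New("volume still mounted")
	}

	type noopController struct{}

	func (noopController) ControllerUnpublish(volID, nodeID string) error { return nil }

	func main() {
		err := releaseClaim(stillMountedNode{}, noopController{}, "myvolume", "9ef129e1")
		fmt.Println(err) // node unpublish must succeed first: volume still mounted
	}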

That being said, the tasks for me here:

  • Let the nil node case get handled as if the node unpublish has succeeded. In the weird edge case I've described above, the controller unpublish will fail or (if the plugin has no controller) the next claim will fail. In most other cases, we'll go ahead and detach the volume and release the claim.
  • The nomad volume detach should have worked here so that's just a plain ol' bug and I'll fix that ASAP.

@tgross
Member

tgross commented Mar 22, 2022

So it looks like both of these cases are the same problem, one that should already be handled. Looking at (*CSIVolume) nodeUnpublishVolumeImpl, where we're getting this error, we're specifically supposed to be handling this case, right?

	if err != nil {
		// we should only get this error if the Nomad node disconnects and
		// is garbage-collected, so at this point we don't have any reason
		// to operate as though the volume is attached to it.
		if !errors.Is(err, structs.ErrUnknownNode) {
			return fmt.Errorf("could not detach from node: %w", err)
		}
	}

But as it turns out, this is an RPC call and not a normal function call, and serialization destroys the error-wrapping information we rely on. So the RPC callee is returning this error, but it's been mangled into a flat error string by the time it gets back to us in the RPC caller! I've got a fix working locally and I'll push it up for review shortly.
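
To make the failure mode concrete, here's a small standalone Go sketch (not the actual Nomad code or the patch): a sentinel error wrapped with %w matches errors.Is in-process, but once it has been flattened to a plain string across an RPC boundary the wrap chain is gone and the caller has to fall back to something like string matching.

	// Standalone illustration of why errors.Is stops working across an RPC boundary.
	package main

	import (
		"errors"
		"fmt"
		"strings"
	)

	var ErrUnknownNode = errors.New("Unknown node")

	// simulateRPC mimics what serialization does to a wrapped error:
	// only the flat message survives, the wrap chain does not.
	func simulateRPC(err error) error {
		return errors.New(err.Error())
	}

	func main() {
		wrapped := fmt.Errorf("could not detach from node: %w", ErrUnknownNode)

		// In-process, the sentinel check works as expected.
		fmt.Println(errors.Is(wrapped, ErrUnknownNode)) // true

		// After the "RPC" the chain is gone, so errors.Is no longer matches...
		remote := simulateRPC(wrapped)
		fmt.Println(errors.Is(remote, ErrUnknownNode)) // false

		// ...and a string-based check is the pragmatic fallback.
		fmt.Println(strings.Contains(remote.Error(), ErrUnknownNode.Error())) // true
	}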

@ygersie
Contributor Author

ygersie commented Mar 22, 2022

Awesome, thanks a lot for the quick action! Very much looking forward to 1.3.0 :)

@ygersie
Contributor Author

ygersie commented May 23, 2022

Hey @tgross, it seems I'm still having issues with this after the 1.3.0 upgrade. I frequently have people reporting allocations getting stuck in pending status, where the Nomad client constantly reports:

    2022-05-23T19:01:03.472Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.101.123.190:4647
    2022-05-23T19:01:03.472Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: volume max claims reached" rpc=CSIVolume.Claim server=10.101.123.190:4647

Interestingly, that server isn't even the Nomad leader; not sure if that's an issue.

I've also noticed that the nomad volume detach doesn't work with a down node. To reproduce, get an example job running on a node and shut down the instance. The allocation will be "lost" because the node has been marked down. The volume claim will not be released automatically, at least not for a long time. The nomad volume status output initially shows the allocation that held the claim, so I can at least see which node it was running on (until the next GC, which removes that info). But then trying a nomad volume detach leads to:

$ nomad volume detach ygersie[1] 0a44d722
Error detaching volume: Unexpected response code: 500 (rpc error: 1 error occurred:
	* could not detach from node: No path to node

)

When the node has been garbage-collected we get:

$ nomad volume detach ygersie[1] 0a44d722
Error detaching volume: Unexpected response code: 500 (rpc error: missing external node ID: node lookup failed: index error: UUID must be 36 characters)

In this state the rescheduled allocation will remain in pending state because there's still a claim associated with the lost node. I can't perform a detach, so the only thing I can do is force a volume deregister, which is far from ideal in our Terraform-based setup. After that, performing a nomad alloc restart <alloc> will make the allocation recover.

Once Nomad has garbage-collected the node it gets worse, as then I can't even see that the volume had an open claim; only the Nomad client logs will tell me that.

I'm still not sure what Nomad should do to make this recovery work automatically. It seems there's an alpha feature in the CSI spec, ControllerGetVolumeResponse, which may help in the future, but it's not implemented in the EBS plugin yet.

Besides the requirement to be able to resolve this without a volume re-registration, at this point I wonder how hard it would be to make it configurable to just release a volume claim after <N> amount of time. That would help a lot.
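
Purely as an illustration of that idea (this is not an existing Nomad option or Nomad code, just a hypothetical sketch), a time-based release could boil down to a check like the following:

	// Hypothetical sketch of a configurable, time-based claim release.
	package main

	import (
		"fmt"
		"time"
	)

	// claim records which node holds the claim and when that node was last seen healthy.
	type claim struct {
		VolumeID     string
		NodeID       string
		NodeLastSeen time.Time
	}

	// shouldForceRelease returns true once the claiming node has been unreachable
	// for longer than the configured grace period.
	func shouldForceRelease(c claim, grace time.Duration, now time.Time) bool {
		return now.Sub(c.NodeLastSeen) > grace
	}

	func main() {
		c := claim{
			VolumeID:     "ygersie[1]",
			NodeID:       "0a44d722",
			NodeLastSeen: time.Now().Add(-2 * time.Hour),
		}
		if shouldForceRelease(c, time.Hour, time.Now()) {
			fmt.Printf("releasing stale claim on %s held by node %s\n", c.VolumeID, c.NodeID)
		}
	}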

@tgross
Member

tgross commented May 24, 2022

Interestingly, that server isn't even the Nomad leader; not sure if that's an issue.

That's not unexpected. Nodes always talk to one particular server (but with failover if the connection is lost), and those requests are forwarded to a leader as needed. That prevents us from requiring full mesh connectivity from every node to every server. Some requests from node to server allow for stale data, so that reduces traffic by quite a bit.

I've also noticed that the nomad volume detach doesn't work with a down node.

Correct. There's no way for the server to send a node unpublish command to the node plugin that's running on a down node. What's becoming clear here is that the CSI specification as written cannot guarantee correct behavior in the case of downed nodes, so what K8s apparently does is just give up and let you potentially corrupt your data in the multi-writer volume use case. User expectations seem to agree, so we may need to rework the design here to make it possible to discard claims for lost nodes and just assume they're never coming back. This is why we built stop_after_client_disconnect, but maybe we'll need to enforce that setting for CSI-claiming tasks unless it's specifically disabled by the user.

I'm going to work up a new issue covering this problem in a bit more detail so that we can get the work planned. In the meantime, you can reduce the risk of this kind of thing by draining nodes before shutting them down.

@ygersie
Contributor Author

ygersie commented May 24, 2022

Ideally the CSI plugin would expose the volume's state so the CO could decide, based on authoritative data, whether a volume is claimed the moment a node goes down and the scheduler loses an allocation with a volume. Right now Nomad holds a claim on a volume that isn't attached to any node, and there's no way to fix it other than forcing a re-registration and kicking the associated allocation. That's the biggest problem at the moment: I have no predictable way to fix job allocations getting stuck, and combined with the placement failure that happens sporadically, I end up with a broken job that fails to get placed.

@ygersie
Contributor Author

ygersie commented May 24, 2022

I understand why the nomad volume detach won't work properly when the node is down, but would it be possible to have a -force option or some other command to force Nomad to release a claim?

@tgross
Member

tgross commented Jun 6, 2022

Need to reopen this temporarily for tracking. The issue in #12346 (comment) will get described in a new issue but I don't want to lose track of this until I've done that.

@tgross
Member

tgross commented Jun 6, 2022

Closed again in favor of #13264, which tracks the current work to be done.

@tgross tgross closed this as completed Jun 6, 2022
@github-actions

github-actions bot commented Oct 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 7, 2022