
Volume is not released during node drain, causing job to fail forever #8232

Closed

mishel170 opened this issue Jun 20, 2020 · 10 comments

Hey,
We're using Nomad version 0.11.3 with agents that have an AWS EBS volume attached. While draining a node, the volume sometimes (very often, in fact) cannot be released from the EC2 instance, causing the following error during rescheduling:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: volume "vol-0e5f3d9ecd9a" is already published at node "i-0cf3069676ehgy5i2" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

Operating system and Environment details

Amazon Linux 2 AMI 2.0.20190313 x86_64 HVM gp2

Best regards,
Mishel Liberman

tgross (Member) commented Jun 22, 2020

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

tgross (Member) commented Jun 22, 2020

Possibly related: #8080

mishel170 (Author)

Possibly related: #8080

Yes, it seems to be similar, although I haven't seen any race condition between attach and detach.
Attaching the controller's logs:

  
I0622 19:00:58.428362       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0622 19:01:01.540337       1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0622 19:01:03.453981       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:27.901009       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-01861d5a3a944f4ca VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:32.091877       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:01:32.091900       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:01:33.455003       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:03.455835       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:32.119677       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-05faceeb37923be3a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:33.205197       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:02:33.205225       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:02:33.456683       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:03.457589       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:33.458441       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:04:03.459431       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

They run as system jobs, and it happens when the controller runs on the same node.

Thanks for helping

tgross (Member) commented Jun 23, 2020

@mishel170 I'm still investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster, so that a draining node always has a controller to talk to.
  • Run the node plugins as system jobs, as you're doing.
  • Use the -ignore-system flag when draining the node so that the node plugin stays on the host even after the drain is done, which avoids the race with the volumewatcher (see the sketch below).
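For reference, here's a minimal sketch of that kind of drain using the Nomad Go API client (the CLI equivalent being nomad node drain -enable -ignore-system <node-id>). The node ID is a placeholder, and the exact api.DrainSpec / UpdateDrain shape should be checked against your client version:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Equivalent to: nomad node drain -enable -ignore-system <node-id>
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	nodeID := "REPLACE-WITH-NODE-ID" // placeholder

	// IgnoreSystemJobs keeps system-job allocations (including the CSI node
	// plugin) running on the host while the other allocations drain.
	spec := &api.DrainSpec{
		Deadline:         30 * time.Minute,
		IgnoreSystemJobs: true,
	}

	// markEligible is false because we're enabling a drain, not clearing one.
	if _, err := client.Nodes().UpdateDrain(nodeID, spec, false, nil); err != nil {
		log.Fatal(err)
	}
	log.Printf("drain with IgnoreSystemJobs enabled for node %s", nodeID)
}
```

Either way, the point is that the node plugin's system-job allocation stays on the host for the duration of the drain.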

mishel170 (Author)

thanks @tgross

Unfortunately, running the controller as a system job didn't help.
As a workaround, I'm manually detaching the volume on each node failover until the problem is fixed.

Looking forward to an update,
thanks.

tgross (Member) commented Jul 28, 2020

It looks like there are a couple of different problems in the node drain scenario:

  • The entire NodeUnpublish, NodeUnstage, ControllerUnpublish workflow runs async to the draining (or termination) of a client node (in volumewatcher). So even if we have the node plugin running as a system job, the node plugin will be stopped as soon as the volume-claiming allocation stops, and we won't be able to send it detach RPCs.
  • Also, the controller detach RPC needs the external node ID, so we look that up from the node plugin's fingerprint. Unfortunately, this is both racy with respect to the previous step and prevents us from performing a cleanup controller detachment after the fact.
  • Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. This will always result in an error and a restart of the new allocation, even if we fix everything else. Note that for the typical case where a volume can only be claimed by a single allocation, migrating the application to a new node will always create an outage for the application, no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.

What I'm attempting to do to fix this:

  • in the client's csi_hook.go Postrun step:
    • trigger the NodeUnpublish and NodeUnstage steps locally (these need to happen on the server as well),
    • then send an RPC back to the server that triggers a ControllerUnpublish. This will be a synchronous call so that we can block the node drain until it's complete (a rough sketch of this flow follows the list).
  • update our CSIVolumeClaim struct to include the external node ID, so that we don't have to rely on the node plugin to be running to execute ControllerUnpublish.
  • This will need to include a few modifications to volumewatcher to change how we handle claim changes as well.
  • We'll still need the volume GC job to clean up in cases where the node is just gone and never coming back. In that scenario, we'll need to be able to gracefully handle errors from the missing node plugins and the controller if the underlying host is gone as well.
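To make the intended ordering concrete, here's a purely illustrative Go sketch of that Postrun flow. None of these type or method names are Nomad's actual internals (they're invented for this comment); the point is only that unpublish/unstage happen locally while the node plugin is still around, and that the controller detach is a blocking call back to the servers, with the external node ID carried in the claim itself.

```go
// Package csidrain is a hypothetical sketch, not Nomad source code.
package csidrain

import "errors"

// volumeClaim is a stand-in for the CSIVolumeClaim struct described above;
// the external node ID is stored in the claim so the servers can detach even
// if the node plugin is gone later.
type volumeClaim struct {
	VolumeID       string
	AllocID        string
	ExternalNodeID string
}

// nodePluginClient abstracts the local CSI node plugin RPCs.
type nodePluginClient interface {
	NodeUnpublishVolume(volID, allocID string) error
	NodeUnstageVolume(volID string) error
}

// serverRPC abstracts the synchronous RPC back to the Nomad servers, which
// releases the claim and runs ControllerUnpublish before returning.
type serverRPC interface {
	ControllerUnpublish(claim volumeClaim) error
}

type csiHook struct {
	claims     []volumeClaim
	nodePlugin nodePluginClient
	servers    serverRPC
}

// Postrun runs when the allocation stops (e.g. during a node drain), while
// the node plugin is still guaranteed to be present on this host.
func (h *csiHook) Postrun() error {
	var errs []error
	for _, c := range h.claims {
		// 1. Unpublish and unstage locally first.
		if err := h.nodePlugin.NodeUnpublishVolume(c.VolumeID, c.AllocID); err != nil {
			errs = append(errs, err)
			continue
		}
		if err := h.nodePlugin.NodeUnstageVolume(c.VolumeID); err != nil {
			errs = append(errs, err)
			continue
		}
		// 2. Block on the server-side controller detach so the drain can't
		//    outrun the volumewatcher.
		if err := h.servers.ControllerUnpublish(c); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

The key design point is that the controller detach blocks the drain rather than being left entirely to the async volumewatcher.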

tgross (Member) commented Aug 7, 2020

I wanted to give a little status update on where this is. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:

I've tested this out and the results are... better? Not as good as I'd like yet. As explained in #8232 (comment), the first replacement allocation that gets placed during a node drain fails and we can't do much about that at the moment. That error looks like this:

2020-08-07T14:55:35-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: volume "vol-07b2ea432ef9a8dce" is already published at node "i-0685ed47dc21535db" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

When this happens, we're still in the process of detaching the volume from the other node. The next allocation should then be able to place successfully, but instead we're getting an error at first, which recovers after several attempts. From the logs I suspect the error occurs because the kernel hasn't picked up that the storage provider attached the device yet (or, in this case, the EC2 API hasn't made the device available yet). Because there's only a single claim on the volume, we're going through the whole controller detach cycle with each attempt. I'll be investigating that on Monday to see if we can improve the situation at all.

In the example below, the drained node is 6f643c28:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3f8119ed  03e77a2f  group       0        run      running   1m7s ago   52s ago
078de3a2  03e77a2f  group       0        stop     failed    3m7s ago   1m7s ago
b5edab42  03e77a2f  group       0        stop     failed    4m7s ago   3m7s ago
3938e036  03e77a2f  group       0        stop     failed    4m43s ago  4m7s ago
8745a66d  6f643c28  group       0        stop     complete  5m39s ago  4m24s ago

Allocations b5edab42 and 078de3a2 get the following error at the client from the NodeStageVolume RPC to the plugin:

2020-08-07T14:56:05-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

Logs from the node plugin for the first failed allocation 3938e036:

I0807 18:57:05.467988       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:57:05.468081       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:57:05.468105       1 node.go:145] donedone
E0807 18:57:05.468113       1 driver.go:109] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided

vs the successful run for allocation 3f8119ed:

I0807 18:59:09.316753       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.316876       1 node.go:158] NodeStageVolume: find device path /dev/xvdba -> /dev/xvdba
I0807 18:59:09.317095       1 node.go:193] NodeStageVolume: formatting /dev/xvdba and mounting at /csi/staging/ebs-vol0/rw-file-system-single-node-writer with fstype ext4
I0807 18:59:09.317109       1 mount_linux.go:274] Checking for issues with fsck on disk: /dev/xvdba
I0807 18:59:09.340800       1 mount_linux.go:293] Attempting to mount disk: ext4 /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.340836       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o defaults /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.349363       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:59:09.349465       1 node.go:145] donedone
I0807 18:59:09.351076       1 node.go:279] NodePublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer TargetPath:/csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.351129       1 node.go:440] NodePublishVolume: creating dir /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.351282       1 node.go:450] NodePublishVolume: mounting /csi/staging/ebs-vol0/rw-file-system-single-node-writer at /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer with option [bind] as fstype ext4
I0807 18:59:09.351299       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.353033       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind,remount /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])

There are no other errors at the client. The controller logs all seem fine.

controller logs a634b3c5

initial failed attempt:

I0807 18:55:35.405415       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
E0807 18:55:35.405445       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0807 18:55:47.297624       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a
I0807 18:55:47.304175       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a

successful attempt from controller:

I0807 18:59:05.493856       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-0685ed47dc21535db VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:06.179736       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-0685ed47dc21535db" request returned {
  AttachTime: 2020-08-07 18:59:06.142 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-0685ed47dc21535db",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:59:09.310673       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:59:09.310710       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-0685ed47dc21535db through device /dev/xvdba

controller logs 7c4c4267

successful attempt:

I0807 18:54:33.408268       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-01c7526848c0ba28a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:54:33.942156       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-01c7526848c0ba28a" request returned {
  AttachTime: 2020-08-07 18:54:33.904 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-01c7526848c0ba28a",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:54:40.425670       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:54:40.425698       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-01c7526848c0ba28a through device /dev/xvdba

tgross (Member) commented Aug 7, 2020

I've also opened #8609 to cover the issue that we're not going to be able to fix in the 0.12.x timeframe.

tgross (Member) commented Aug 11, 2020

Ok, one last item that isn't going to make it into the 0.12.2 release but can be worked around: if you have HA controllers, you need to make sure they don't land on the same client during node drains (#8628).
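As an aside, a quick way to spot-check that placement is to list the controller job's allocations via the Nomad Go API client and flag any that share a node. The job name below ("plugin-aws-ebs-controller") is just a placeholder; substitute whatever your controller job is called.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "plugin-aws-ebs-controller" is a placeholder job name.
	allocs, _, err := client.Jobs().Allocations("plugin-aws-ebs-controller", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	seen := map[string]string{} // node ID -> alloc ID
	for _, a := range allocs {
		if a.ClientStatus != "running" {
			continue
		}
		if other, ok := seen[a.NodeID]; ok {
			fmt.Printf("controller allocs %s and %s share node %s\n", other, a.ID, a.NodeID)
		}
		seen[a.NodeID] = a.ID
	}
}
```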

But otherwise testing is showing that the work that's landed for 0.12.2 should close this out. That will be released shortly.

tgross closed this as completed Aug 11, 2020
github-actions bot commented Nov 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022