
Volume is not released during node drain, causing job to fail forever #8232

Closed

mishel170 opened this issue Jun 20, 2020 · 10 comments

Hey,
We're using Nomad version 0.11.3 with agents that have an AWS EBS volume attached. While draining a node, the volume sometimes (very often, in fact) cannot be released from the EC2 instance, causing the following error during rescheduling:

failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: volume "vol-0e5f3d9ecd9a" is already published at node "i-0cf3069676ehgy5i2" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

Operating system and Environment details

Amazon Linux 2 AMI 2.0.20190313 x86_64 HVM gp2

Best regards,
Mishel Liberman

tgross (Member) commented Jun 22, 2020

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

tgross (Member) commented Jun 22, 2020

Possibly related: #8080

mishel170 (Author)

Possibly related: #8080

Yes, it seems to be similar, although I haven't seen any race condition between attach and detach.
Attaching the controller's logs:

  
I0622 19:00:58.428362       1 driver.go:62] Driver: ebs.csi.aws.com Version: v0.6.0-dirty
I0622 19:01:01.540337       1 driver.go:132] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0622 19:01:03.453981       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:27.901009       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-01861d5a3a944f4ca VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:01:32.091877       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:01:32.091900       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:01:33.455003       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:03.455835       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:32.119677       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-0e5f3d9ecd9abf8ec NodeId:i-05faceeb37923be3a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:02:33.205197       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-0e5f3d9ecd9abf8ec
E0622 19:02:33.205225       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0622 19:02:33.456683       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:03.457589       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:03:33.458441       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0622 19:04:03.459431       1 controller.go:298] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Hi @mishel170! A couple questions so that I can pin down the behavior:

  • Are the node plugins being run as system jobs?
  • Is this happening in cases where the controller plugin was on the drained node?

They run as system jobs, and it happens when the controller runs on the same node.

Thanks for helping

tgross (Member) commented Jun 23, 2020

@mishel170 I'm still investigating how to fix this, but in the meantime I do want to give you what I think will work as a workaround for this class of problem.

  • Run multiple controller plugins, spread across the cluster, so that a draining node always has a controller to talk to.
  • Run the node plugins as system jobs, as you're doing.
  • Use the -ignore-system flag when draining the node so that the node plugin stays on the host even after the drain is done, which avoids the race with the volumewatcher (see the sketch below).
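For reference, here's a minimal sketch of that kind of drain using the Nomad Go API client (the CLI equivalent being nomad node drain -enable -ignore-system <node-id>). The node ID is a placeholder, and the exact api.DrainSpec / UpdateDrain shape should be checked against your client version:

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Equivalent to: nomad node drain -enable -ignore-system <node-id>
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	nodeID := "REPLACE-WITH-NODE-ID" // placeholder

	// IgnoreSystemJobs keeps system-job allocations (including the CSI node
	// plugin) running on the host while the other allocations drain.
	spec := &api.DrainSpec{
		Deadline:         30 * time.Minute,
		IgnoreSystemJobs: true,
	}

	// markEligible is false because we're enabling a drain, not clearing one.
	if _, err := client.Nodes().UpdateDrain(nodeID, spec, false, nil); err != nil {
		log.Fatal(err)
	}
	log.Printf("drain with IgnoreSystemJobs enabled for node %s", nodeID)
}
```

Either way, the point is that the node plugin's system-job allocation stays on the host for the duration of the drain.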

mishel170 (Author)

thanks @tgross

Unfortunately, running the controller as a system job didn't help.
As a workaround, I'm manually detaching the volume on each node failover until the problem is fixed.

Looking forward to an update,
thanks.

tgross (Member) commented Jul 28, 2020

It looks like there are a couple of different problems in the node drain scenario:

  • The entire NodeUnpublish, NodeUnstage, ControllerUnpublish workflow runs async to the draining (or termination) of a client node (in volumewatcher). So even if we have the node plugin running as a system job, the node plugin will be stopped as soon as the volume-claiming allocation stops, and we won't be able to send it detach RPCs.
  • Also, the controller detach RPC needs the external node ID, so we look that up from the node plugin's fingerprint. Unfortunately, this is both racy with respect to the previous step and prevents us from performing a cleanup controller detachment after the fact.
  • Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. This will always result in an error and a restart of the new allocation, even if we fix everything else. Note that for the typical case where a volume can only be claimed by a single allocation, migrating the application to a new node will always create an outage for the application, no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.

What I'm attempting to do to fix this:

  • in the client's csi_hook.go Postrun step:
    • trigger the NodeUnpublish and NodeUnstage steps locally (these need to happen on the server as well),
    • then send an RPC back to the server that triggers a ControllerUnpublish. This will be a synchronous call so that we can block the node drain until it's complete (a rough sketch of this flow follows the list).
  • update our CSIVolumeClaim struct to include the external node ID, so that we don't have to rely on the node plugin to be running to execute ControllerUnpublish.
  • This will need to include a few modifications to volumewatcher to change how we handle claim changes as well.
  • We'll still need the volume GC job to clean up in cases where the node is just gone and never coming back. In that scenario, we'll need to be able to gracefully handle errors from the missing node plugins and the controller if the underlying host is gone as well.
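To make the intended ordering concrete, here's a purely illustrative Go sketch of that Postrun flow. None of these type or method names are Nomad's actual internals (they're invented for this comment); the point is only that unpublish/unstage happen locally while the node plugin is still around, and that the controller detach is a blocking call back to the servers, with the external node ID carried in the claim itself.

```go
// Package csidrain is a hypothetical sketch, not Nomad source code.
package csidrain

import "errors"

// volumeClaim is a stand-in for the CSIVolumeClaim struct described above;
// the external node ID is stored in the claim so the servers can detach even
// if the node plugin is gone later.
type volumeClaim struct {
	VolumeID       string
	AllocID        string
	ExternalNodeID string
}

// nodePluginClient abstracts the local CSI node plugin RPCs.
type nodePluginClient interface {
	NodeUnpublishVolume(volID, allocID string) error
	NodeUnstageVolume(volID string) error
}

// serverRPC abstracts the synchronous RPC back to the Nomad servers, which
// releases the claim and runs ControllerUnpublish before returning.
type serverRPC interface {
	ControllerUnpublish(claim volumeClaim) error
}

type csiHook struct {
	claims     []volumeClaim
	nodePlugin nodePluginClient
	servers    serverRPC
}

// Postrun runs when the allocation stops (e.g. during a node drain), while
// the node plugin is still guaranteed to be present on this host.
func (h *csiHook) Postrun() error {
	var errs []error
	for _, c := range h.claims {
		// 1. Unpublish and unstage locally first.
		if err := h.nodePlugin.NodeUnpublishVolume(c.VolumeID, c.AllocID); err != nil {
			errs = append(errs, err)
			continue
		}
		if err := h.nodePlugin.NodeUnstageVolume(c.VolumeID); err != nil {
			errs = append(errs, err)
			continue
		}
		// 2. Block on the server-side controller detach so the drain can't
		//    outrun the volumewatcher.
		if err := h.servers.ControllerUnpublish(c); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```

The key design point is that the controller detach blocks the drain rather than being left entirely to the async volumewatcher.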

tgross (Member) commented Aug 7, 2020

I wanted to give a little status update on where this is. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release:

I've tested this out and the results are... better? Not as good as I'd like yet. As explained in #8232 (comment), the first replacement allocation that gets placed during a node drain fails and we can't do much about that at the moment. That error looks like this:

2020-08-07T14:55:35-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: volume "vol-07b2ea432ef9a8dce" is already published at node "i-0685ed47dc21535db" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

When this happens, we're still in the process of detaching the volume from the other node. The next allocation should then be able to place successfully, but instead we're getting an error at first, which recovers after several attempts. From the logs I suspect the error occurs because the kernel hasn't picked up that the storage provider attached the device yet (or, in this case, the EC2 API hasn't made the device available yet). Because there's only a single claim on the volume, we're going through the whole controller detach cycle with each attempt. I'll be investigating that on Monday to see if we can improve the situation at all.

In the example below, the drained node is 6f643c28:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3f8119ed  03e77a2f  group       0        run      running   1m7s ago   52s ago
078de3a2  03e77a2f  group       0        stop     failed    3m7s ago   1m7s ago
b5edab42  03e77a2f  group       0        stop     failed    4m7s ago   3m7s ago
3938e036  03e77a2f  group       0        stop     failed    4m43s ago  4m7s ago
8745a66d  6f643c28  group       0        stop     complete  5m39s ago  4m24s ago

Allocations b5edab42 and 078de3a2 get the following error at the client from the NodeStageVolume RPC to the plugin:

2020-08-07T14:56:05-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

Logs from the node plugin for the first failed allocation 3938e036:

I0807 18:57:05.467988       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:57:05.468081       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:57:05.468105       1 node.go:145] donedone
E0807 18:57:05.468113       1 driver.go:109] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided

vs the successful run for allocation 3f8119ed:

I0807 18:59:09.316753       1 node.go:94] NodeStageVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.316876       1 node.go:158] NodeStageVolume: find device path /dev/xvdba -> /dev/xvdba
I0807 18:59:09.317095       1 node.go:193] NodeStageVolume: formatting /dev/xvdba and mounting at /csi/staging/ebs-vol0/rw-file-system-single-node-writer with fstype ext4
I0807 18:59:09.317109       1 mount_linux.go:274] Checking for issues with fsck on disk: /dev/xvdba
I0807 18:59:09.340800       1 mount_linux.go:293] Attempting to mount disk: ext4 /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.340836       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o defaults /dev/xvdba /csi/staging/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.349363       1 node.go:143] NodeStageVolume: volume="vol-07b2ea432ef9a8dce" operation finished
I0807 18:59:09.349465       1 node.go:145] donedone
I0807 18:59:09.351076       1 node.go:279] NodePublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/csi/staging/ebs-vol0/rw-file-system-single-node-writer TargetPath:/csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:09.351129       1 node.go:440] NodePublishVolume: creating dir /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer
I0807 18:59:09.351282       1 node.go:450] NodePublishVolume: mounting /csi/staging/ebs-vol0/rw-file-system-single-node-writer at /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer with option [bind] as fstype ext4
I0807 18:59:09.351299       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])
I0807 18:59:09.353033       1 mount_linux.go:135] Mounting cmd (mount) with arguments ([-t ext4 -o bind,remount /csi/staging/ebs-vol0/rw-file-system-single-node-writer /csi/per-alloc/3f8119ed-2c92-d46d-54bf-ee8da6042eec/ebs-vol0/rw-file-system-single-node-writer])

There are no other errors at the client. The controller logs all seem fine.

controller logs a634b3c5

initial failed attempt:

I0807 18:55:35.405415       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
E0807 18:55:35.405445       1 driver.go:109] GRPC error: rpc error: code = AlreadyExists desc = Resource already exists
I0807 18:55:47.297624       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a
I0807 18:55:47.304175       1 controller.go:292] ControllerUnpublishVolume: volume vol-07b2ea432ef9a8dce detached from node i-01c7526848c0ba28a

successful attempt from controller:

I0807 18:59:05.493856       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-0685ed47dc21535db VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:59:06.179736       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-0685ed47dc21535db" request returned {
  AttachTime: 2020-08-07 18:59:06.142 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-0685ed47dc21535db",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:59:09.310673       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:59:09.310710       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-0685ed47dc21535db through device /dev/xvdba

controller logs 7c4c4267

successful attempt:

I0807 18:54:33.408268       1 controller.go:229] ControllerPublishVolume: called with args {VolumeId:vol-07b2ea432ef9a8dce NodeId:i-01c7526848c0ba28a VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Readonly:false Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 18:54:33.942156       1 cloud.go:364] AttachVolume volume="vol-07b2ea432ef9a8dce" instance="i-01c7526848c0ba28a" request returned {
  AttachTime: 2020-08-07 18:54:33.904 +0000 UTC,
  Device: "/dev/xvdba",
  InstanceId: "i-01c7526848c0ba28a",
  State: "attaching",
  VolumeId: "vol-07b2ea432ef9a8dce"
}
I0807 18:54:40.425670       1 manager.go:197] Releasing in-process attachment entry: /dev/xvdba -> volume vol-07b2ea432ef9a8dce
I0807 18:54:40.425698       1 controller.go:268] ControllerPublishVolume: volume vol-07b2ea432ef9a8dce attached to node i-01c7526848c0ba28a through device /dev/xvdba

tgross (Member) commented Aug 7, 2020

I've also opened #8609 to cover the issue that we're not going to be able to fix in the 0.12.x timeframe.

tgross (Member) commented Aug 11, 2020

Ok, one last item that isn't going to make it into the 0.12.2 release but can be worked around: if you have HA controllers, you need to make sure they don't land on the same client during node drains (#8628).
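As an aside, a quick way to spot-check that placement is to list the controller job's allocations via the Nomad Go API client and flag any that share a node. The job name below ("plugin-aws-ebs-controller") is just a placeholder; substitute whatever your controller job is called.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "plugin-aws-ebs-controller" is a placeholder job name.
	allocs, _, err := client.Jobs().Allocations("plugin-aws-ebs-controller", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	seen := map[string]string{} // node ID -> alloc ID
	for _, a := range allocs {
		if a.ClientStatus != "running" {
			continue
		}
		if other, ok := seen[a.NodeID]; ok {
			fmt.Printf("controller allocs %s and %s share node %s\n", other, a.ID, a.NodeID)
		}
		seen[a.NodeID] = a.ID
	}
}
```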

But otherwise testing is showing that the work that's landed for 0.12.2 should close this out. That will be released shortly.

tgross closed this as completed Aug 11, 2020
github-actions bot commented Nov 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 3, 2022