
[CSI] "volume in use" error deregistering volume previously associated with now stopped job #8057

Closed
holtwilkins opened this issue May 27, 2020 · 26 comments


@holtwilkins
Contributor

Getting a "volume in use" error when trying to deregister, long after the volume has stopped being in use. Possibly a recurrence or variation of #7625, but this time my whole cluster is running Nomad 0.11.2, whereas that issue was on a 0.11.0 RC.

Nomad version

v0.11.2

Operating system and Environment details

Ubuntu 16

Issue

I have jobs that have been stopped for over a day, but they're still somehow preventing me from deregistering the CSI volume that was associated with them.

Nomad Client logs (if appropriate)

[hashidev] hwilkins ~ $ nomad volume deregister test_default_1a
Error deregistering volume: Unexpected response code: 500 (volume in use: test_default_1a)
[hashidev] hwilkins ~ $ nomad volume status test_default_1a
ID                   = test_default_1a
Name                 = test_default_1a
External ID          = vol-<SCRUB>
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Schedulable          = false
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 0
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group           Version  Desired  Status    Created   Modified
090982bb  fed5596e  csi-ebs-hello-world  0        run      complete  1d3h ago  1d3h ago
[hashidev] hwilkins ~ $ nomad alloc status 090982bb
No allocation(s) with prefix or id "090982bb" found
[hashidev] hwilkins ~ $ nomad status csi-ebs-hello-world
Multiple matches found for id "csi-ebs-hello-world"

Volumes:
csi-ebs-hello-world-ebs_1a, csi-ebs-hello-world_1a
[hashidev] hwilkins ~ $ nomad status csi-ebs-hello-world-ebs_1a
ID                   = csi-ebs-hello-world-ebs_1a
Name                 = csi-ebs-hello-world-ebs_1a
External ID          = vol-<SCRUB>
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Schedulable          = false
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 0
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group           Version  Desired  Status    Created   Modified
7eba8625  fed5596e  csi-ebs-hello-world  0        run      complete  1d1h ago  1d1h ago
[hashidev] hwilkins ~ $ nomad status csi-ebs-hello-world_1a
ID                   = csi-ebs-hello-world_1a
Name                 = csi-ebs-hello-world_1a
External ID          = vol-<SCRUB>
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Schedulable          = false
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 0
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group           Version  Desired  Status    Created   Modified
af81bebc  fed5596e  csi-ebs-hello-world  0        run      complete  1d2h ago  1d2h ago

I've also verified via the AWS console that the volume itself is not attached to an instance, but in the available state.

Nomad Server logs (if appropriate)

May 27 04:58:50 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T04:58:50.174Z [ERROR] nomad.fsm: CSIVolumeDeregister failed: error="volume in use: test_default_1a"
May 27 04:58:50 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T04:58:50.174Z [ERROR] http: request failed: method=DELETE path=/v1/volume/csi/test_default_1a error="volume in use: test_default_1a" code=500
May 27 04:58:50 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T04:58:50.174Z [DEBUG] http: request complete: method=DELETE path=/v1/volume/csi/test_default_1a duration=1.07008ms
@holtwilkins
Contributor Author

I also just saw the -purge flag and tried that:

[hashidev] hwilkins ~ $ nomad status csi-ebs-hello-world-ebs_1a
ID                   = csi-ebs-hello-world-ebs_1a
Name                 = csi-ebs-hello-world-ebs_1a
External ID          = vol-<SCRUB>
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Schedulable          = false
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 0
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group           Version  Desired  Status    Created   Modified
7eba8625  fed5596e  csi-ebs-hello-world  0        run      complete  1d1h ago  1d1h ago
[hashidev] hwilkins ~ $ nomad stop -purge csi-ebs-hello-world-ebs_1a
No job(s) with prefix or id "csi-ebs-hello-world-ebs_1a" found

It's also probably worth noting that the node in question, where these old allocs were held, had been deployed with 0.11.1; those nodes have since been cycled out and replaced by nodes running 0.11.2.

[hashidev] hwilkins ~ $ nomad node status fed5596e

error fetching node stats: Unexpected response code: 404 (No path to node)
ID              = fed5596e-968a-a260-e819-0b7abf8f64bf
Name            = hashidev-worker-<SCRUB>
Class           = <none>
DC              = us-east-1
Drain           = false
Eligibility     = ineligible
Status          = down
CSI Controllers = aws-ebs0
CSI Drivers     = aws-ebs0
Host Volumes    = <none>
CSI Volumes     = <none>
Driver Status   = docker

Node Events
Time                  Subsystem  Message
2020-05-26T05:12:58Z  Cluster    Node heartbeat missed
2020-05-26T05:12:23Z  CSI        failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
2020-05-26T05:12:22Z  CSI        failed fingerprinting with error: rpc error: code = Canceled desc = context canceled
2020-05-26T05:12:22Z  Drain      Node drain complete
2020-05-26T05:12:21Z  Drain      Node drain strategy set
2020-05-26T04:00:11Z  Storage    Unmount volume
2020-05-26T04:00:04Z  Storage    Mount volume
2020-05-26T03:00:07Z  Storage    Unmount volume
2020-05-26T03:00:04Z  Storage    Mount volume
2020-05-26T02:40:09Z  Storage    Unmount volume

Allocated Resources
CPU         Memory       Disk
0/4250 MHz  0 B/6.4 GiB  0 B/65 GiB

Allocation Resource Utilization
CPU         Memory
0/4250 MHz  0 B/6.4 GiB

error fetching node stats: actual resource usage not present

Allocations
No allocations placed

So first, I'm looking for a way to forcefully get rid of the reference to this volume, and second, I want to lodge this as a bug :).

I also tried forcing nomad system gc many hours ago, and that doesn't seem to have cleared it up either.

@tgross
Member

tgross commented May 27, 2020

I'm a little confused by the CLI outputs you've provided here. It looks like you're showing me three volumes:

  • test_default_1a, claimed by alloc 090982bb
  • csi-ebs-hello-world-ebs_1a, claimed by alloc 7eba8625
  • csi-ebs-hello-world_1a, claimed by alloc af81bebc

But then you're trying to run a job stop on the volume, which isn't an operation supported on volumes:

$ nomad stop -purge csi-ebs-hello-world-ebs_1a
No job(s) with prefix or id "csi-ebs-hello-world-ebs_1a" found

By any chance do the jobs have the same name as the volume? I'm wondering if using nomad status :id without the more specific subcommands nomad volume status :id or nomad job status :id is confusing matters here.
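For example, being explicit with the subcommands (and substituting the actual job name, which I'm only assuming here is csi-ebs-hello-world) avoids the ambiguity:

$ nomad volume status csi-ebs-hello-world-ebs_1a   # asks specifically about the volume
$ nomad job status csi-ebs-hello-world             # asks specifically about the job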


Moving on to some observations:

I've also verified via the AWS console that the volume itself is not attached to an instance, but in the available state.

It looks like all 3 allocs that claim these 3 volumes are complete, so having the volume unattached is a good sign. Do you have any logs from when the volumes were detached? That might give us some insight.

This is a little weird:

$ nomad alloc status 090982bb
No allocation(s) with prefix or id "090982bb" found

I'm not sure how we could get into a state where the job would be GC'd out from under us but the allocation could still be queried by the volume API to give us information like complete status. What I'd want to look at is the jobs that are claiming the volumes, to see what they are showing for those allocations.
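Something along these lines (substituting the actual job name) would list all of the job's allocations, including ones from older instances of the job:

$ nomad job status -verbose -all-allocs csi-ebs-hello-world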

2020-05-26T05:12:23Z CSI failed fingerprinting with error: rpc error: code = Canceled desc = context canceled

Is the aws-ebs0 plugin still running and healthy on all the nodes? Can you provide a nomad plugin status aws-ebs0?

@holtwilkins
Contributor Author

Hey @tgross , thanks for the response.

You're right that there are 3 volumes that all claim the same external volume ID. I was playing around trying to develop my automation, and ended up creating 3 different volumes for the same EBS volume.

The job was previously called simply csi-ebs-hello-world; I don't think I ever created a job with the other suffixes, so those are probably only in the volume names.

[hashidev] hwilkins ~ $ nomad plugin status aws-ebs0
ID                   = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 4
Nodes Expected       = 4

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
8b9cd937  ff38c034  nodes       0        run      running  13h29m ago  13h29m ago
e2430e2c  f1b03f8f  nodes       0        run      running  13h29m ago  13h29m ago
94c3e6b7  00f7f888  nodes       0        run      running  13h29m ago  13h29m ago
81d32416  d733c1a3  nodes       0        run      running  13h29m ago  13h29m ago
e1c0f03b  f1b03f8f  controller  11       run      running  13h29m ago  13h29m ago
62504f57  00f7f888  controller  11       run      running  13h29m ago  13h29m ago

However, note that this plugin was broken for some time while I transitioned it to use Nomad ACLs (which I may be doing incorrectly, since how to do that isn't documented; the only doc I could find assumed you didn't have Nomad ACLs). I fixed it around 12h ago, and I still can't deregister the volume(s), so I'm now wondering if this has something to do with ACLs. I found some more logs on the Nomad leader as I attempt to deregister the volume:

May 27 21:04:35 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:35.726Z [ERROR] nomad.fsm: CSIVolumeDeregister failed: error="volume in use: csi-ebs-hello-world_1a"
May 27 21:04:35 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:35.726Z [ERROR] http: request failed: method=DELETE path=/v1/volume/csi/csi-ebs-hello-world_1a error="volume in use: csi-ebs-hello-world_1a" code=500
May 27 21:04:35 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:35.726Z [DEBUG] http: request complete: method=DELETE path=/v1/volume/csi/csi-ebs-hello-world_1a duration=1.269438ms
May 27 21:04:36 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:36.764Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=131.481µs
May 27 21:04:38 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:38.777Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=123.815µs
May 27 21:04:40 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:40.549Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=148.313µs
May 27 21:04:41 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:41.768Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=129.428µs
May 27 21:04:43 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:43.778Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=121.493µs
May 27 21:04:44 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:44.547Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:33592
May 27 21:04:46 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:46.770Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=131.599µs
May 27 21:04:48 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:48.781Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=117.261µs
May 27 21:04:50 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:50.550Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=135.271µs
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [DEBUG] worker: dequeued evaluation: eval_id=8d257055-07ce-ced6-a1e1-ae7f47ebcaf6
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=1733620 csi_plugin_gc_threshold=1h0m0s
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [ERROR] core.sched: failed to GC plugin: plugin_id=aws-ebs0 error="Permission denied"
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [ERROR] worker: error invoking scheduler: error="failed to process evaluation: Permission denied"
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [DEBUG] worker: nack evaluation: eval_id=8d257055-07ce-ced6-a1e1-ae7f47ebcaf6
May 27 21:04:51 hashidev-nomad-<SCRUB> nomad[8186]:     2020-05-27T21:04:51.217Z [WARN]  nomad: eval reached delivery limit, marking as failed: eval="<Eval "8d257055-07ce-ced6-a1e1-ae7f47ebcaf6" JobID: "csi-plugin-gc" Namespace: "-">"

I also see some of this:

May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.412Z [DEBUG] worker: dequeued evaluation: eval_id=9e17f245-8036-101c-a6f7-1ec0a12b61e2
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.412Z [DEBUG] core.sched: node GC scanning before cutoff index: index=1721801 node_gc_threshold=24h0m0s
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.412Z [DEBUG] worker: ack evaluation: eval_id=9e17f245-8036-101c-a6f7-1ec0a12b61e2
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.414Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world-ebs_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.414Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.414Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=test_default_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.416Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=test_default_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.416Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world-ebs_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.416Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.417Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=test_default_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.417Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world-ebs_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:     2020-05-27T21:00:46.418Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=csi-ebs-hello-world_1a error="1 error occurred:
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]:         * could not detach from node: node lookup failed: index error: UUID must be 36 characters
May 27 21:00:46 hashidev-nomad-<SCRUB>> nomad[8186]: "

@holtwilkins
Contributor Author

Hey @tgross, I've sent my plugin job definitions to nomad-oss-debug. Here's my Nomad anonymous ACL policy:

namespace "default" {
  policy = "read"

  capabilities = [
    # I want to get rid of this line
    "csi-mount-volume",
    "submit-job",
  ]
}

# View status of Nomad servers
agent {
  policy = "read"
}

# View status of Nomad clients
node {
  policy = "read"
}

# manage Raft cluster status
operator {
  policy = "read"
}

# manage namespace resource governing
quota {
  policy = "read"
}

# read plugins, like the CSI plugin
plugin {
  policy = "read"
}

@holtwilkins
Contributor Author

Hey @tgross, any thoughts on how to work around this bug? I can't seem to get Nomad to "let go" of these volume definitions because it's still tracking those allocations. And of course none of the allocations exist when I run nomad alloc status on them.

Not sure if it's helpful or not, but I also tried deploying a job to use this volume, since it's theoretically defined, and here's the relevant placement failure:

Constraint "CSI volume csi-ebs-hello-world-ebs_1a has exhausted its available writer claims": 1 nodes excluded by filter

Note that I've also recycled all nodes in the cluster at this point. The only thing I haven't done is totally destroy & recreate the Nomad raft DB from my deployment scripts. I'm fairly confident that will fix this, but I'd really like to figure out how to fix this in case we hit this in production at some point. However, if I can't get this fixed in the next day, a clean wipe of the DB is what I'm going to have to do so I can keep going on my automation here.

@holtwilkins
Contributor Author

I got this to happen again. It seems like the problem is related to draining a node. Say my job using the EBS volume was running on node x. I issued a drain on node x, and the job tried to move to node y. However, the node plugin job (which is a system job, as prescribed) didn't unmount the volume from the node. Then, when I terminated the drained node, Nomad still thought that the volume's writer claims were exhausted, even though it isn't claimed by any node (since the node originally claiming it has been destroyed).

Constraint "CSI volume csi-ebs-staticvolume-alt-service-test has exhausted its available writer claims": 1 nodes excluded by filter

@holtwilkins
Contributor Author

[hashidev] hwilkins ~ $ aws ec2 describe-volumes --volume-ids vol-00b...
{
    "Volumes": [
        {
            "Attachments": [],
            "AvailabilityZone": "us-east-1c",
            "CreateTime": "2020-05-18T02:39:41.645000+00:00",
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:<NUM>:key/<KEY>",
            "Size": 1,
            "SnapshotId": "",
            "State": "available",
            "VolumeId": "vol-00b...",
            "Iops": 100,
            "Tags": [
                {
                    "Key": "Hashistack",
                    "Value": "hashidev..."
                },
                {
                    "Key": "JobName",
                    "Value": "csi-ebs-test-default"
                }
            ],
            "VolumeType": "gp2",
            "MultiAttachEnabled": false
        }
    ]
}
[hashidev] hwilkins ~ $ nomad volume status csi-ebs-staticvolume-hello-world-vol1
ID                   = csi-ebs-staticvolume-hello-world-vol1
Name                 = csi-ebs-staticvolume-hello-world-vol1
External ID          = vol-00b...
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v0.6.0-dirty
Schedulable          = false
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 0
Nodes Expected       = 0
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group                Version  Desired  Status    Created     Modified
80da1a5b  7c981673  staticvolume-hello-world  7        stop     complete  12h46m ago  4h7m ago
17596645  0e1c4bcf  staticvolume-hello-world  7        stop     complete  18h9m ago   12h46m ago
[hashidev] hwilkins ~ $ nomad volume deregister csi-ebs-staticvolume-hello-world-vol1
Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: csi-ebs-staticvolume-hello-world-vol1)

@agrrh

agrrh commented Jun 20, 2020

I faced a similar issue when trying to deregister an absent volume.

Here's how to reproduce:

  • Run CSI plugin jobs
  • Properly register volume
  • Run some job claiming it
  • Stop it with -purge flag
  • Remove CSI plugin jobs
  • Remove volume resource
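In CLI terms, the sequence was roughly this (file and job names are just placeholders from my setup):

$ nomad job run csi-plugin-controller.nomad   # run CSI plugin jobs
$ nomad job run csi-plugin-node.nomad
$ nomad volume register mysql0.hcl            # properly register the volume
$ nomad job run mysql.nomad                   # run a job claiming it
$ nomad job stop -purge mysql                 # stop it with -purge
$ nomad job stop -purge csi-plugin-controller # remove the CSI plugin jobs
$ nomad job stop -purge csi-plugin-node
$ nomad volume deregister mysql0              # then try to remove the volume resource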

Nomad still shows the volume in the UI/CLI, and I'm not able to deregister it:

$ nomad system gc  # has no effect

$ nomad volume status 
Container Storage Interface
ID      Name    Plugin ID          Schedulable  Access Mode
mysql0  mysql0  csi.hetzner.cloud  false        single-node-writer

$ nomad volume deregister mysql0
Error deregistering volume: Unexpected response code: 500 (volume in use: mysql0)

Obviously my bad for doing things in the wrong order, but I would still be glad to have something like a nomad volume deregister -force flag for such cases.

@tgross tgross self-assigned this Jun 22, 2020
@tgross
Member

tgross commented Jun 22, 2020

@holtwilkins I'm looking into this, but in the meantime Nomad 0.11.3 shipped with some improvements to volume claim reaping (ref https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#0113-june-5-2020), so it might be worth your while to check that out.

@holtwilkins
Contributor Author

Thanks @tgross , I've reproduced this same issue on Nomad 0.11.3 shortly after it was released, unfortunately.

@tgross
Member

tgross commented Jun 23, 2020

It seems like the problem is related to draining a node. As in, my job using the EBS volume was running on node x, say. I issue a drain on node x, and the job tries to move to node y. However, the node plugin job (which is a system job as prescribed) didn't unmount the volume from the node.

Ok, so this is probably related to #8232 and #8080

The advice I gave to folks in those issues applies here as well, but that only helps prevent the issue in the future by reducing the risk of triggering the race condition between plugin teardown and volume claim reaping:

  • Run multiple controller plugins, spread across the cluster such that a node that drains always has a controller to talk to.
  • Run the node plugins as system jobs, as you're doing
  • Use the -ignore-system flag while node draining to ensure that the node plugin stays on the host even after it's done draining so as to avoid the race with the volumewatcher.
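For that last point, the drain invocation would look something like this (node ID is a placeholder):

# -ignore-system keeps system jobs, including the CSI node plugin, running during the drain
$ nomad node drain -enable -ignore-system -yes <node-id>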

@tgross
Member

tgross commented Jun 23, 2020

I've also opened #8251 for the nomad volume deregister -force flag suggestion, which is something I've been meaning to do for a while.

@wasilak

wasilak commented Jun 23, 2020

Hi,

I've also encountered the same problem. We have Nomad client nodes in an ASG, and when they got rotated/replaced we were left with multiple clients stuck in the "ineligible" state even though they were long since terminated. Removing the job using the volume, the volume itself, and the plugin did not help.

It's probably worth mentioning that, as part of autoscaling, clients are automatically drained by a Lambda before termination proceeds, and that worked as expected.

Every time GC is run or one of the servers is restarted, there are multiple messages like this:
Jun 23 15:44:26 nomad[8303]: 2020-06-23T15:44:26.510Z [ERROR] nomad.fsm: DeleteNode failed: error="csi plugin delete failed: csi_plugins missing plugin aws-ebs0"

I've even changed the defaults on the servers to:

node_gc_threshold = "30s"
csi_plugin_gc_threshold = "30s"

but that just made the messages appear more often, not only when forcing GC via the command line.
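For reference, forcing GC from the CLI or hitting the API endpoint directly both trigger the same run (the CLI wraps the API call, as far as I know):

$ nomad system gc                                     # via the CLI
$ curl -s -X PUT http://localhost:4646/v1/system/gc   # same thing via the HTTP API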

Cheers

@jfcantu

jfcantu commented Jun 26, 2020

Is there a workaround for "unsticking" volumes stuck in limbo like this? I have one that's currently attached to a five-days-dead job that I don't know how to reclaim.

@jfcantu

jfcantu commented Jul 8, 2020

Is there a workaround for "unsticking" volumes stuck in limbo like this? I have one that's currently attached to a five-days-dead job that I don't know how to reclaim.

Update: my five-days-dead job is now over two weeks dead, and the EBS volume is still attached.

To be blunt, I don't know that CSI support can even be called "beta" at this point. The current bugs are severe enough to make CSI effectively unusable, and it doesn't appear from the release notes that any progress was made on it in 0.12.

@saez0pub

saez0pub commented Jul 8, 2020

Hello, I'm facing a similar issue. Is there a release with the force option for Nomad 0.11?
It seems that http://localhost:4646/v1/volume/csi/id_of_the_volume does not update the write allocs for allocations that no longer exist.
This prevents any job from using the volume.
The only workaround I had was to register the same volume under another ID.
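For anyone else poking at the same thing, the stale claims are visible in the raw API response; something like this shows them (assuming jq is installed and the ReadAllocs/WriteAllocs field names haven't changed):

$ curl -s http://localhost:4646/v1/volume/csi/id_of_the_volume \
    | jq '{ReadAllocs: .ReadAllocs, WriteAllocs: .WriteAllocs}'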

@saez0pub

saez0pub commented Jul 8, 2020

I have the same message and can't detach or attach my volume; it has write allocations linked in the volume API, but those allocations do not exist in the allocation API:

could not detach from node: node lookup failed: index error: UUID must be 36 characters

The error seems to come from https://github.com/hashicorp/nomad/blob/master/nomad/state/state_store.go#L723,
thrown by https://github.com/hashicorp/nomad/blob/master/nomad/volumewatcher/volume_watcher.go#L312.

Could it be that an empty NodeID is preventing the volume from being detached and the lost allocations from being removed?
Could the empty NodeID be caused by a lost node?

EDIT: my volume isn't attached; it is just not claimable.

@tgross
Member

tgross commented Jul 8, 2020

Is there a workaround for "unsticking" volumes stuck in limbo like this?

The nomad volume deregister -force option was merged for 0.12.0-rc1 but unfortunately it looks like there's a bug in it that isn't going to make it into the GA (see #8380)

To be blunt, I don't know that CSI support can even be called "beta" at this point. The current bugs are severe enough to make CSI effectively unusable, and it doesn't appear from the release notes that any progress was made on it in 0.12.

I really can't disagree with you there, @jfcantu. The CSI project had been my primary focus but we were pretty short-handed for it during development. As you might note from some of the other open issues with the storage tag there's constant struggle with the spec itself and plugins that may not be bug-free themselves. I'll admit I'm not particularly proud of the shape it's in. Unfortunately I had to focus on some other projects for the 0.12.0 release cycle, but post-0.12.0 GA I've got some time carved out to spend on this to get the rest of these bugs worked out.

is there a release for the force option on nomad 0.11 ?

There is not and we won't be backporting that.

@jfcantu

jfcantu commented Jul 8, 2020

I really can't disagree with you there, @jfcantu. The CSI project had been my primary focus but we were pretty short-handed for it during development. As you might note from some of the other open issues with the storage tag there's constant struggle with the spec itself and plugins that may not be bug-free themselves. I'll admit I'm not particularly proud of the shape it's in. Unfortunately I had to focus on some other projects for the 0.12.0 release cycle, but post-0.12.0 GA I've got some time carved out to spend on this to get the rest of these bugs worked out.

That's great info. Thanks for the update! I'm eagerly anticipating the next steps.

@saez0pub

saez0pub commented Jul 10, 2020

Hello,

I have just updated to 0.12.0, and it seems your fix is in.
Most of my ghost volumes have been removed 🎉

However, I am still unable to remove this ghost allocation, which keeps the volume locked:

# nomad volume status csi-disk-0
ID                   = csi-disk-0
Name                 = csi-disk-0
External ID          = projects/myproject/zones/myzone/disks/data-csi-disk-0
Plugin ID            = gce-pd
Provider             = pd.csi.storage.gke.io
Version              = v0.7.0-gke.0
Schedulable          = true
Controllers Healthy  = 2
Controllers Expected = 2
Nodes Healthy        = 2
Nodes Expected       = 2
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = <none>
Namespace            = default

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
88cfffe3  b1042488  csi-disk  47       run      running  3d4h ago   2d2h ago
bed001f0  511b5f41  csi-disk  47       stop     lost     3d14h ago  3d4h ago
# nomad alloc-status 88cfffe3
No allocation(s) with prefix or id "88cfffe3" found
# nomad node status b1042488
No node(s) with prefix "b1042488" found
# nomad volume deregister csi-disk-0
Error deregistering volume: Unexpected response code: 500 (volume in use: csi-disk-0)
# nomad volume deregister -force csi-disk-0
Are you sure you want to force deregister volume "csi-disk-0"? [y/N] y
Error deregistering volume: Unexpected response code: 500 (volume in use: csi-disk-0)

Corresponding log

     2020-07-10T07:09:57.713Z [ERROR] nomad.fsm: CSIVolumeDeregister failed: error="volume in use: csi-disk-0"
    2020-07-10T07:09:57.714Z [ERROR] http: request failed: method=DELETE path=/v1/volume/csi/csi-disk-0?force=true error="volume in use: csi-disk-0" code=500

@tgross
Member

tgross commented Jul 10, 2020

Ok, that's an interesting state: the allocation doesn't exist at all but is showing running status for the volume. That gives me a new clue on where to look.

@tgross
Member

tgross commented Aug 7, 2020

I wanted to give a quick status update. I've landed a handful of PRs that will be released as part of the upcoming 0.12.2 release.

I believe these fixes combined should get us into pretty good shape, and #8584 will give you an escape hatch to manually detach the volume via nomad volume detach once that's merged.
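Once that lands, I'd expect the escape hatch to look roughly like this (volume and node IDs are placeholders; double-check the syntax against the release docs):

$ nomad volume detach <volume-id> <node-id>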

@tgross
Member

tgross commented Aug 10, 2020

For sake of our planning, I'm going to close this issue. We'll continue to track progress of this set of problems in #8100.

@hongkongkiwi

I'm still seeing this issue on 0.12.4. Even when all allocations are failed or stopped (or I garbage collect them so they don't exist anymore), the volume status still shows these ghost allocations.

I can deregister if I use -force, but it doesn't seem to garbage collect these ghost allocations correctly.

@andkuzmich

I see the same issue with aws-ebs-csi-driver after upgrading to v1.0.0. I stopped all nodes, replaced their Nomad binary, and started the Nomad service again. I guess the issue isn't reproduced on newer Nomad versions because I don't see any issues on GitHub. If my problem is in Nomad's environment, can someone suggest the correct way to update Nomad on the nodes and clean up its environment?

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2022