csi: Task unable to run with csi_hook error "Device path not provided" #10432

Closed
jrlonan-gt opened this issue Apr 23, 2021 · 16 comments

@jrlonan-gt

jrlonan-gt commented Apr 23, 2021

Nomad version

Nomad v1.0.2 (4c1d4fc)

Operating system and Environment details

  • RedHat Linux 4.18.0-240.10.1.el8_3.x86_64
  • amazon/aws-ebs-csi-driver v0.9.0

Issue

Stateful job fails with the error "failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided". The Nomad cluster is running in AWS using EBS volumes and CSI plugin v0.9.0.

Reproduction steps

Re-run the job with the same jobspec several times.

Expected Result

Task running

Actual Result

Task failed to run with the following error:

Time: 2021-04-23T10:34:52+08:00
Type: Setup Failure
Description: failed to setup alloc: pre-run hook "csi_hook" failed: rpc error: code = InvalidArgument desc = Device path not provided

Job file (if appropriate)

task "prometheus" {
      driver = "docker"

      resources {
        cpu    = 250
        memory = 512
      }

      config {
        image      = "prom/prometheus:v2.25.0"
        force_pull = true
        ports = ["prometheus_ui"]
        network_mode = "host"


        mount {
          type = "volume"
          target = "/prometheus"
          source = "prometheus_2"
          readonly = false
        }
      }
}
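
(Note: the snippet above is truncated and does not show the group-level volume stanza that triggers the csi_hook. A minimal sketch of what that stanza and the matching volume_mount typically look like on Nomad 1.0, with the volume name taken from the client logs below and the group name assumed:)

group "monitoring" {
  volume "prometheus_2" {
    type      = "csi"
    source    = "prometheus_2"
    read_only = false
  }

  task "prometheus" {
    volume_mount {
      volume      = "prometheus_2"
      destination = "/prometheus"
      read_only   = false
    }
    # ... driver/config as above ...
  }
}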

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

I0423 02:34:52.458147 1 node.go:234] NodeUnstageVolume: /csi/staging/prometheus_2/rw-file-system-single-node-writer target not mounted

I0423 02:34:52.444934 1 node.go:100] NodeStageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 PublishContext:map[] StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0} 

E0423 02:34:52.445112 1 driver.go:115] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided 

I0423 02:34:52.457871 1 node.go:210] NodeUnstageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Output: umount: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer: no mount point specified.

Unmounting arguments: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

E0423 02:34:52.457340 1 driver.go:115] GRPC error: rpc error: code = Internal desc = Could not unmount "/csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer": unmount failed: exit status 32

I0423 02:34:52.449659 1 mount_linux.go:202] Unmounting /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

I0423 02:34:52.449651 1 node.go:378] NodeUnpublishVolume: unmounting /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-write

I0423 02:34:52.449608 1 node.go:367] NodeUnpublishVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 TargetPath:/csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0} 
@jrlonan-gt jrlonan-gt changed the title csi: → csi: Task unable to run with csi_hook error "Device path not provided" Apr 23, 2021
@tgross
Member

tgross commented Apr 23, 2021

Hi @jrlonan-gt! The relevant error seems to be this bit:

I0423 02:34:52.457871 1 node.go:210] NodeUnstageVolume: called with args {VolumeId:vol-0460d38250e2ce4e8 StagingTargetPath:/csi/staging/prometheus_2/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
Output: umount: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer: no mount point specified.
Unmounting arguments: /csi/per-alloc/551e73ca-adc7-2423-2515-5bba10d2065a/prometheus_2/rw-file-system-single-node-writer

Can you provide the volume specification you're using here?

@notnoop notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 25, 2021
@notnoop notnoop moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Apr 25, 2021
@jrlonan-gt
Author

@tgross

Can you provide the volume specification you're using here?

  • AWS EBS gp2 volume, 500 GB
  • Filesystem: ext4
  • Access mode: single-node-writer
  • Attachment mode: file-system
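
(For reference, those parameters correspond to a volume registration file along these lines. This is a sketch only: the external ID and plugin ID are taken from the client logs and comments in this thread, the rest is assumed.)

# volume registration (sketch)
type            = "csi"
id              = "prometheus_2"
name            = "prometheus_2"
external_id     = "vol-0460d38250e2ce4e8"
access_mode     = "single-node-writer"
attachment_mode = "file-system"
plugin_id       = "aws-ebs0"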

@jrlonan-gt
Author

jrlonan-gt commented Apr 26, 2021

FYI, I applied changes to the config and updated the Nomad jobspec for my other job that has an identical config (it was working fine), and encountered the same issue as above.

Were there any changes to the plugin within the last 2 weeks or so?

@durango
Contributor

durango commented Apr 26, 2021

I have had this same problem on 1.0.2. Tried upgrading to 1.0.4, but still the same error. For the plugin (https://github.com/kubernetes-sigs/aws-ebs-csi-driver) I've tried versions v0.9.1 to latest.

Following the Nomad MySQL stateful workloads example (https://learn.hashicorp.com/tutorials/nomad/stateful-workloads-csi-volumes?in=nomad/stateful-workloads), still with no luck.

This thread shows where we're hitting the error; I have not debugged the CSI plugin yet:

https://discuss.hashicorp.com/t/ebs-csi-driver-csi-hook-failed-device-path-not-provided/22845/6

@jrlonan-gt
Author

Checking the logs, my ebs-controller and ebs-nodes allocations were in a bad state: at one point the containers were restarted, but there was no "Plugin became healthy plugin: aws-ebs0" message in the event logs. Restarting the plugin controller and node jobs fixed the issue for one of the jobs.

I keep seeing these logs, I0427 02:07:32.822804 1 controller.go:334] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}, both before and after restarting the controller.

@jessequinn

jessequinn commented Apr 27, 2021

@jrlonan-gt Same thing. I can restart the controller and nodes and possibly get ONE of my jobs to work when mounting an EBS volume. However, this doesn't always work. I have the same error logs that you mention for all my nodes.

CSI plugin 0.10.1
Nomad 1.0.4

What changed recently that caused this?

controller messages

I0427 04:46:56.632120       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:26.632782       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:56.634103       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:26.634813       1 controller.go:360] ControllerGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

node messages

I0427 04:46:04.780411       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:46:34.781220       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:04.781932       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:34.782659       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:04.784919       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:46:34.681205       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:04.681874       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:34.683278       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.678816       1 node.go:105] NodeStageVolume: called with args {VolumeId:vol-0998a88f677e69072 PublishContext:map[] StagingTargetPath:/csi/staging/hml_redis/rw-file-system-single-node-writer VolumeCapability:mount:<> access_mode:<mode:SINGLE_NODE_WRITER >  Secrets:map[] VolumeContext:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.678878       1 node.go:157] NodeStageVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.678888       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
E0427 04:47:50.678893       1 driver.go:119] GRPC error: rpc error: code = InvalidArgument desc = Device path not provided
I0427 04:47:50.681986       1 node.go:407] NodeUnpublishVolume: called with args {VolumeId:vol-0998a88f677e69072 TargetPath:/csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.682023       1 node.go:425] NodeUnpublishVolume: unmounting /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer
I0427 04:47:50.683521       1 node.go:421] NodeUnPublishVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.683538       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
E0427 04:47:50.683543       1 driver.go:119] GRPC error: rpc error: code = Internal desc = Could not unmount "/csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer": unmount failed: exit status 32
Unmounting arguments: /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer
Output: umount: /csi/per-alloc/5b6cf0f1-0250-8f1e-9388-0adde2d702e0/hml_redis/rw-file-system-single-node-writer: no mount point specified.
I0427 04:47:50.683868       1 node.go:236] NodeUnstageVolume: called with args {VolumeId:vol-0998a88f677e69072 StagingTargetPath:/csi/staging/hml_redis/rw-file-system-single-node-writer XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:47:50.684090       1 node.go:268] [Debug] NodeUnstageVolume: /csi/staging/hml_redis/rw-file-system-single-node-writer target not mounted
I0427 04:47:50.684097       1 node.go:251] NodeUnStageVolume: volume="vol-0998a88f677e69072" operation finished
I0427 04:47:50.684103       1 inflight.go:69] Node Service: volume="vol-0998a88f677e69072" operation finished
I0427 04:48:04.684056       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:48:34.685577       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:49:04.686269       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0427 04:49:34.687335       1 node.go:498] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

And an example of a volume HCL:

# volume registration
type = "csi"
id = "hml_redis"
name = "hml_redis"
external_id = "vol-0998a88f677e69072"
access_mode = "single-node-writer"
attachment_mode = "file-system"
plugin_id = "aws-ebs0"

@jrlonan-gt
Author

jrlonan-gt commented Apr 27, 2021

@jessequinn For the other job that still failed after I restarted the controller and all the node plugins, it returned a different error message (which I didn't capture, unfortunately, but it implied the volume was not attached to the correct EC2 instance where the job runs). I had to manually detach the volume from the Nomad client via the AWS console and re-create the Nomad job, and it successfully re-attached the EBS volume to the allocation.

Still would like to know the root cause, and hoping for a more stable release 🙏

@jessequinn

@jrlonan-gt Are you suggesting that the issue comes from the EBS volumes not having been properly detached from a previous attachment? That could make sense. Let me get our cloud team to confirm that the EBS volumes aren't still attached in some manner. However, it is weird that this would be the case, considering the volumes in Nomad are "schedulable".

@jessequinn

@jrlonan-gt By the way, I cannot get any EBS volume to work now, even with restarting nodes and controllers, WHEN I have more than one job using a volume (not the same volumes).

@jessequinn

FYI, I applied changes to the config and updated the Nomad jobspec for my other job that has an identical config (it was working fine), and encountered the same issue as above.

Were there any changes to the plugin within the last 2 weeks or so?

Like @durango, I tested 0.9.0 to 0.10.1; all the same issues. We tested against Nomad 1.0.0 and 1.0.4, same thing.

@jessequinn

jessequinn commented Apr 27, 2021

In our situation we rebooted the machine and Nomad/the CSI plugin lost those connections; however, the EBS volumes were still attached. Detaching them as suggested by @jrlonan-gt worked.

@durango
Contributor

durango commented Apr 27, 2021

I have rotated machines through AWS's ASG, as well as manually attaching and detaching the EBS drive from each server and client throughout my cluster. Still the same issue. This is also a new volume registered under a unique/different ID and external ID. I have had this same issue since 0.11.0.

@jessequinn

I had to restart controllers and nodes, remove volumes and jobs from Nomad, and run nomad system gc. In addition, I detached the EBS volumes. Once that was completed, it all worked out.

@durango
Contributor

durango commented Apr 27, 2021

I had to do the following to get this to work today:

  • Detach and deregister all drives from CSI
  • Remove all jobs + tasks related to CSI with -purge
  • nomad system gc
  • [Important] register the controller + nodes with a new, never-used-before:
    • plugin ID
    • csi directory (and create this new directory within each client)
    • add user = "root" to the task stanza
    • add AWS_REGION=us-east-1 to the task's env stanza; I was getting a "metadata not found" error in the logs
  • set up a new volume with a name, ID, and external ID that has never been introduced into the cluster before

If I used the previous plugin ID or CSI directory, I would run into the same issue (despite detaching/deregistering/purging jobs, etc.). Hope this helps others down the road.
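
(For anyone following along, a rough sketch of what the node-plugin task might look like with those workarounds applied. The plugin ID, mount dir, datacenter, image tag, and region below are illustrative, not values confirmed in this thread.)

job "plugin-aws-ebs-nodes" {
  datacenters = ["dc1"]
  type        = "system"

  group "nodes" {
    task "plugin" {
      driver = "docker"
      user   = "root"                # run the plugin task as root, per the workaround above

      env {
        AWS_REGION = "us-east-1"     # avoids the "metadata not found" error mentioned above
      }

      config {
        image      = "amazon/aws-ebs-csi-driver:v0.10.1"   # illustrative version
        privileged = true
        args = [
          "node",
          "--endpoint=unix://csi/csi.sock",
          "--logtostderr",
          "--v=5",
        ]
      }

      csi_plugin {
        id        = "aws-ebs1"       # a plugin ID never used in the cluster before
        type      = "node"
        mount_dir = "/csi-v2"        # a fresh CSI directory, per the workaround above
      }

      resources {
        cpu    = 250
        memory = 128
      }
    }
  }
}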

@tgross tgross self-assigned this May 19, 2021
@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage May 19, 2021
@tgross tgross removed their assignment May 19, 2021
@tgross tgross added the stage/needs-verification Issue needs verifying it still exists label Feb 3, 2022
@tgross
Member

tgross commented Feb 3, 2022

Following up on this unfortunately-neglected issue with a few notes as I start on related work:

I'm going to close this one out, but if a similar issue comes back up post-1.2.5, please feel free to comment or reopen with more info.

@tgross tgross closed this as completed Feb 3, 2022
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Feb 3, 2022
@tgross tgross removed the stage/needs-verification Issue needs verifying it still exists label Feb 3, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 11, 2022