Nomad sends wrong target_path to CSI plugin during read_only volume unpublish #9259

Closed
apollo13 opened this issue Nov 4, 2020 · 9 comments · Fixed by #9323
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/storage · type/bug

Comments

@apollo13 (Contributor) commented Nov 4, 2020

Nomad version

0.12.5

Issue

I could not find anything about this in the 1.0 changelog, so I am reporting it here. I have my own CSI plugin, for which the following volume is registered:

id = "wiki"
name = "wiki"
type = "csi"
external_id = "nomad/wiki"
plugin_id = "glusterfs"
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"

Now when I use the volume in a job like this:

job "infra.test" {
  datacenters = ["dc1"]

  group "test" {
    count = 1

    volume "wiki" {
      type      = "csi"
      read_only = true 
      source    = "wiki"
    }


    task "ubuntu" {
      driver = "docker"
      volume_mount {
       volume      = "wiki"
       destination = "/test"
       read_only   = false
      }

      config {
        image = "ubuntu:20.04"
        command = "sleep"
        args = ["infinity"]
      }
    }

  }
}

I get the following logs (from my plugin):

INFO:csi:Starting server on unix://csi/csi.sock
INFO:csi.node:Received mount request for nomad/wiki at /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/ro-file-system-multi-node-multi-writer
INFO:csi.node:Received unmount request for nomad/wiki at /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer
WARNING:csi.node:Target path /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer does not exist for nomad/wiki, most likely failed during mount.

As you can see in the logs, NodePublishVolume got called with a path containing ro-file-system-multi-node-multi-writer, while NodeUnpublishVolume gets called with rw-file-system-multi-node-multi-writer (note the ro/rw asymmetry). I realize my job file might be buggy (it specifies read_only twice, once for the volume and once for volume_mount), but I think Nomad should still send the same path during unmount that it sent during mount.
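For reference, the per-alloc target path in the logs decomposes as mount_dir/per-alloc/alloc_id/volume_id/<ro|rw>-<attachment_mode>-<access_mode>. Here is a small illustrative Go sketch (the helper name and shape are mine, not Nomad's) that splits the two paths seen above to show that only the final component differs:

package main

import (
	"fmt"
	"strings"
)

// splitTargetPath breaks a per-alloc CSI target path into its components,
// following the layout visible in the plugin logs above:
// <mount_dir>/per-alloc/<alloc_id>/<volume_id>/<ro|rw>-<attachment_mode>-<access_mode>.
// This helper is for illustration only.
func splitTargetPath(p string) (allocID, volID, usage string, err error) {
	parts := strings.Split(strings.TrimPrefix(p, "/csi/per-alloc/"), "/")
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("unexpected target path %q", p)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	mounted := "/csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/ro-file-system-multi-node-multi-writer"
	unmounted := "/csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer"

	for _, p := range []string{mounted, unmounted} {
		alloc, vol, usage, _ := splitTargetPath(p)
		// Only the final (usage) component differs: ro-... at publish time, rw-... at unpublish.
		fmt.Println(alloc, vol, usage)
	}
}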

@apollo13 (Contributor, Author) commented Nov 4, 2020

Even if I remove read_only completely from volume_mount, it still sends the unmount request as rw.

@apollo13 (Contributor, Author) commented Nov 4, 2020

If it helps: my plugin does not support volume staging; it only implements publish/unpublish.
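For context, a CSI node plugin advertises the lack of staging support simply by not returning the STAGE_UNSTAGE_VOLUME capability. Below is a hypothetical skeleton using the CSI spec's Go bindings, not the reporter's actual plugin; the other NodeServer methods and the gRPC server wiring are omitted:

package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is an illustrative skeleton of a node plugin that, like the one
// in this issue, only implements publish/unpublish.
type nodeServer struct{}

// NodeGetCapabilities returns an empty capability list. Because
// STAGE_UNSTAGE_VOLUME is not advertised, the CO (Nomad) must not call
// NodeStageVolume/NodeUnstageVolume and goes straight to NodePublishVolume.
func (s *nodeServer) NodeGetCapabilities(
	ctx context.Context, req *csi.NodeGetCapabilitiesRequest,
) (*csi.NodeGetCapabilitiesResponse, error) {
	return &csi.NodeGetCapabilitiesResponse{
		Capabilities: []*csi.NodeServiceCapability{},
	}, nil
}

func main() {
	s := &nodeServer{}
	resp, _ := s.NodeGetCapabilities(context.Background(), &csi.NodeGetCapabilitiesRequest{})
	fmt.Println("advertised node capabilities:", resp.Capabilities) // empty: no staging
}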

@tgross (Member) commented Nov 4, 2020

@apollo13 can you share the plugin job file and the Nomad debug logs for the client around the time of mounting?

@apollo13 (Contributor, Author) commented Nov 4, 2020

The plugin job:

job "infra.storage" {
  datacenters = ["dc1"]

  type = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image = "internal_image"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--node-id=${attr.unique.hostname}.domain",
          "--vol-server=${attr.unique.hostname}.domain"
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "glusterfs"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Debug logs (I stripped the status/leader queries):

2020-11-04T14:23:17.162+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=800.551µs
2020-11-04T14:23:17.225+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:48996
2020-11-04T14:23:17.893+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=105.403µs
2020-11-04T14:23:21.376+0100 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-11-04T14:23:21.713+0100 [DEBUG] client: updated allocations: index=275309 total=11 pulled=1 filtered=10
2020-11-04T14:23:21.713+0100 [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=10
2020-11-04T14:23:21.714+0100 [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=10 errors=0
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad args=[/opt/paas/bin/nomad, logmon]
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad pid=18518
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad
2020-11-04T14:23:21.791+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu version=2
2020-11-04T14:23:21.791+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu @module=logmon address=/tmp/plugin999567709 network=unix timestamp=2020-11-04T14:23:21.791+0100
2020-11-04T14:23:21.793+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc/logs/.ubuntu.stdout.fifo @module=logmon timestamp=2020-11-04T14:23:21.793+0100
2020-11-04T14:23:21.793+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu @module=logmon path=/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc/logs/.ubuntu.stderr.fifo timestamp=2020-11-04T14:23:21.793+0100
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: cancelling removal of container image: driver=docker image_name=ubuntu:20.04
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: image reference count incremented: driver=docker image_name=ubuntu:20.04 image_id=sha256:d70eaf7277eada08fca944de400e7e4dd97b1262c06ed2b1011500caa4decaf1 references=1
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: configured resources: driver=docker task_name=ubuntu memory=314572800 memory_reservation=0 cpu_shares=100 cpu_quota=0 cpu_period=0
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ubuntu binds="[]string{"/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc:/alloc", "/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/ubuntu/local:/local", "/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/ubuntu/secrets:/secrets"}"
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: networking mode not specified; using default: driver=docker task_name=ubuntu
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: setting container startup command: driver=docker task_name=ubuntu command="sleep infinity"
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=ubuntu labels=map[com.hashicorp.nomad.alloc_id:548e9221-7677-be2f-20bf-892d5ee2364d]
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ubuntu container_name=ubuntu-548e9221-7677-be2f-20bf-892d5ee2364d
2020-11-04T14:23:21.843+0100 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087
2020-11-04T14:23:21.955+0100 [DEBUG] client: updated allocations: index=275312 total=11 pulled=0 filtered=11
2020-11-04T14:23:21.955+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:21.955+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:22.142+0100 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/opt/paas/bin/nomad args=[/opt/paas/bin/nomad, docker_logger]
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/opt/paas/bin/nomad pid=18586
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/opt/paas/bin/nomad
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker network=unix @module=docker_logger address=/tmp/plugin860008604 timestamp=2020-11-04T14:23:22.147+0100
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2020-11-04T14:23:22.148+0100
2020-11-04T14:23:22.365+0100 [DEBUG] client: updated allocations: index=275313 total=11 pulled=0 filtered=11
2020-11-04T14:23:22.366+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:22.366+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:27.182+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=701.906µs
2020-11-04T14:23:27.225+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:49024
2020-11-04T14:23:27.913+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=88.676µs
2020-11-04T14:23:32.176+0100 [DEBUG] nomad: memberlist: Initiating push/pull sync with: nomad01.global 172.22.3.201:4648
2020-11-04T14:23:32.369+0100 [DEBUG] client: updated allocations: index=275315 total=11 pulled=0 filtered=11
2020-11-04T14:23:32.369+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:32.369+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:36.081+0100 [DEBUG] client: updated allocations: index=275320 total=11 pulled=1 filtered=10
2020-11-04T14:23:36.081+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=10
2020-11-04T14:23:36.082+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=10 errors=0
2020-11-04T14:23:36.371+0100 [DEBUG] client: updated allocations: index=275322 total=11 pulled=0 filtered=11
2020-11-04T14:23:36.371+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:36.371+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:37.202+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=727.032µs
2020-11-04T14:23:37.226+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:49054
2020-11-04T14:23:37.939+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=161.185µs
2020-11-04T14:23:41.342+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087 driver=docker
2020-11-04T14:23:41.346+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/paas/bin/nomad pid=18586
2020-11-04T14:23:41.346+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2020-11-04T14:23:41.349+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu error="container stopped"
2020-11-04T14:23:41.353+0100 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:d70eaf7277eada08fca944de400e7e4dd97b1262c06ed2b1011500caa4decaf1 references=0
2020-11-04T14:23:41.354+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad pid=18518
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu
2020-11-04T14:23:41.557+0100 [DEBUG] client: updated allocations: index=275325 total=11 pulled=0 filtered=11
2020-11-04T14:23:41.557+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:41.557+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0

Btw, one more note (I hope I keep remembering it): those logs are generated via nomad monitor -log-level debug. If you need other logs, please tell me which and I'll reconfigure the server/client accordingly.

@tgross (Member) commented Nov 4, 2020

As you can see in the logs, NodePublishVolume got called with a path containing ro-file-system-multi-node-multi-writer, while NodeUnpublishVolume gets called with rw-file-system-multi-node-multi-writer (note the ro/rw asymmetry).

Ok, I think I figured out what's happening here, and it's definitely not just your plugin. In 0.12.2 (in PR #8580) we made a change to fix a bug during node drains. The workflow changed slightly so that the client sends the unpublish RPC to the server, and the server then fires the node unpublish back to the client once it has figured out whether it needs to do anything with a controller. But there's a bug in that logic: we've effectively erased the "ro" information that's available on the claim when we send it up. It shouldn't be a huge lift to fix, so I'll make sure to get that into the next 1.0 beta release.

Interestingly, earlier today I was doing some end-to-end testing for #9239 and was finding unexpected behaviors during unpublish that aren't getting picked up by our E2E suite. So I think I have the test case all ready to go!
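To make the failure mode described above concrete, here is a hypothetical Go sketch (these are not Nomad's real types or code) of how losing the claim's read-only flag on the client/server round-trip yields the ro/rw path mismatch from the original report:

package main

import "fmt"

// These structs are illustrative stand-ins, not Nomad's actual RPC types.

// publishedClaim is what the client knows when it first mounts the volume.
type publishedClaim struct {
	AllocID  string
	VolumeID string
	ReadOnly bool
}

// unpublishArgs models the round-trip through the server introduced for the
// node-drain fix; note that nothing carries the ReadOnly flag back down.
type unpublishArgs struct {
	AllocID  string
	VolumeID string
}

// targetPath rebuilds the per-alloc path from whatever information is at hand.
// With ReadOnly lost, the unpublish side defaults to "rw" and no longer
// matches the "ro" path that was actually mounted.
func targetPath(allocID, volID string, readOnly bool) string {
	mode := "rw"
	if readOnly {
		mode = "ro"
	}
	return fmt.Sprintf("/csi/per-alloc/%s/%s/%s-file-system-multi-node-multi-writer",
		allocID, volID, mode)
}

func main() {
	c := publishedClaim{AllocID: "05036ffa", VolumeID: "wiki", ReadOnly: true}
	fmt.Println("published at:  ", targetPath(c.AllocID, c.VolumeID, c.ReadOnly))

	// Buggy round-trip: the claim comes back without the read-only information.
	args := unpublishArgs{AllocID: c.AllocID, VolumeID: c.VolumeID}
	fmt.Println("unpublished at:", targetPath(args.AllocID, args.VolumeID, false))
}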

@tgross added the stage/accepted label and removed the stage/needs-investigation label Nov 4, 2020
@apollo13 (Contributor, Author) commented Nov 5, 2020

Thank you for the investigation, this is great to hear. Between the issues I run into and possible bugs in my CSI plugin, it is often hard for me to determine what is at fault :)

@tgross (Member) commented Nov 5, 2020

Between the issues I run into and possible bugs in my CSI plugin, it is often hard for me to determine what is at fault

Understood on that. Thanks for trying this all out -- having someone using Nomad who's working on a custom plugin is a great way for us to find lots of issues in the CSI beta!

@tgross (Member) commented Nov 11, 2020

The fix in #9323 will ship in 1.0-beta3 later this week.

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Oct 29, 2022