Nomad sends wrong target_path to CSI plugin during read_only volume unpublish #9259

Closed
apollo13 opened this issue Nov 4, 2020 · 9 comments · Fixed by #9323
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/storage · type/bug

Comments

@apollo13 (Contributor) commented Nov 4, 2020

Nomad version

0.12.5

Issue

I could not find anything about this in the 1.0 changelog, so I am reporting it here. I have my own CSI plugin, for which the following volume is registered:

id = "wiki"
name = "wiki"
type = "csi"
external_id = "nomad/wiki"
plugin_id = "glusterfs"
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"

Now when I use the volume in a job like this:

job "infra.test" {
  datacenters = ["dc1"]

  group "test" {
    count = 1

    volume "wiki" {
      type      = "csi"
      read_only = true 
      source    = "wiki"
    }


    task "ubuntu" {
      driver = "docker"
      volume_mount {
       volume      = "wiki"
       destination = "/test"
       read_only   = false
      }

      config {
        image = "ubuntu:20.04"
        command = "sleep"
        args = ["infinity"]
      }
    }

  }
}

I get the following logs (from my plugin):

INFO:csi:Starting server on unix://csi/csi.sock
INFO:csi.node:Received mount request for nomad/wiki at /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/ro-file-system-multi-node-multi-writer
INFO:csi.node:Received unmount request for nomad/wiki at /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer
WARNING:csi.node:Target path /csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer does not exist for nomad/wiki, most likely failed during mount.

As you can see in the logs, NodePublishVolume got called with a path containing ro-file-system-multi-node-multi-writer, while NodeUnpublishVolume gets called with rw-file-system-multi-node-multi-writer (note the ro/rw asymmetry). I realize my job file might be buggy (it specifies read_only twice, once for the volume and once for volume_mount), but I think Nomad should still send the same path during unmount that it sent during mount.
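For reference, the per-alloc target path in the logs decomposes as mount_dir/per-alloc/alloc_id/volume_id/<ro|rw>-<attachment_mode>-<access_mode>. Here is a small illustrative Go sketch (the helper name and shape are mine, not Nomad's) that splits the two paths seen above to show that only the final component differs:

package main

import (
	"fmt"
	"strings"
)

// splitTargetPath breaks a per-alloc CSI target path into its components,
// following the layout visible in the plugin logs above:
// <mount_dir>/per-alloc/<alloc_id>/<volume_id>/<ro|rw>-<attachment_mode>-<access_mode>.
// This helper is for illustration only.
func splitTargetPath(p string) (allocID, volID, usage string, err error) {
	parts := strings.Split(strings.TrimPrefix(p, "/csi/per-alloc/"), "/")
	if len(parts) != 3 {
		return "", "", "", fmt.Errorf("unexpected target path %q", p)
	}
	return parts[0], parts[1], parts[2], nil
}

func main() {
	mounted := "/csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/ro-file-system-multi-node-multi-writer"
	unmounted := "/csi/per-alloc/05036ffa-e34e-4aba-26f5-4f83b80812cb/wiki/rw-file-system-multi-node-multi-writer"

	for _, p := range []string{mounted, unmounted} {
		alloc, vol, usage, _ := splitTargetPath(p)
		// Only the final (usage) component differs: ro-... at publish time, rw-... at unpublish.
		fmt.Println(alloc, vol, usage)
	}
}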

@apollo13 (Contributor, Author) commented Nov 4, 2020

Even if I remove read_only completely from volume_mount, it still sends the unmount request as rw.

@apollo13 (Contributor, Author) commented Nov 4, 2020

If it helps: my plugin does not support volume staging; it only implements publish/unpublish.
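For context, a CSI node plugin advertises the lack of staging support simply by not returning the STAGE_UNSTAGE_VOLUME capability. Below is a hypothetical skeleton using the CSI spec's Go bindings, not the reporter's actual plugin; the other NodeServer methods and the gRPC server wiring are omitted:

package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is an illustrative skeleton of a node plugin that, like the one
// in this issue, only implements publish/unpublish.
type nodeServer struct{}

// NodeGetCapabilities returns an empty capability list. Because
// STAGE_UNSTAGE_VOLUME is not advertised, the CO (Nomad) must not call
// NodeStageVolume/NodeUnstageVolume and goes straight to NodePublishVolume.
func (s *nodeServer) NodeGetCapabilities(
	ctx context.Context, req *csi.NodeGetCapabilitiesRequest,
) (*csi.NodeGetCapabilitiesResponse, error) {
	return &csi.NodeGetCapabilitiesResponse{
		Capabilities: []*csi.NodeServiceCapability{},
	}, nil
}

func main() {
	s := &nodeServer{}
	resp, _ := s.NodeGetCapabilities(context.Background(), &csi.NodeGetCapabilitiesRequest{})
	fmt.Println("advertised node capabilities:", resp.Capabilities) // empty: no staging
}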

@tgross (Member) commented Nov 4, 2020

@apollo13 can you share the plugin job file and the Nomad debug logs for the client around the time of mounting?

@apollo13 (Contributor, Author) commented Nov 4, 2020

The plugin job:

job "infra.storage" {
  datacenters = ["dc1"]

  type = "system"

  group "node" {
    task "plugin" {
      driver = "docker"

      config {
        image = "internal_image"

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--node-id=${attr.unique.hostname}.domain",
          "--vol-server=${attr.unique.hostname}.domain"
        ]

        # node plugins must run as privileged jobs because they
        # mount disks to the host
        privileged = true
      }

      csi_plugin {
        id        = "glusterfs"
        type      = "node"
        mount_dir = "/csi"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Debug logs (I stripped the status/leader queries):

2020-11-04T14:23:17.162+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=800.551µs
2020-11-04T14:23:17.225+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:48996
2020-11-04T14:23:17.893+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=105.403µs
2020-11-04T14:23:21.376+0100 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-11-04T14:23:21.713+0100 [DEBUG] client: updated allocations: index=275309 total=11 pulled=1 filtered=10
2020-11-04T14:23:21.713+0100 [DEBUG] client: allocation updates: added=1 removed=0 updated=0 ignored=10
2020-11-04T14:23:21.714+0100 [DEBUG] client: allocation updates applied: added=1 removed=0 updated=0 ignored=10 errors=0
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad args=[/opt/paas/bin/nomad, logmon]
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin started: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad pid=18518
2020-11-04T14:23:21.783+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad
2020-11-04T14:23:21.791+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: using plugin: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu version=2
2020-11-04T14:23:21.791+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon.nomad: plugin address: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu @module=logmon address=/tmp/plugin999567709 network=unix timestamp=2020-11-04T14:23:21.791+0100
2020-11-04T14:23:21.793+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc/logs/.ubuntu.stdout.fifo @module=logmon timestamp=2020-11-04T14:23:21.793+0100
2020-11-04T14:23:21.793+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu @module=logmon path=/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc/logs/.ubuntu.stderr.fifo timestamp=2020-11-04T14:23:21.793+0100
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: cancelling removal of container image: driver=docker image_name=ubuntu:20.04
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: image reference count incremented: driver=docker image_name=ubuntu:20.04 image_id=sha256:d70eaf7277eada08fca944de400e7e4dd97b1262c06ed2b1011500caa4decaf1 references=1
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: configured resources: driver=docker task_name=ubuntu memory=314572800 memory_reservation=0 cpu_shares=100 cpu_quota=0 cpu_period=0
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: binding directories: driver=docker task_name=ubuntu binds="[]string{"/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/alloc:/alloc", "/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/ubuntu/local:/local", "/opt/paas/data/nomad/alloc/548e9221-7677-be2f-20bf-892d5ee2364d/ubuntu/secrets:/secrets"}"
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: networking mode not specified; using default: driver=docker task_name=ubuntu
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: setting container startup command: driver=docker task_name=ubuntu command="sleep infinity"
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: applied labels on the container: driver=docker task_name=ubuntu labels=map[com.hashicorp.nomad.alloc_id:548e9221-7677-be2f-20bf-892d5ee2364d]
2020-11-04T14:23:21.797+0100 [DEBUG] client.driver_mgr.docker: setting container name: driver=docker task_name=ubuntu container_name=ubuntu-548e9221-7677-be2f-20bf-892d5ee2364d
2020-11-04T14:23:21.843+0100 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087
2020-11-04T14:23:21.955+0100 [DEBUG] client: updated allocations: index=275312 total=11 pulled=0 filtered=11
2020-11-04T14:23:21.955+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:21.955+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:22.142+0100 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: starting plugin: driver=docker path=/opt/paas/bin/nomad args=[/opt/paas/bin/nomad, docker_logger]
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin started: driver=docker path=/opt/paas/bin/nomad pid=18586
2020-11-04T14:23:22.142+0100 [DEBUG] client.driver_mgr.docker.docker_logger: waiting for RPC address: driver=docker path=/opt/paas/bin/nomad
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger: using plugin: driver=docker version=2
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger.nomad: plugin address: driver=docker network=unix @module=docker_logger address=/tmp/plugin860008604 timestamp=2020-11-04T14:23:22.147+0100
2020-11-04T14:23:22.148+0100 [DEBUG] client.driver_mgr.docker.docker_logger.nomad: using client connection initialized from environment: driver=docker @module=docker_logger timestamp=2020-11-04T14:23:22.148+0100
2020-11-04T14:23:22.365+0100 [DEBUG] client: updated allocations: index=275313 total=11 pulled=0 filtered=11
2020-11-04T14:23:22.366+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:22.366+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:27.182+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=701.906µs
2020-11-04T14:23:27.225+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:49024
2020-11-04T14:23:27.913+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=88.676µs
2020-11-04T14:23:32.176+0100 [DEBUG] nomad: memberlist: Initiating push/pull sync with: nomad01.global 172.22.3.201:4648
2020-11-04T14:23:32.369+0100 [DEBUG] client: updated allocations: index=275315 total=11 pulled=0 filtered=11
2020-11-04T14:23:32.369+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:32.369+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:36.081+0100 [DEBUG] client: updated allocations: index=275320 total=11 pulled=1 filtered=10
2020-11-04T14:23:36.081+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=10
2020-11-04T14:23:36.082+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=10 errors=0
2020-11-04T14:23:36.371+0100 [DEBUG] client: updated allocations: index=275322 total=11 pulled=0 filtered=11
2020-11-04T14:23:36.371+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:36.371+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-04T14:23:37.202+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=727.032µs
2020-11-04T14:23:37.226+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:49054
2020-11-04T14:23:37.939+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=161.185µs
2020-11-04T14:23:41.342+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=d6bd0e544078e18583e1aaae32a66d0badbb0608e5a8061c428cf63915347087 driver=docker
2020-11-04T14:23:41.346+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/paas/bin/nomad pid=18586
2020-11-04T14:23:41.346+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2020-11-04T14:23:41.349+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu error="container stopped"
2020-11-04T14:23:41.353+0100 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:d70eaf7277eada08fca944de400e7e4dd97b1262c06ed2b1011500caa4decaf1 references=0
2020-11-04T14:23:41.354+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu path=/opt/paas/bin/nomad pid=18518
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu
2020-11-04T14:23:41.356+0100 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=548e9221-7677-be2f-20bf-892d5ee2364d task=ubuntu
2020-11-04T14:23:41.557+0100 [DEBUG] client: updated allocations: index=275325 total=11 pulled=0 filtered=11
2020-11-04T14:23:41.557+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-04T14:23:41.557+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0

Btw, one more note (I hope I keep remembering it): those logs are generated via nomad monitor -log-level debug. If you need other logs, please tell me which and I'll reconfigure the server/client accordingly.

@tgross (Member) commented Nov 4, 2020

As you can see in the logs, NodePublishVolume got called with a path containing ro-file-system-multi-node-multi-writer, while NodeUnpublishVolume gets called with rw-file-system-multi-node-multi-writer (note the ro/rw asymmetry).

Ok, I think I figured out what's happening here, and it's definitely not just your plugin. In 0.12.2 (in PR #8580) we made a change to fix a bug during node drains. The workflow changed slightly so that the client sends the unpublish RPC to the server, and the server then fires the node unpublish back to the client once it has figured out whether it needs to do anything with a controller. But there's a bug in that logic: we've effectively erased the "ro" information that's available on the claim when we send it up. It shouldn't be a huge lift to fix, so I'll make sure to get that into the next 1.0 beta release.

Interestingly, earlier today I was doing some end-to-end testing for #9239 and was finding unexpected behaviors during unpublish that aren't getting picked up by our E2E suite. So I think I have the test case all ready to go!
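To make the failure mode described above concrete, here is a hypothetical Go sketch (these are not Nomad's real types or code) of how losing the claim's read-only flag on the client/server round-trip yields the ro/rw path mismatch from the original report:

package main

import "fmt"

// These structs are illustrative stand-ins, not Nomad's actual RPC types.

// publishedClaim is what the client knows when it first mounts the volume.
type publishedClaim struct {
	AllocID  string
	VolumeID string
	ReadOnly bool
}

// unpublishArgs models the round-trip through the server introduced for the
// node-drain fix; note that nothing carries the ReadOnly flag back down.
type unpublishArgs struct {
	AllocID  string
	VolumeID string
}

// targetPath rebuilds the per-alloc path from whatever information is at hand.
// With ReadOnly lost, the unpublish side defaults to "rw" and no longer
// matches the "ro" path that was actually mounted.
func targetPath(allocID, volID string, readOnly bool) string {
	mode := "rw"
	if readOnly {
		mode = "ro"
	}
	return fmt.Sprintf("/csi/per-alloc/%s/%s/%s-file-system-multi-node-multi-writer",
		allocID, volID, mode)
}

func main() {
	c := publishedClaim{AllocID: "05036ffa", VolumeID: "wiki", ReadOnly: true}
	fmt.Println("published at:  ", targetPath(c.AllocID, c.VolumeID, c.ReadOnly))

	// Buggy round-trip: the claim comes back without the read-only information.
	args := unpublishArgs{AllocID: c.AllocID, VolumeID: c.VolumeID}
	fmt.Println("unpublished at:", targetPath(args.AllocID, args.VolumeID, false))
}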

@tgross added the stage/accepted label and removed the stage/needs-investigation label Nov 4, 2020
@apollo13 (Contributor, Author) commented Nov 5, 2020

Thank you for the investigation, this is great to hear. Between the issues I run into and possible bugs in my CSI plugin, it is often hard for me to determine what is at fault :)

@tgross (Member) commented Nov 5, 2020

Between the issues I run into and possible bugs in my CSI plugin, it is often hard for me to determine what is at fault

Understood on that. Thanks for trying this all out -- having someone using Nomad who's working on a custom plugin is a great way for us to find lots of issues in the CSI beta!

@tgross (Member) commented Nov 11, 2020

The fix in #9323 will ship in 1.0-beta3 later this week.

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions (bot) locked as resolved and limited conversation to collaborators Oct 29, 2022