Nomad CSI plugin is started too often #9248

Closed
apollo13 opened this issue Nov 2, 2020 · 28 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/storage · type/bug

Comments

apollo13 commented Nov 2, 2020

Nomad version

0.12.5

Operating system and Environment details

Debian 10

Issue

I am currently developing my own CSI plugin. After pushing a new version of the plugin, nomad seems to be somewhat confused (I have three nodes):

nomad job status infra.storage
ID            = infra.storage
Name          = infra.storage
Submit Date   = 2020-11-02T17:29:15+01:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
node        0       0         6        0       12        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
91238d57  8e100f4b  node        5        run      running   3m17s ago   3m10s ago
a287fb2b  a8efb906  node        5        run      running   3m17s ago   3m10s ago
6b624b22  42eb20db  node        5        run      running   3m17s ago   3m10s ago
912bdaf2  a8efb906  node        3        stop     complete  5m32s ago   3m52s ago
43fce8b6  8e100f4b  node        3        stop     complete  5m32s ago   3m52s ago
beb54cc1  42eb20db  node        3        stop     complete  5m32s ago   3m52s ago
4b1916de  a8efb906  node        2        stop     complete  25m35s ago  5m32s ago
841062f1  42eb20db  node        2        stop     complete  25m35s ago  5m32s ago
86bf414a  8e100f4b  node        2        stop     complete  25m35s ago  5m32s ago
5e753855  a8efb906  node        1        stop     running   46m45s ago  25m35s ago
3ea40540  8e100f4b  node        1        stop     running   46m45s ago  25m35s ago
00cc2bed  42eb20db  node        1        stop     running   46m45s ago  25m35s ago
6bf2b42e  a8efb906  node        0        stop     complete  53m46s ago  46m44s ago
d90b73f1  8e100f4b  node        0        stop     complete  55m14s ago  46m44s ago
e320de75  42eb20db  node        0        stop     complete  57m3s ago   46m44s ago
78b5c5d3  8e100f4b  node        0        stop     complete  1h13m ago   55m15s ago
961abe9a  42eb20db  node        0        stop     complete  1h13m ago   57m21s ago
1ef9ecfe  a8efb906  node        0        stop     complete  1h13m ago   53m45s ago

As you can see, version 1 is supposed to be stopped but it continues running (beats me why). The alloc status shows:

nomad alloc status 3ea40540
ID                   = 3ea40540-2b3a-432a-1ac2-2c5c43b80bd6
Eval ID              = 3ccea2d0
Name                 = infra.storage.node[0]
Node ID              = 8e100f4b
Node Name            = nomad01
Job ID               = infra.storage
Job Version          = 1
Client Status        = running
Client Description   = Tasks are running
Desired Status       = stop
Desired Description  = alloc is being updated due to job update
Created              = 47m34s ago
Modified             = 26m24s ago
Replacement Alloc ID = 86bf414a

Task "plugin" is "running"
Task Resources
CPU        Memory          Disk     Addresses
1/500 MHz  28 MiB/256 MiB  300 MiB  

Task Events:
Started At     = 2020-11-02T15:45:53Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type                   Description
2020-11-02T17:06:57+01:00  Killing                Sent interrupt. Waiting 5s before force killing
2020-11-02T16:45:59+01:00  Plugin became healthy  plugin: glusterfs
2020-11-02T16:45:53+01:00  Started                Task started by client
2020-11-02T16:45:47+01:00  Driver                 Downloading image
2020-11-02T16:45:47+01:00  Task Setup             Building Task Directory
2020-11-02T16:45:47+01:00  Received               Task received by client

Nomad Client logs (if appropriate)

Nov 02 16:06:57 nomad01 containerd[549]: time="2020-11-02T16:06:57.209027129Z" level=info msg="shim reaped" id=07739a2331c1d7243769ed818c59cffaaccf1f4bb2c55f66a73d5e0d4c58cf1e
Nov 02 16:06:57 nomad01 dockerd[601]: time="2020-11-02T16:06:57.219364377Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Nov 02 16:06:57 nomad01 kernel: docker0: port 1(vethb7d4380) entered disabled state
Nov 02 16:06:57 nomad01 kernel: veth929581b: renamed from eth0
Nov 02 16:06:57 nomad01 systemd-udevd[6648]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Nov 02 16:06:57 nomad01 systemd-udevd[6648]: Using default interface naming scheme 'v240'.
Nov 02 16:06:57 nomad01 kernel: docker0: port 1(vethb7d4380) entered disabled state
Nov 02 16:06:57 nomad01 kernel: device vethb7d4380 left promiscuous mode
Nov 02 16:06:57 nomad01 kernel: docker0: port 1(vethb7d4380) entered disabled state
Nov 02 16:06:57 nomad01 systemd[1]: run-docker-netns-cff181f413f3.mount: Succeeded.
Nov 02 16:06:57 nomad01 systemd[1]: var-lib-docker-overlay2-a724ab60782bdf7a43a2fee8065431ea6f6b3c670059a773810de99281e8d7da-merged.mount: Succeeded.
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.469Z [INFO]  client.gc: marking allocation for GC: alloc_id=3ea40540-2b3a-432a-1ac2-2c5c43b80bd6
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.486Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=86bf414a-6f42-bf6e-7bbf-db123b85a5d9 task=plugin @module=logmon path=/opt/paas/data/
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.487Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=86bf414a-6f42-bf6e-7bbf-db123b85a5d9 task=plugin @module=logmon path=/opt/paas/data/
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.520Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.865Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.867Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"

Did I forget to implement something? scratches head

apollo13 commented Nov 2, 2020

Btw, is there any way for me to kill those allocs now?

tgross commented Nov 2, 2020

Hi @apollo13, sorry to hear you're having trouble. Any chance you can provide debug logs from the shutdown of those allocations? Or the allocation logs, so that we can see if the process is hanging during SIGTERM somehow.

Once you've tried to grab those logs, nomad alloc stop or nomad alloc signal should let you send a termination signal to them.
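
If the CLI itself is acting up, the same stop can also be issued straight against the HTTP API; here's a minimal sketch in Python, assuming the agent is reachable on the default localhost:4646 address, ACLs are off, and the requests package is installed:

import requests

NOMAD_ADDR = "http://localhost:4646"  # assumption: default agent address, no ACL token needed

def stop_alloc(alloc_id):
    # Same endpoint the CLI uses: PUT /v1/allocation/<alloc_id>/stop
    resp = requests.put(f"{NOMAD_ADDR}/v1/allocation/{alloc_id}/stop")
    resp.raise_for_status()
    return resp.json()  # includes the EvalID created for the stop

print(stop_alloc("3ea40540-2b3a-432a-1ac2-2c5c43b80bd6"))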

apollo13 commented Nov 2, 2020

@tgross could you tell me how to get those debug logs? I added the client logs in the initial comment, which at least show an error. I'll leave the alloc running till we have more info.

apollo13 commented Nov 2, 2020

The alloc logs show:

➜  nomad-jobs git:(master) ✗ nomad alloc logs -stderr -tail  -n 2000 00cc2bed
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/python/lib/python3.9/site-packages/csi_plugin_glusterfs/__main__.py", line 123, in <module>
    serve()
  File "/opt/python/lib/python3.9/site-packages/csi_plugin_glusterfs/__main__.py", line 119, in serve
    server.wait_for_termination()
  File "/opt/python/lib/python3.9/site-packages/grpc/_server.py", line 977, in wait_for_termination
    return _common.wait(self._state.termination_event.wait,
  File "/opt/python/lib/python3.9/site-packages/grpc/_common.py", line 141, in wait
    _wait_once(wait_fn, MAXIMUM_WAIT_TIMEOUT, spin_cb)
  File "/opt/python/lib/python3.9/site-packages/grpc/_common.py", line 106, in _wait_once
    wait_fn(timeout=timeout)
  File "/usr/local/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/local/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt

Seems as if it just ignored the keyboard interrupt?

apollo13 commented Nov 2, 2020

I just tried nomad monitor -log-level debug and then stopping the allocation. There wasn't anything useful in the logs; I wonder if I got the correct ones. It didn't seem to stop the allocation either.

tgross commented Nov 2, 2020

@tgross if you tell me how to get those debug logs? I added the client logs in the initial comment which at least shows an error.

Ah, I see. Well, unfortunately those logs are at the info level. If whatever hasn't stopped is still running in a loop and logging, you might be able to get those logs via nomad operator debug. You'll want to run nomad operator debug -log-level=debug -node-id=8e100f4b -server-id=<server1>,<server2>,<server3>.

But if nothing is currently logging that won't help, as the nomad operator debug command won't get us historical logs. In that case, if you have enable_debug set, it might be helpful if you could dump a goroutine stack trace (found at http://localhost:4646/v1/agent/pprof/goroutine?debug=1). That could tell us if we've got a stuck lock somewhere.
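
For example, something like this will fetch and save that dump (a rough sketch assuming enable_debug is set and the agent listens on the default address; with ACLs enabled you'd also need to pass a token):

import requests

NOMAD_ADDR = "http://localhost:4646"  # assumption: default agent address

# Same endpoint as above; debug=1 returns the stacks as plain text.
resp = requests.get(f"{NOMAD_ADDR}/v1/agent/pprof/goroutine", params={"debug": "1"})
resp.raise_for_status()

with open("goroutine-stacks.txt", "wb") as f:
    f.write(resp.content)
print(f"wrote {len(resp.content)} bytes of goroutine stacks")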

By the way, I would strongly recommend running at log_level=debug while trying to develop something like a CSI plugin; there's just too much that can go sideways.

Seems as if it just ignored the keyboard interrupt?

That shows me it crashed when getting SIGINT, rather than exiting gracefully (also, not SIGTERM as I said incorrectly earlier). For some applications this would be fine (albeit messy in the logs), but I'm wondering if there are mounts getting bound into the allocation directory that aren't getting released. Nomad is showing the allocation in the "running" state, but it may be that it's stuck trying to run poststop hooks in the allocation runner or task runner. Is the Docker container for the plugin actually still running?
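
For comparison, here's a minimal sketch of what wiring SIGTERM/SIGINT handlers around wait_for_termination() could look like, assuming your serve() builds a standard grpc.server (the names below are illustrative, not your actual plugin code):

import signal
from concurrent import futures

import grpc


def serve(endpoint="unix:///csi/csi.sock"):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    # ... register the Identity/Node servicers here ...
    server.add_insecure_port(endpoint)
    server.start()

    def _shutdown(signum, frame):
        # Refuse new RPCs, give in-flight calls 5 seconds to finish, then
        # let wait_for_termination() return so the process exits cleanly.
        server.stop(grace=5)

    signal.signal(signal.SIGTERM, _shutdown)
    signal.signal(signal.SIGINT, _shutdown)

    server.wait_for_termination()

The task events above show Nomad sending an interrupt and waiting 5s before force-killing, so handling both signals keeps the container exit clean either way.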

apollo13 commented Nov 2, 2020

Hi Tim,

sadly I do not have any debug logs enabled :) The CSI plugin is so simple that it actually worked on the first try gg

The application seems to exit immediately with exit code 130 on SIGINT, i.e. 128 + 2 (just tested locally).

The containers stopped properly. The only thing that still thinks the alloc is running is Nomad; nomad alloc stop does not stop it :)

Since I apparently cannot use -server-id and -node-id together, I only used -node-id while trying to stop an alloc:

nomad monitor -node-id 8e100f4b -log-level debug
2020-11-02T19:08:25.098Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=35.53µs
2020-11-02T19:08:25.482Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=22.758µs
2020-11-02T19:08:26.102Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=40.313µs
2020-11-02T19:08:26.227Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=131.713µs
2020-11-02T19:08:26.390Z [DEBUG] http: request complete: method=GET path=/v1/allocations?prefix=3ea40540 duration=353.461µs
2020-11-02T19:08:26.511Z [DEBUG] http: request complete: method=GET path=/v1/allocation/3ea40540-2b3a-432a-1ac2-2c5c43b80bd6?namespace=default duration=267.043µs
2020-11-02T19:08:26.545Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=20.025µs
2020-11-02T19:08:26.637Z [DEBUG] http: request complete: method=PUT path=/v1/allocation/3ea40540-2b3a-432a-1ac2-2c5c43b80bd6/stop duration=3.886239ms
2020-11-02T19:08:26.637Z [DEBUG] client: updated allocations: index=270619 total=11 pulled=0 filtered=11
2020-11-02T19:08:26.637Z [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=11
2020-11-02T19:08:26.637Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=11 errors=0
2020-11-02T19:08:26.758Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/5b6b62bd-fb82-6cf0-8350-0c2d47e7dba1 duration=254.421µs
2020-11-02T19:08:26.804Z [DEBUG] consul.sync: sync complete: registered_services=2 deregistered_services=0 registered_checks=0 deregistered_checks=0
2020-11-02T19:08:26.881Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/5b6b62bd-fb82-6cf0-8350-0c2d47e7dba1/allocations duration=209.677µs
2020-11-02T19:08:27.106Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=55.412µs
2020-11-02T19:08:27.608Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=20.152µs
2020-11-02T19:08:27.999Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/5b6b62bd-fb82-6cf0-8350-0c2d47e7dba1 duration=256.338µs
2020-11-02T19:08:28.109Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=22.138µs
2020-11-02T19:08:28.119Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/5b6b62bd-fb82-6cf0-8350-0c2d47e7dba1/allocations duration=264.849µs
2020-11-02T19:08:28.617Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=23.79µs
2020-11-02T19:08:29.114Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=25.654µs
2020-11-02T19:08:29.589Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=175.605µs
2020-11-02T19:08:29.623Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=21.45µs

Any ideas? Otherwise I'll just kill the allocations (if that even works :D)

tgross commented Nov 2, 2020

If you can't get that pprof endpoint information, then given this is a development environment, I'd suggest sending the Nomad client agent a SIGQUIT signal. That'll get it to dump a bunch of data, including goroutine stacks, to its logs (in the systemd journal).
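
If you'd rather script it, a rough sketch run on the client node itself (the pid is just the one from your journal output above):

import os
import signal

# Assumption: 29837 is the client agent pid, as seen in the nomad[29837] journal lines above.
NOMAD_CLIENT_PID = 29837

# By default the Go runtime reacts to SIGQUIT by printing every goroutine's
# stack trace and then exiting, so treat this as disruptive outside of dev.
os.kill(NOMAD_CLIENT_PID, signal.SIGQUIT)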

apollo13 commented Nov 2, 2020

Sorry, since nomad alloc signal -s TERM resulted in a timeout I did nomad stop -purge to get rid of the CSI plugin and redeployed it. I will try to get more logs if it ever reoccurs.

I still think that the initial error:

Nov 02 16:06:57 nomad01 nomad[29837]:     2020-11-02T16:06:57.867Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"

might point at something in the source code (whatever "not found" means here). What I noticed while stopping the job, though, is that the mounts in the other jobs became broken. Since they are FUSE mounts, there seems to be at least some dependency on the container that created them. What is the recommended way to allow node plugin updates while existing mounts are in place?

tgross commented Nov 2, 2020

nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"

That happens when the client node sends a fingerprint when the allocation running the plugin is terminal but the plugin has already been removed.

What I noticed during stopping the job though is that the mounts became broken on the other jobs. Since the are fuser mounts, there seems to be at least some dependency on the container that mounted them. What is the recommended way to allow node plugin updates while existing mounts are in place?

Nomad (and k8s for that matter) only explicitly expects that the CSI plugin is available for communication during the various lifecycle events of the volume (ex. publish/unpublish). I just double-checked and the CSI specification doesn't call it out, but if the volume disappears when the plugin process exits that's not really going to work. I don't see any other examples in the driver list that jump out at me as doing this. If there's any possible way to get it so that the plugin process doesn't have to remain resident, that'll be what you want to do here.

Edit: while I'm thinking of it, if you haven't already seen the Storage internals doc, it'll probably be helpful for your development process.

apollo13 commented Nov 2, 2020

If there's any possible way to get it so that the plugin process doesn't have to remain resident, that'll be what you want to do here.

Yeah I just realized that my problem isn't the mount itself but more generally how glusterfs works:

root@6f67b76269c8:~# glusterfs -s nomad03.bap.lan --volfile-id=nomad --subdir-mount=/wiki /mnt
root@6f67b76269c8:~# ps -ef
UID         PID   PPID  C STIME TTY          TIME CMD
root          1      0  0 22:36 pts/0    00:00:00 bash
root        313      1  0 22:37 ?        00:00:00 glusterfs -s nomad03.bap.lan --volfile-id=nomad --subdir-mount=/wiki /mnt
root        323      1  0 22:37 pts/0    00:00:00 ps -ef

This means that it has a process running that is bound to the lifetime of the node plugin container. I wonder how other drivers do this (or maybe their mounts work without leaving any processes behind?). On the other hand, it kind of makes sense that a FUSE mount would have a process lying around.

Edit: while I'm thinking of it, if you haven't already see the Storage internals doc, it'll probably be helpful for your development process.

Thanks, I have seen it and the plugin mostly works. I guess my main confusion is how to mount something so it "moves" out to the host and is independent of the container as soon as the mount command is finished.

apollo13 commented Nov 3, 2020

@tgross I managed to get debug logs during a shutdown which left the alloc running. Sadly they do not show much either:

2020-11-03T11:24:33.665+0100 [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=587.809µs
2020-11-03T11:24:33.835+0100 [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=709.08µs
2020-11-03T11:24:34.530+0100 [DEBUG] client: updated allocations: index=272409 total=4 pulled=1 filtered=3
2020-11-03T11:24:34.530+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=3
2020-11-03T11:24:34.531+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=3 errors=0
2020-11-03T11:24:34.669+0100 [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=461.827µs
2020-11-03T11:24:34.747+0100 [DEBUG] client: updated allocations: index=272412 total=4 pulled=0 filtered=4
2020-11-03T11:24:34.747+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=4
2020-11-03T11:24:34.747+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=4 errors=0
2020-11-03T11:24:34.839+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/paas/bin/nomad pid=16296
2020-11-03T11:24:34.839+0100 [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2020-11-03T11:24:34.839+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=6d467f55-fb34-096e-48be-1c78a04cf745 task=plugin error="container stopped"
2020-11-03T11:24:34.859+0100 [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:9cf2a10f134b0ef4651ca956eaf2aec51b2f5a5e1dc89dcab2dc754f8811b555 references=0
2020-11-03T11:24:34.861+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=6d467f55-fb34-096e-48be-1c78a04cf745
2020-11-03T11:24:34.862+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=6d467f55-fb34-096e-48be-1c78a04cf745 task=plugin path=/opt/paas/bin/nomad pid=16212
2020-11-03T11:24:34.862+0100 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=6d467f55-fb34-096e-48be-1c78a04cf745 task=plugin
2020-11-03T11:24:34.862+0100 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=6d467f55-fb34-096e-48be-1c78a04cf745 task=plugin
2020-11-03T11:24:34.863+0100 [DEBUG] client: shutting down CSI plugin: name=glusterfs type=csi-node
2020-11-03T11:24:34.863+0100 [DEBUG] client.glusterfs: finished client unary call: grpc.code=Canceled duration=6.152µs grpc.service=csi.v1.Identity grpc.method=Probe
2020-11-03T11:24:34.897+0100 [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=1.095477ms
2020-11-03T11:24:35.005+0100 [DEBUG] client: updated allocations: index=272414 total=4 pulled=0 filtered=4
2020-11-03T11:24:35.005+0100 [DEBUG] client: allocation updates: added=0 removed=0 updated=0 ignored=4
2020-11-03T11:24:35.005+0100 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=0 ignored=4 errors=0
2020-11-03T11:24:35.108+0100 [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
2020-11-03T11:24:35.147+0100 [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
2020-11-03T11:24:35.212+0100 [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"

Does this help in any way? I'll try to nuke the job from orbit and retry :)

apollo13 commented Nov 3, 2020

Good news, I managed to get an operator debug log of the whole thing from all nodes in the cluster. Where can I send it (I'd rather not upload it here)?

EDIT:// found the email address, you've got mail :)

apollo13 commented Nov 3, 2020

Commenting solely on the following, in case others also run into this issue:

Nomad (and k8s for that matter) only explicitly expects that the CSI plugin is available for communication during the various lifecycle events of the volume (ex. publish/unpublish). I just double-checked and the CSI specification doesn't call it out, but if the volume disappears when the plugin process exits that's not really going to work. I don't see any other examples in the driver list that jump out at me as doing this. If there's any possible way to get it so that the plugin process doesn't have to remain resident, that'll be what you want to do here.

The fact that mounts are destroyed during node plugin restarts is a known thing for mounts that happen via FUSE and similar; see ceph/ceph-csi#703 for details.

tgross commented Nov 4, 2020

(or maybe their mounts work without leaving any processes behind?)

Yes, exactly that. Typically if you mount something on Unix it just stays mounted. This is why you can restart a Nomad client without destroying all the directories we bind-mount into tasks. And it's also why Nomad bugs like #7848 and #8814 exist; Nomad needs to explicitly make sure we clean up the mounts.

The fact that mounts are destroyed during node plugin restarts is a known thing for mounts that happen via FUSE and similar; see ceph/ceph-csi#703 for details.

Yeah, that's unfortunate... I'm not convinced that CSI is the right use case for something like FUSE if they're talking about running a separate service on the host anyways -- you'd be better off running the whole thing on the host and avoiding all the lifetime management complexities of CSI. (Aside: this is my take on the CSI spec in general but the FUSE use case especially underscores it! 😁 )

I managed to get an operator debug log of the whole things from all nodes in the cluster.

Great! Am I right in understanding this means this is a reproducible problem?

I dug through these and there are a couple of things that jump out at me. You've got three allocations of interest here, but for the purposes of what I'm showing below let's focus on 7e1b11e9-971c-f367-b4ab-d309aca6c9e6.

The allocation status recorded in the allocations.json snapshot doesn't match up with what you're getting in nomad alloc status nor with the debug logs. There are also no events shown here for the shutdown process:

alloc 7e1b11e9-971c-f367-b4ab-d309aca6c9e6
{
  "ID": "7e1b11e9-971c-f367-b4ab-d309aca6c9e6",
  "EvalID": "c869b27a-d7d1-52e5-ab5c-413131d02c03",
  "Name": "infra.storage.node[0]",
  "Namespace": "default",
  "NodeID": "8e100f4b-6d1b-ca4a-e8d6-b27a65ddac93",
  "NodeName": "nomad01",
  "JobID": "infra.storage",
  "JobType": "system",
  "JobVersion": 0,
  "TaskGroup": "node",
  "DesiredStatus": "run",
  "DesiredDescription": "",
  "ClientStatus": "running",
  "ClientDescription": "Tasks are running",
  "TaskStates": {
    "plugin": {
      "State": "running",
      "Failed": false,
      "Restarts": 0,
      "LastRestart": "0001-01-01T00:00:00Z",
      "StartedAt": "2020-11-03T10:35:33.468602911Z",
      "FinishedAt": "0001-01-01T00:00:00Z",
      "Events": [
        {
          "Type": "Received",
          "Time": 1604399732979323400,
          "DisplayMessage": "Task received by client",
          "Details": {},
          "Message": "",
          "FailsTask": false,
          "RestartReason": "",
          "SetupError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "Signal": 0,
          "KillReason": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": "",
          "DiskLimit": 0,
          "DiskSize": 0,
          "FailedSibling": "",
          "VaultError": "",
          "TaskSignalReason": "",
          "TaskSignal": "",
          "GenericSource": ""
        },
        {
          "Type": "Task Setup",
          "Time": 1604399732981774600,
          "DisplayMessage": "Building Task Directory",
          "Details": {
            "message": "Building Task Directory"
          },
          "Message": "Building Task Directory",
          "FailsTask": false,
          "RestartReason": "",
          "SetupError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "Signal": 0,
          "KillReason": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": "",
          "DiskLimit": 0,
          "DiskSize": 0,
          "FailedSibling": "",
          "VaultError": "",
          "TaskSignalReason": "",
          "TaskSignal": "",
          "GenericSource": ""
        },
        {
          "Type": "Started",
          "Time": 1604399733468598500,
          "DisplayMessage": "Task started by client",
          "Details": {},
          "Message": "",
          "FailsTask": false,
          "RestartReason": "",
          "SetupError": "",
          "DriverError": "",
          "DriverMessage": "",
          "ExitCode": 0,
          "Signal": 0,
          "KillReason": "",
          "KillTimeout": 0,
          "KillError": "",
          "StartDelay": 0,
          "DownloadError": "",
          "ValidationError": "",
          "DiskLimit": 0,
          "DiskSize": 0,
          "FailedSibling": "",
          "VaultError": "",
          "TaskSignalReason": "",
          "TaskSignal": "",
          "GenericSource": ""
        }
      ]
    }
  },
  "DeploymentStatus": null,
  "FollowupEvalID": "",
  "RescheduleTracker": null,
  "PreemptedAllocations": null,
  "PreemptedByAllocation": "",
  "CreateIndex": 272481,
  "ModifyIndex": 272488,
  "CreateTime": 1604399732937911000,
  "ModifyTime": 1604399733556201200
}

That's just weird and probably not related to CSI at all. Maybe the snapshot of that status was taken before the allocation stopped, though? It might be useful to take a second snapshot right after the problem occurs.

Next if we look at the logs for the shutdown of that allocation:

2020-11-03T10:36:03.399Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/paas/bin/nomad pid=31159
2020-11-03T10:36:03.399Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
2020-11-03T10:36:03.401Z [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=7e1b11e9-971c-f367-b4ab-d309aca6c9e6 task=plugin error="container stopped"
2020-11-03T10:36:03.425Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=100.867µs
2020-11-03T10:36:03.433Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:9cf2a10f134b0ef4651ca956eaf2aec51b2f5a5e1dc89dcab2dc754f8811b555 references=0
2020-11-03T10:36:03.438Z [INFO]  client.gc: marking allocation for GC: alloc_id=7e1b11e9-971c-f367-b4ab-d309aca6c9e6
2020-11-03T10:36:03.442Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=7e1b11e9-971c-f367-b4ab-d309aca6c9e6 task=plugin path=/opt/paas/bin/nomad pid=31088
2020-11-03T10:36:03.442Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=7e1b11e9-971c-f367-b4ab-d309aca6c9e6 task=plugin
2020-11-03T10:36:03.442Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=7e1b11e9-971c-f367-b4ab-d309aca6c9e6 task=plugin
2020-11-03T10:36:03.443Z [DEBUG] client: shutting down CSI plugin: name=glusterfs type=csi-node
2020-11-03T10:36:03.443Z [DEBUG] client.glusterfs: finished client unary call: grpc.code=Canceled duration=14.969µs grpc.service=csi.v1.Identity grpc.method=Probe
2020-11-03T10:36:03.500Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
2020-11-03T10:36:03.504Z [DEBUG] http: request complete: method=GET path=/v1/status/leader duration=14.26µs
2020-11-03T10:36:03.565Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"
2020-11-03T10:36:03.623Z [ERROR] nomad.fsm: UpdateAllocFromClient failed: error="csi_plugins delete error: not found"

This mostly looks like an orderly shutdown of the plugin, except of course that the plugin delete fails. Maybe this is an issue of us counting the number of active plugins incorrectly? When you have all 3 running what does nomad plugin status glusterfs look like?

apollo13 commented Nov 4, 2020

So this is the status now (I have recreated the job):

ID                   = glusterfs
Provider             = eu.apolloner.csi.glusterfs
Version              = 0.1.0
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 3
Nodes Expected       = 3

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
bdbf9aa2  42eb20db  node        0        run      running  3h18m ago  3h18m ago
92025c25  8e100f4b  node        0        run      running  3h18m ago  3h18m ago
7139ed67  a8efb906  node        0        run      running  3h18m ago  3h18m ago

After nomad job stop infra.storage and waiting a bit:

➜  nomad-jobs git:(master) ✗ nomad plugin status glusterfs
ID                   = glusterfs
Provider             = eu.apolloner.csi.glusterfs
Version              = 0.1.0
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 0
Nodes Expected       = 3

Allocations
No allocations placed
➜  nomad-jobs git:(master) ✗ nomad job status infra.storage
ID            = infra.storage
Name          = infra.storage
Submit Date   = 2020-11-04T11:45:56+01:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
node        0       0         3        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
7139ed67  a8efb906  node        0        stop     running  3h21m ago  29s ago
92025c25  8e100f4b  node        0        stop     running  3h21m ago  29s ago
bdbf9aa2  42eb20db  node        0        stop     running  3h21m ago  29s ago

Redoing a start again:

➜  nomad-jobs git:(master) ✗ nomad plugin status glusterfs
ID                   = glusterfs
Provider             = eu.apolloner.csi.glusterfs
Version              = 0.1.0
Controllers Healthy  = 0
Controllers Expected = 0
Nodes Healthy        = 3
Nodes Expected       = 6

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
3cefb4b5  42eb20db  node        2        run      running  17s ago  10s ago
a646d2ed  a8efb906  node        2        run      running  17s ago  10s ago
1ee9b902  8e100f4b  node        2        run      running  17s ago  10s ago
➜  nomad-jobs git:(master) ✗ nomad job status infra.storage
ID            = infra.storage
Name          = infra.storage
Submit Date   = 2020-11-04T15:07:32+01:00
Type          = system
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
node        0       0         6        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1ee9b902  8e100f4b  node        2        run      running  20s ago    13s ago
3cefb4b5  42eb20db  node        2        run      running  20s ago    13s ago
a646d2ed  a8efb906  node        2        run      running  20s ago    13s ago
7139ed67  a8efb906  node        0        stop     running  3h21m ago  1m23s ago
92025c25  8e100f4b  node        0        stop     running  3h21m ago  1m23s ago
bdbf9aa2  42eb20db  node        0        stop     running  3h21m ago  1m23s ago

As you can see the old allocs are still there in the job and not stopping… I'll see if I can get another (longer) operator dump today.

you'd be better off running the whole thing on the host and avoiding all the lifetime management complexities of CSI. (Aside: this is my take on the CSI spec in general but the FUSE use case especially underscores it! 😁 )

I am with you on that one, so how do I do that and tell nomad about the plugin?

apollo13 commented Nov 4, 2020

I managed to get an operator debug log of the whole things from all nodes in the cluster.

Great! Am I right in understanding this means this is a reproducible problem?

Yes, simply running nomad job run (with the job from #9259 (comment)) and then stopping it will show the allocations as "running", as in the comment above. Honestly, my CSI plugin behaves well enough (I also added logging during shutdown) that I think this is a bug in Nomad 0.12.5 that should immediately manifest with every CSI plugin. I checked the changelogs for 0.12.6 and 0.12.7 but nothing interesting stuck out there.

Edit:// Maybe it is specific to node-only plugins. You should be able to test with any CSI plugin though, if you set its type to node only. Nomad should not check the other stuff then, I think…

tgross commented Nov 4, 2020

As you can see the old allocs are still there in the job and not stopping… I'll see if I can get another (longer) operator dump today.

The important bit is that debug log around shutdown. But if there's no new information there, I'll have to go from what you've been able to give me (thanks again for that!). There's something happening here that isn't in the logs, and I want to see if I can replicate it without your custom plugin.

I am with you on that one, so how do I do that and tell nomad about the plugin?

You can't, that's a limitation of CSI itself unfortunately. I have some thoughts on how we might be able to provide a better solution (which would violate the CSI spec) but for the moment there's no workaround for that.

apollo13 commented Nov 4, 2020

I want to see if I can replicate it without your custom plugin.

If you think access to the plugin would help, I can upload the Docker container; it should only need access to an (unauthenticated) NFS server.

You can't, that's a limitation of CSI itself unfortunately. I have some thoughts on how we might be able to provide a better solution (which would violate the CSI spec) but for the moment there's no workaround for that.

I started reading the CSI spec yesterday, so excuse the question, but why does CSI limit that? Couldn't you just say:

csi_node_endpoints {
  gluster = "/run/glusterfs.csi"
}

in the Nomad client config, and Nomad could match that up to controllers using the gluster name?

@apollo13

@tgross Have you been able to reproduce the issue? Can I help with anything to make sure it ends up fixed in 1.0?

tgross commented Nov 12, 2020

I've been tied up in the recent CVE response, so not yet. I'm working on CSI stuff now though, so I'm hoping to have some answers for you here soon.

tgross commented Nov 25, 2020

@apollo13 I suspect this will be closed by #9438, which will ship in Nomad 1.0. If you don't feel like waiting a little bit (should be 2-3 weeks from now), feel free to build from master and take a crack at it with that.

@apollo13

I'll try to get a build running to verify. Thanks!

tgross commented Dec 1, 2020

There may be a non-CSI aspect to this one that I missed... we're currently working on these two issues: #9375 and #9376

apollo13 commented Dec 1, 2020

Ok, I didn't yet manage to test master :/

apollo13 commented Dec 2, 2020

@tgross Good news, I finally managed to test this (and all the other CSI issues you fixed) against the current master (and also compared with the beta) and it seems that #9438 does fix this. I do not think that the other two issues are the cause here (in this special case). As such I am closing this issue -- thank you for your help!

apollo13 closed this as completed Dec 2, 2020

tgross commented Dec 2, 2020

So glad to hear! 😁

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022