CSI: client restarts cause later plugin restarts to fail #12744

Closed
kvkang opened this issue Apr 22, 2022 · 7 comments · Fixed by #12752


kvkang commented Apr 22, 2022

Nomad version

Nomad v1.2.6 (a6c6b47)

Operating system and Environment details

Ubuntu 20.04.3 LTS

Issue

  1. A Nomad client restarts while a CSI plugin task is deployed on it.
  2. Nomad skips the already-done prestart hook csi_plugin_supervisor, because task_runner has recorded the hook state in its local state.
  3. But csi_plugin_supervisor adds devMount ("/dev") and configMount ("/csi") to task_runner.hookResources.Mounts inside csi_plugin_supervisor.Prestart, which is the step being skipped.
  4. So task_runner.hookResources.Mounts loses its configuration after the Nomad restart, when the client restores allocs.
  5. At that point, restart Docker or kill the CSI container.
  6. Nomad restarts the container using task_runner.hookResources.Mounts, which is now missing "/dev" and "/csi", so the CSI container fails to start (a rough sketch of this flow follows below).
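
The following is a rough, self-contained Go sketch of the flow described above, using simplified, hypothetical types rather than Nomad's actual internals: the mounts are only added in Prestart and live only in memory, so skipping Prestart after a client restart leaves them empty.

package main

import "fmt"

// Mount and hookResources stand in for Nomad's internal types in this sketch.
type Mount struct{ HostPath, TaskPath string }

type hookResources struct{ Mounts []Mount }

type supervisorHook struct{}

// Prestart is the only place the supervisor hook registers its mounts; they
// are in-memory state, not part of the hook state persisted to local state.
func (h *supervisorHook) Prestart(res *hookResources) {
	res.Mounts = append(res.Mounts,
		Mount{HostPath: "/dev", TaskPath: "/dev"},
		Mount{HostPath: "/host/plugin/dir", TaskPath: "/csi"}, // hypothetical host path
	)
}

func startTask(res *hookResources, hook *supervisorHook, prestartDoneInLocalState bool) {
	// The task runner skips prestart hooks whose "done" state was persisted.
	if !prestartDoneInLocalState {
		hook.Prestart(res)
	}
	fmt.Println("starting container with mounts:", res.Mounts)
}

func main() {
	// Fresh client: Prestart runs and the container gets /dev and /csi.
	fresh := &hookResources{}
	startTask(fresh, &supervisorHook{}, false)

	// After a client restart: local state says Prestart already ran, so it is
	// skipped, and the restored in-memory resources have no mounts at all.
	restored := &hookResources{}
	startTask(restored, &supervisorHook{}, true)
}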

Reproduction steps

  1. Deploy a CSI job whose task sets a restart block like:
restart {
	interval = "2s"
	attempts = 2
	delay    = "1s"
	mode     = "fail"
}
  2. Restart the Nomad client.
  3. Kill the CSI container or restart Docker.

Expected Result

The CSI container starts with the "/csi" and "/dev" mounts.

Actual Result

The container fails to start and the alloc log shows:
F0422 05:28:57.588532 1 server.go:81] Failed to listen: listen unix /csi/csi.sock: bind: no such file or directory

Job file (if appropriate)

job "hostpath" {
    name = "hostpath"
    namespace = "hostpath"    
    type = "service"
    datacenters = [
        "not-assign",
    ]
    group "hostpath" {
        constraint {
            attribute = "${node.unique.id}"
            operator  = "="
            value     = "c218c3b4-f6e5-9f15-52c1-15a1541f6460"
        }
        task "container" {
            driver = "docker"
            config {
              image = "quay.io/k8scsi/hostpathplugin:v1.2.0"
              args = [
                "--drivername=csi-hostpath",
                "--v=5",
                "--endpoint=unix://csi/csi.sock",
                "--nodeid=node-${NOMAD_ALLOC_INDEX}",
              ]
              privileged = true
              force_pull = true
            }
            csi_plugin {
              id        = "hostpath-plugin0"
              type      = "monolith" #node" # doesn't support Controller RPCs
              mount_dir = "/csi"
            }
            restart {
                interval = "2s"
                attempts = 2
                delay    = "1s"
                mode     = "fail"
            }
        }
    }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@tgross tgross self-assigned this Apr 22, 2022

tgross commented Apr 22, 2022

Hi @kvkang!

I was able to reproduce what you're seeing. What makes this a little interesting is that there are two variants of the problem I could create: one where the plugin task was killed while the client was offline, and one where the plugin task was killed after the client came back from being offline.

If any task (not just CSI plugins) stops while the Nomad client is stopped, the Nomad client will be unable to restore the task handle and will have to restart the task. But in the case of a CSI plugin we can't come back from that restart.

If I kill a plugin task without having restarted the client, everything works as expected:

Recent Events:
Time                       Type                   Description
2022-04-22T09:42:51-04:00  Started                Task started by client
2022-04-22T09:42:50-04:00  Restarting             Task restarting in 15.846198796s
2022-04-22T09:42:50-04:00  Terminated             Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:42:06-04:00  Plugin became healthy  plugin: org.democratic-csi.nfs
2022-04-22T09:41:56-04:00  Started                Task started by client
2022-04-22T09:41:56-04:00  Task Setup             Building Task Directory
2022-04-22T09:41:56-04:00  Received               Task received by client

But if I restart the client and wait for it to come back, then kill the plugin task again, we get the error you reported:

Recent Events:
Time                       Type                     Description
2022-04-22T09:44:05-04:00  Plugin became unhealthy  Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/nomad/data/client/csi/plugins/74c8f452-3f3e-8d8f-87ba-546c01df9edf/csi.sock: connect: connection refused"
2022-04-22T09:43:51-04:00  Restarting               Task restarting in 18.692336688s
2022-04-22T09:43:51-04:00  Terminated               Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-04-22T09:43:50-04:00  Started                  Task started by client
2022-04-22T09:43:49-04:00  Restarting               Task restarting in 15.691112908s
2022-04-22T09:43:49-04:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:43:35-04:00  Plugin became healthy    plugin: org.democratic-csi.nfs
2022-04-22T09:42:51-04:00  Started                  Task started by client
2022-04-22T09:42:50-04:00  Restarting               Task restarting in 15.846198796s
2022-04-22T09:42:50-04:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"

If we don't wait for the client to come back and kill the plugin task while the client is offline, we get a crash loop but with slightly different error messages:

Recent Events:
Time                       Type        Description
2022-04-22T09:30:05-04:00  Restarting  Task restarting in 15.954528487s
2022-04-22T09:30:05-04:00  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-04-22T09:30:05-04:00  Started     Task started by client
2022-04-22T09:29:44-04:00  Restarting  Task restarting in 18.622212124s
2022-04-22T09:29:44-04:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:28:07-04:00  Started     Task started by client
2022-04-22T09:28:07-04:00  Task Setup  Building Task Directory
2022-04-22T09:28:06-04:00  Received    Task received by client

The plugin error I get is similar:

{"level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.6.1, driver: nfs-client, mode: controller, csi version: 1.5.0, address: , socket: unix:///csi/csi.sock","service":"democratic-csi"}
Error: No address added out of total 1 resolved
at bindResultPromise.then.errorString (/home/csi/app/node_modules/@grpc/grpc-js/build/src/server.js:415:42)
at processTicksAndRejections (node:internal/process/task_queues:96:5)

In both cases, eventually, after we've gone through the restart.attempts, the allocation will get rescheduled, and then it'll be starting from scratch and work just fine. But we definitely don't want to have to go back to the server for a reschedule.

During normal plugin restarts, we'd expect skipping the plugin_supervisor_hook.Prestart method to be fine, but in this case we're missing the in-memory state the taskrunner has, just as you've said.
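
To make that concrete, here is a minimal Go sketch of one possible fix direction, with hypothetical names; it is not necessarily how #12752 implements it. The idea is to let a hook declare that its Prestart must run on every task start, so hooks that only populate in-memory state are never skipped based on persisted "done" markers.

package main

import "fmt"

type prestartHook interface {
	Name() string
	Prestart() error
}

// alwaysRun is a hypothetical marker interface a hook could implement to opt
// out of the "skip if already done" optimization.
type alwaysRun interface{ AlwaysRunPrestart() bool }

type csiSupervisorHook struct{}

func (h *csiSupervisorHook) Name() string            { return "csi_plugin_supervisor" }
func (h *csiSupervisorHook) AlwaysRunPrestart() bool { return true }

func (h *csiSupervisorHook) Prestart() error {
	fmt.Println("re-registering /dev and /csi mounts in memory")
	return nil
}

func runPrestart(h prestartHook, doneInLocalState bool) error {
	ar, ok := h.(alwaysRun)
	if doneInLocalState && !(ok && ar.AlwaysRunPrestart()) {
		fmt.Printf("skipping done prestart hook %q\n", h.Name())
		return nil
	}
	return h.Prestart()
}

func main() {
	// Even though local state says the hook already ran before the client
	// restart, it runs again, so the in-memory mounts get rebuilt.
	_ = runPrestart(&csiSupervisorHook{}, true)
}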

This is definitely a bug, and it's not fixed by the work we've done for CSI in Nomad 1.3.0-beta.1. I'll dig into this further and report back here shortly.

@tgross tgross changed the title CSI: skip csi_plugin_supervisor cause alloc restart fail after restart nomad CSI: client restarts cause later plugin restarts to fail Apr 22, 2022
@tgross tgross added this to the 1.3.0 milestone Apr 22, 2022

tgross commented Apr 22, 2022

Will be fixed in #12752 which will ship in Nomad 1.3.0 GA, and backported to 1.2.x and 1.1.x.


tgross commented Apr 22, 2022

That's been merged and will ship in the upcoming release. Thanks for opening this issue @kvkang!


zizon commented Apr 24, 2022

@tgross Could Poststart of csiPluginSupervisorHook be leaking a goroutine for ensureSupervisorLoop each time it runs?


tgross commented Apr 25, 2022

Hi @zizon! The context in ensureSupervisorLoop is the shutdownCtx, which gets closed in Stop. It's a long-running goroutine but shouldn't be a leak. If you have analysis to the contrary, please open a new issue with the details.
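
For reference, here is a small Go sketch of the lifecycle being described, using simplified, hypothetical names rather than Nomad's actual code: the supervisor loop goroutine watches a shutdown context that is cancelled in Stop, so it is long-running rather than leaked.

package main

import (
	"context"
	"fmt"
	"time"
)

type supervisorHook struct {
	shutdownCtx    context.Context
	shutdownCancel context.CancelFunc
}

func newSupervisorHook() *supervisorHook {
	ctx, cancel := context.WithCancel(context.Background())
	return &supervisorHook{shutdownCtx: ctx, shutdownCancel: cancel}
}

// Poststart launches the supervisor loop goroutine tied to the shutdown context.
func (h *supervisorHook) Poststart() {
	go h.ensureSupervisorLoop(h.shutdownCtx)
}

// ensureSupervisorLoop runs until the shutdown context is cancelled.
func (h *supervisorHook) ensureSupervisorLoop(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			fmt.Println("supervisor loop exiting")
			return
		case <-ticker.C:
			// probe the plugin socket, update health, etc.
		}
	}
}

// Stop cancels the shutdown context, ending the supervisor loop goroutine.
func (h *supervisorHook) Stop() {
	h.shutdownCancel()
}

func main() {
	h := newSupervisorHook()
	h.Poststart()
	time.Sleep(300 * time.Millisecond)
	h.Stop()
	time.Sleep(100 * time.Millisecond) // give the loop time to print and exit
}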


zizon commented Apr 25, 2022

Hi @zizon! The context in ensureSupervisorLoop is the shutdownCtx, which gets closed in Stop. It's a long-running goroutine but shouldn't be a leak. If you have analysis to the contrary, please open a new issue with the details.

continue on #12772


github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022