CSI: client restarts cause later plugin restarts to fail #12744

Closed
kvkang opened this issue Apr 22, 2022 · 7 comments · Fixed by #12752


kvkang commented Apr 22, 2022

Nomad version

Nomad v1.2.6 (a6c6b47)

Operating system and Environment details

Ubuntu 20.04.3 LTS

Issue

  1. A Nomad client restarts while a CSI plugin task is deployed on it.
  2. Nomad skips the already-done prestart hook csi_plugin_supervisor, because task_runner has recorded the hook state in its local state.
  3. But csi_plugin_supervisor adds devMount ("/dev") and configMount ("/csi") to task_runner.hookResources.Mounts inside csi_plugin_supervisor.Prestart, which is the step being skipped.
  4. So task_runner.hookResources.Mounts loses its configuration after the Nomad restart, when the client restores allocs.
  5. At that point, restart Docker or kill the CSI container.
  6. Nomad restarts the container using task_runner.hookResources.Mounts, which is now missing "/dev" and "/csi", so the CSI container fails to start (a rough sketch of this flow follows below).
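
The following is a rough, self-contained Go sketch of the flow described above, using simplified, hypothetical types rather than Nomad's actual internals: the mounts are only added in Prestart and live only in memory, so skipping Prestart after a client restart leaves them empty.

package main

import "fmt"

// Mount and hookResources stand in for Nomad's internal types in this sketch.
type Mount struct{ HostPath, TaskPath string }

type hookResources struct{ Mounts []Mount }

type supervisorHook struct{}

// Prestart is the only place the supervisor hook registers its mounts; they
// are in-memory state, not part of the hook state persisted to local state.
func (h *supervisorHook) Prestart(res *hookResources) {
	res.Mounts = append(res.Mounts,
		Mount{HostPath: "/dev", TaskPath: "/dev"},
		Mount{HostPath: "/host/plugin/dir", TaskPath: "/csi"}, // hypothetical host path
	)
}

func startTask(res *hookResources, hook *supervisorHook, prestartDoneInLocalState bool) {
	// The task runner skips prestart hooks whose "done" state was persisted.
	if !prestartDoneInLocalState {
		hook.Prestart(res)
	}
	fmt.Println("starting container with mounts:", res.Mounts)
}

func main() {
	// Fresh client: Prestart runs and the container gets /dev and /csi.
	fresh := &hookResources{}
	startTask(fresh, &supervisorHook{}, false)

	// After a client restart: local state says Prestart already ran, so it is
	// skipped, and the restored in-memory resources have no mounts at all.
	restored := &hookResources{}
	startTask(restored, &supervisorHook{}, true)
}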

Reproduction steps

  1. Deploy a CSI job whose task sets a restart block like:
restart {
	interval = "2s"
	attempts = 2
	delay    = "1s"
	mode     = "fail"
}
  2. Restart the Nomad client.
  3. Kill the CSI container or restart Docker.

Expected Result

The CSI container starts with the "/csi" and "/dev" mounts.

Actual Result

The container fails to start and the alloc log shows:
F0422 05:28:57.588532 1 server.go:81] Failed to listen: listen unix /csi/csi.sock: bind: no such file or directory

Job file (if appropriate)

job "hostpath" {
    name = "hostpath"
    namespace = "hostpath"    
    type = "service"
    datacenters = [
        "not-assign",
    ]
    group "hostpath" {
        constraint {
            attribute = "${node.unique.id}"
            operator  = "="
            value     = "c218c3b4-f6e5-9f15-52c1-15a1541f6460"
        }
        task "container" {
            driver = "docker"
            config {
              image = "quay.io/k8scsi/hostpathplugin:v1.2.0"
              args = [
                "--drivername=csi-hostpath",
                "--v=5",
                "--endpoint=unix://csi/csi.sock",
                "--nodeid=node-${NOMAD_ALLOC_INDEX}",
              ]
              privileged = true
              force_pull = true
            }
            csi_plugin {
              id        = "hostpath-plugin0"
              type      = "monolith" #node" # doesn't support Controller RPCs
              mount_dir = "/csi"
            }
            restart {
                interval = "2s"
                attempts = 2
                delay    = "1s"
                mode     = "fail"
            }
        }
    }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@tgross tgross self-assigned this Apr 22, 2022

tgross commented Apr 22, 2022

Hi @kvkang!

I was able to reproduce what you're seeing. What makes this a little interesting is that there are two variants of the problem I could create: one where the plugin task was killed while the client was offline, and one where the plugin task was killed after the client came back from being offline.

If any task (not just CSI plugins) stops while the Nomad client is stopped, the Nomad client will be unable to restore the task handle and will have to restart the task. But in the case of a CSI plugin we can't come back from that restart.

If I kill a plugin task without having restarted the client, everything works as expected:

Recent Events:
Time                       Type                   Description
2022-04-22T09:42:51-04:00  Started                Task started by client
2022-04-22T09:42:50-04:00  Restarting             Task restarting in 15.846198796s
2022-04-22T09:42:50-04:00  Terminated             Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:42:06-04:00  Plugin became healthy  plugin: org.democratic-csi.nfs
2022-04-22T09:41:56-04:00  Started                Task started by client
2022-04-22T09:41:56-04:00  Task Setup             Building Task Directory
2022-04-22T09:41:56-04:00  Received               Task received by client

But if I restart the client and wait for it to come back, then kill the plugin task again, we get the error you reported:

Recent Events:
Time                       Type                     Description
2022-04-22T09:44:05-04:00  Plugin became unhealthy  Error: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/nomad/data/client/csi/plugins/74c8f452-3f3e-8d8f-87ba-546c01df9edf/csi.sock: connect: connection refused"
2022-04-22T09:43:51-04:00  Restarting               Task restarting in 18.692336688s
2022-04-22T09:43:51-04:00  Terminated               Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-04-22T09:43:50-04:00  Started                  Task started by client
2022-04-22T09:43:49-04:00  Restarting               Task restarting in 15.691112908s
2022-04-22T09:43:49-04:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:43:35-04:00  Plugin became healthy    plugin: org.democratic-csi.nfs
2022-04-22T09:42:51-04:00  Started                  Task started by client
2022-04-22T09:42:50-04:00  Restarting               Task restarting in 15.846198796s
2022-04-22T09:42:50-04:00  Terminated               Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"

If we don't wait for the client to come back and kill the plugin task while the client is offline, we get a crash loop but with slightly different error messages:

Recent Events:
Time                       Type        Description
2022-04-22T09:30:05-04:00  Restarting  Task restarting in 15.954528487s
2022-04-22T09:30:05-04:00  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"
2022-04-22T09:30:05-04:00  Started     Task started by client
2022-04-22T09:29:44-04:00  Restarting  Task restarting in 18.622212124s
2022-04-22T09:29:44-04:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-04-22T09:28:07-04:00  Started     Task started by client
2022-04-22T09:28:07-04:00  Task Setup  Building Task Directory
2022-04-22T09:28:06-04:00  Received    Task received by client

The plugin error I get is similar:

{"level":"info","message":"starting csi server - name: org.democratic-csi.nfs, version: 1.6.1, driver: nfs-client, mode: controller, csi version: 1.5.0, address: , socket: unix:///csi/csi.sock","service":"democratic-csi"}
Error: No address added out of total 1 resolved
at bindResultPromise.then.errorString (/home/csi/app/node_modules/@grpc/grpc-js/build/src/server.js:415:42)
at processTicksAndRejections (node:internal/process/task_queues:96:5)

In both cases, eventually, after we've gone through the restart.attempts, the allocation will get rescheduled, and then it'll be starting from scratch and work just fine. But we definitely don't want to have to go back to the server for a reschedule.

During normal plugin restarts, we'd expect skipping the plugin_supervisor_hook.Prestart method to be fine, but in this case we're missing the in-memory state the taskrunner has, just as you've said.
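
To make that concrete, here is a minimal Go sketch of one possible fix direction, with hypothetical names; it is not necessarily how #12752 implements it. The idea is to let a hook declare that its Prestart must run on every task start, so hooks that only populate in-memory state are never skipped based on persisted "done" markers.

package main

import "fmt"

type prestartHook interface {
	Name() string
	Prestart() error
}

// alwaysRun is a hypothetical marker interface a hook could implement to opt
// out of the "skip if already done" optimization.
type alwaysRun interface{ AlwaysRunPrestart() bool }

type csiSupervisorHook struct{}

func (h *csiSupervisorHook) Name() string            { return "csi_plugin_supervisor" }
func (h *csiSupervisorHook) AlwaysRunPrestart() bool { return true }

func (h *csiSupervisorHook) Prestart() error {
	fmt.Println("re-registering /dev and /csi mounts in memory")
	return nil
}

func runPrestart(h prestartHook, doneInLocalState bool) error {
	ar, ok := h.(alwaysRun)
	if doneInLocalState && !(ok && ar.AlwaysRunPrestart()) {
		fmt.Printf("skipping done prestart hook %q\n", h.Name())
		return nil
	}
	return h.Prestart()
}

func main() {
	// Even though local state says the hook already ran before the client
	// restart, it runs again, so the in-memory mounts get rebuilt.
	_ = runPrestart(&csiSupervisorHook{}, true)
}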

This is definitely a bug, and it's not fixed by the work we've done for CSI in Nomad 1.3.0-beta.1. I'll dig into this further and report back here shortly.

@tgross tgross changed the title CSI: skip csi_plugin_supervisor cause alloc restart fail after restart nomad CSI: client restarts cause later plugin restarts to fail Apr 22, 2022
@tgross tgross added this to the 1.3.0 milestone Apr 22, 2022

tgross commented Apr 22, 2022

Will be fixed in #12752 which will ship in Nomad 1.3.0 GA, and backported to 1.2.x and 1.1.x.


tgross commented Apr 22, 2022

That's been merged and will ship in the upcoming release. Thanks for opening this issue @kvkang!


zizon commented Apr 24, 2022

@tgross Could Poststart of csiPluginSupervisorHook be leaking a goroutine for ensureSupervisorLoop each time it runs?


tgross commented Apr 25, 2022

Hi @zizon! The context in ensureSupervisorLoop is the shutdownCtx, which gets closed in Stop. It's a long-running goroutine but shouldn't be a leak. If you have analysis to the contrary, please open a new issue with the details.
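
For reference, here is a small Go sketch of the lifecycle being described, using simplified, hypothetical names rather than Nomad's actual code: the supervisor loop goroutine watches a shutdown context that is cancelled in Stop, so it is long-running rather than leaked.

package main

import (
	"context"
	"fmt"
	"time"
)

type supervisorHook struct {
	shutdownCtx    context.Context
	shutdownCancel context.CancelFunc
}

func newSupervisorHook() *supervisorHook {
	ctx, cancel := context.WithCancel(context.Background())
	return &supervisorHook{shutdownCtx: ctx, shutdownCancel: cancel}
}

// Poststart launches the supervisor loop goroutine tied to the shutdown context.
func (h *supervisorHook) Poststart() {
	go h.ensureSupervisorLoop(h.shutdownCtx)
}

// ensureSupervisorLoop runs until the shutdown context is cancelled.
func (h *supervisorHook) ensureSupervisorLoop(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			fmt.Println("supervisor loop exiting")
			return
		case <-ticker.C:
			// probe the plugin socket, update health, etc.
		}
	}
}

// Stop cancels the shutdown context, ending the supervisor loop goroutine.
func (h *supervisorHook) Stop() {
	h.shutdownCancel()
}

func main() {
	h := newSupervisorHook()
	h.Poststart()
	time.Sleep(300 * time.Millisecond)
	h.Stop()
	time.Sleep(100 * time.Millisecond) // give the loop time to print and exit
}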


zizon commented Apr 25, 2022

Hi @zizon! The context in ensureSupervisorLoop is the shutdownCtx, which gets closed in Stop. It's a long-running goroutine but shouldn't be a leak. If you have analysis to the contrary, please open a new issue with the details.

continue on #12772


github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022