CSI: allocrunner w/ volumes fails to restore in csi_hook after client restart #11477

Closed
BlizzTom opened this issue Nov 9, 2021 · 4 comments · Fixed by #12113

Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/storage, type/bug

BlizzTom commented Nov 9, 2021

Nomad version

Nomad v1.1.6 (b83d623fb5ff475d5e40df21e9e7a61834071078)

The issue is also present in versions 1.1.2 through 1.2.0 Beta.

Operating system and Environment details

Linux <hostname> 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Issue

When the Nomad client is restarted without draining the node while a task has a volume mounted through a CSI plugin, Nomad fails to restore the allocation but leaves the task process running.

Reproduction steps

Use a CSI plugin to mount a volume to a task
Restart the nomad process without draining the node

Expected Result

Allocation is restored

Actual Result

The allocation is marked as failed, but the process keeps running and the volume remains mounted.

Nomad Client logs (if appropriate)

2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=csi
2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=driver
2021-11-08T20:51:21.310Z [INFO]  client.plugin: starting plugin manager: plugin-type=device
2021-11-08T20:51:21.338Z [WARN]  client.server_mgr: no servers available
2021-11-08T20:51:21.352Z [INFO]  client: started client: node_id=cb1b133f-724b-31b5-a4a2-226dcb11811e
2021-11-08T20:51:21.353Z [INFO]  client.gc: marking allocation for GC: alloc_id=40aa9f7a-fc98-f038-7e33-2778a00cf3b9
2021-11-08T20:51:21.355Z [WARN]  client.server_mgr: no servers available
2021-11-08T20:51:21.355Z [ERROR] client.alloc_runner: prerun failed: alloc_id=af3e5bb3-d229-1a3a-083d-f47304e30cf8 error="pre-run hook "csi_hook" failed: claim volumes: no servers"
2021-11-08T20:51:21.356Z [INFO]  agent.joiner: starting retry join: servers=nomad.service.cloud-insight.dmz.discovery.blizzard.net
2021-11-08T20:51:21.357Z [WARN]  client.server_mgr: no servers available
2021/11/08 20:51:21.359986 [INFO] (runner) creating new runner (dry: false, once: false)
2021/11/08 20:51:21.360571 [INFO] (runner) creating watcher
2021/11/08 20:51:21.360776 [INFO] (runner) starting
2021-11-08T20:51:21.362Z [INFO]  client.gc: marking allocation for GC: alloc_id=af3e5bb3-d229-1a3a-083d-f47304e30cf8
2021-11-08T20:51:21.385Z [INFO]  agent.joiner: retry join completed: initial_servers=1 agent_mode=client
2021-11-08T20:51:21.637Z [INFO]  client: node registration complete

Possibly related to #10833

Specifically, it appears that the csi_hook prerun needs the retry join to have completed before it can make the CSIVolume.Claim RPC call. However, the goroutines for the retry join and for restoring allocations race with each other, so the hook can run while no servers are known yet (the "no servers" error in the log above).

tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 9, 2021

tgross (Member) commented Nov 9, 2021

Hi @BlizzTom! This does seem to be related to #10833, but I wouldn't have expected to see it in the case where the client has simply restarted and not been marked lost.

tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021

tgross (Member) commented Feb 3, 2022

Following up on this because #10833 has been closed out: on further review it's pretty clear we should be handling the case where the servers are disconnected more safely. The changes in #11892 will partially help here. But we'll also need the upcoming work on disconnected client handling anyway. I'll be looking into this as part of other plugin work going on over the next few weeks.

tgross changed the title from "Nomad CSI Zombies Allocations on restart" to "CSI: allocrunner fails to restore after client restart" Feb 3, 2022
tgross changed the title from "CSI: allocrunner fails to restore after client restart" to "CSI: allocrunner w/ volumes fails to restore in csi_hook after client restart" Feb 3, 2022
tgross added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label Feb 3, 2022
tgross self-assigned this Feb 17, 2022
tgross (Member) commented Feb 23, 2022

Will be fixed by #12113, expected to ship in Nomad 1.3.0

github-actions bot locked this issue as resolved and limited conversation to collaborators Oct 11, 2022