Backport of gracefully recover tasks that use csi node plugins into release/1.5.x #16848

hc-github-team-nomad-core

Backport

This PR is auto-generated from #16809 to be assessed for backporting due to the inclusion of the label backport/1.5.x.

The below text is copied from the body of the original PR.


When a client restarts, healthy running tasks that use a CSI volume may fail to be recovered and instead be killed and rescheduled, even though a healthy CSI plugin is running right next to them.

The simplest version of the error condition I could replicate was using this repo's demo/csi/hostpath example. With the tasks all set up, I stop the client and start it again:

2023-04-05T22:10:33.940Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.959Z [ERROR] client.alloc_runner: prerun failed: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 error="pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found"
2023-04-05T22:10:33.964Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type="Setup Failure" msg="failed to setup alloc: pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found" failed=true
2023-04-05T22:10:33.974Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:10:36.973Z [INFO]  client.gc: marking allocation for GC: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279

The client doesn't know yet that the CSI plugin task is healthy, because it happens to try to recover the task that needs it first.

After this change, the csi_hook retries for some time (one minute; is that ok?) to find the plugin that it expects to exist:

2023-04-05T22:18:03.805Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:18:03.819Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:18:04.116Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:18:04.262Z [INFO]  client.alloc_runner.runner_hook.csi_hook: found CSI plugin: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 type=csi-node name=hostpath-plugin0

We may wish to test with a more complex CSI setup to make sure that this fully fixes #13028.

@hc-github-team-nomad-core hc-github-team-nomad-core merged commit ace9705 into release/1.5.x Apr 11, 2023
@hc-github-team-nomad-core hc-github-team-nomad-core deleted the backport/fix-13028-csi-volumes-on-client-start/rarely-divine-grizzly branch April 11, 2023 17:25