Backport of gracefully recover tasks that use csi node plugins into release/1.4.x #16847

hc-github-team-nomad-core
Contributor

Backport

This PR is auto-generated from #16809 to be assessed for backporting due to the inclusion of the label backport/1.4.x.

The below text is copied from the body of the original PR.


When the client restarts, healthy running tasks that use a CSI volume may fail to be recovered and instead be killed and rescheduled, even though a healthy CSI plugin is running right next to them.

The simplest version of the error condition I could replicate was using this repo's demo/csi/hostpath example. With the tasks all set up, I stop the client and start it again:

2023-04-05T22:10:33.940Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.948Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:10:33.959Z [ERROR] client.alloc_runner: prerun failed: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 error="pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found"
2023-04-05T22:10:33.964Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279 task=redis type="Setup Failure" msg="failed to setup alloc: pre-run hook \"csi_hook\" failed: plugin hostpath-plugin0 for type csi-node not found" failed=true
2023-04-05T22:10:33.974Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d013ee08-10fc-ff0e-ef5e-cc444048fbfd task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:10:36.973Z [INFO]  client.gc: marking allocation for GC: alloc_id=0480ac51-dd38-49ec-8831-48fd3c674279

The client doesn't know yet that the CSI plugin task is healthy, because it happens to try to recover the task that needs it first.

After this change, the csi_hook retries for some time (one minute; is that ok?) to find the plugin that it expects to exist:

2023-04-05T22:18:03.805Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 task=redis type=Received msg="Task received by client" failed=false
2023-04-05T22:18:03.819Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type=Received msg="Task received by client" failed=false
2023-04-05T22:18:04.116Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=d6f262f3-b20d-89d2-1bdd-6f92a01c2003 task=plugin type="Plugin became healthy" msg="plugin: hostpath-plugin0" failed=false
2023-04-05T22:18:04.262Z [INFO]  client.alloc_runner.runner_hook.csi_hook: found CSI plugin: alloc_id=3f38e8b5-106b-87c7-9df4-b5815fa955d5 type=csi-node name=hostpath-plugin0

We may wish to test with a more complex CSI setup to make sure that this fully Fixes #13028

@hc-github-team-nomad-core hc-github-team-nomad-core force-pushed the backport/fix-13028-csi-volumes-on-client-start/directly-dynamic-cicada branch 2 times, most recently from 507856c to a58efc2 Compare April 11, 2023 17:25
@gulducat
Member

Closing this PR as the change has already been applied in commit b3085e1

@gulducat gulducat closed this Apr 11, 2023
@gulducat gulducat deleted the backport/fix-13028-csi-volumes-on-client-start/directly-dynamic-cicada branch April 11, 2023 19:17
sundbry pushed a commit to arctype-co/nomad that referenced this pull request Nov 9, 2023
…elease/1.4.x (hashicorp#16847)

This pull request was automerged via backport-assistant