
CSI Plugin Task Should restore first #12265

Closed
zizon opened this issue Mar 11, 2022 · 11 comments

Comments


zizon commented Mar 11, 2022

Nomad version

Nomad v1.2.6 (a6c6b47)

Operating system and Environment details

Ubuntu 16.04.6 LTS

Issue

  1. A Nomad client node running a task that uses CSI volumes is restarted.
  2. The CSI monolith plugin is itself deployed as a Nomad job, with its allocation on the same restarting client.
  3. An allocation that uses a CSI volume is restored first (before the CSI plugin allocation) and claims the volume it should attach.
  4. But the CSI plugin allocation has not been restored yet, or its ensureSupervisorLoop() is still in progress and has not yet triggered registration.
  5. The task in step 3 fails, and the container it was running is no longer managed by Nomad, so it becomes dangling.
  6. The resources held by such a container are never released yet are no longer tracked by Nomad, which misleads the scheduler when it makes allocation decisions.

Reproduction steps

Expected Result

The CSI plugin should be registered and ready before any other tasks/allocs are restored.

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@zizon zizon added the type/bug label Mar 11, 2022
@tgross tgross self-assigned this Mar 11, 2022

tgross commented Mar 14, 2022

Hi @zizon! Just for clarity, when you're talking about the client node restart, do you mean the Nomad client process or the entire Nomad client host?

In any case, we can't have the CSI plugin task get restored first without a fairly radical re-architecture of the client, but what we can do is make sure that the csi_hook that runs for the allocation that's claiming a volume can gracefully handle this case and retry prerun steps it needs. I think #12113 actually gets us what we need here (that's planned to ship in Nomad 1.3.0), but I can try to verify that it solves the problem you're talking about.
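
To sketch the direction (hypothetical names only; this is not the #12113 implementation): the prerun claim could keep retrying with backoff for as long as the only problem is that the plugin hasn't registered yet.

```go
import (
	"context"
	"errors"
	"time"
)

// errPluginNotReady stands in for "the CSI plugin has not registered yet".
var errPluginNotReady = errors.New("csi plugin not registered")

// claimWithBackoff retries claimFn with exponential backoff for as long as the
// only problem is a not-yet-ready plugin, and gives up on any other error.
func claimWithBackoff(ctx context.Context, claimFn func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = 30 * time.Second

	for {
		err := claimFn(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errPluginNotReady) {
			return err // a real failure should still abort prerun
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}
```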

  1. The task in step 3 fails, and the container it was running is no longer managed by Nomad, so it becomes dangling.
  2. The resources held by such a container are never released yet are no longer tracked by Nomad, which misleads the scheduler when it makes allocation decisions.

This is a little more concerning. When you say "dangling", the container is left running even though it failed to restore?


zizon commented Mar 15, 2022 via email


zizon commented Mar 15, 2022

Short of a re-architecture, delaying failed allocs and retrying them a second time might just work.

In https://github.com/hashicorp/nomad/blob/v1.2.6/client/client.go#L1118, could we collect the failed allocs and retry them a second time in a follow-up loop?
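
Something roughly like this (a hypothetical sketch; `restore` and the other names stand in for the existing per-alloc restore logic):

```go
// Hypothetical two-pass restore: allocs whose restore fails on the first pass
// (for example because their CSI plugin alloc isn't up yet) are collected and
// retried once the first pass has restored everything it could.
func restoreWithRetry(allocIDs []string, restore func(id string) error) {
	var failed []string
	for _, id := range allocIDs {
		if err := restore(id); err != nil {
			failed = append(failed, id) // defer instead of failing permanently
		}
	}

	// Second pass: by now the plugin alloc restored above has had a chance
	// to start and register.
	for _, id := range failed {
		if err := restore(id); err != nil {
			// Still failing: hand off to the existing failure handling.
		}
	}
}
```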


zizon commented Mar 15, 2022

There is another possible issue during client restart.

In claimWithRetry in the csi_hook (https://github.com/hashicorp/nomad/blob/main/client/allocrunner/csi_hook.go#L259)

It can fail to contact the servers if

  1. the client has not yet joined/discovered the servers, or
  2. there is a network issue,
    which then exhausts the retries and causes prerun to fail.


tgross commented Mar 15, 2022

  1. The restore fails because of an unsuccessful CSI hook (https://github.com/hashicorp/nomad/blob/v1.2.6/client/allocrunner/alloc_runner.go#L321).
  2. If I understand correctly, it thus skips running the tasks at line 333, which would otherwise do the bookkeeping.

If we return any error in prerun, we skip ahead to postrun, which is where we do all the cleanup bookkeeping. So we should be ok there. I'll verify that we handle that case correctly though, as it's possible that something in postrun ends up breaking in that code path.
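
In outline, the control flow being described looks roughly like this (hypothetical names, not the actual alloc_runner code):

```go
import "context"

// Shape of the control flow described above: a prerun failure skips running
// the tasks and goes straight to the postrun cleanup bookkeeping.
func runAlloc(ctx context.Context, prerun, runTasks, postrun func(context.Context) error) error {
	if err := prerun(ctx); err != nil {
		// prerun failed (e.g. the CSI claim exhausted its retries): don't run
		// the tasks, but still do the cleanup bookkeeping in postrun.
		_ = postrun(ctx)
		return err
	}

	if err := runTasks(ctx); err != nil {
		_ = postrun(ctx)
		return err
	}
	return postrun(ctx)
}
```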

In claimWithRetry in the csi_hook (https://github.com/hashicorp/nomad/blob/main/client/allocrunner/csi_hook.go#L259)

It can fail to contact the servers if

the client has not yet joined/discovered the servers
or there is a network issue
which then exhausts the retries and causes prerun to fail

claimWithRetry checks the isRetryableClaimRPCError function, which accounts for the case where there are no servers or no leader. That code hasn't shipped in a release yet. It'll be in Nomad 1.3.0.
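
As a simplified stand-in for that kind of check (not the actual isRetryableClaimRPCError implementation, and the error strings here are assumptions), the idea is that transient "can't reach the control plane" errors are classified as retryable so the claim keeps being retried instead of failing prerun outright:

```go
import (
	"context"
	"errors"
	"strings"
)

// isRetryable is a simplified, illustrative classifier: transient
// "can't reach the control plane" errors are treated as retryable.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true // the RPC timed out; worth retrying
	}
	msg := err.Error()
	return strings.Contains(msg, "no servers") ||
		strings.Contains(msg, "no cluster leader")
}
```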


zizon commented Mar 16, 2022

I kept some details from this case; I hope it helps.

Below is a container whose restore failed after the client restart.
Note the metrics port 8987.
[screenshot]

And this is the container still running on the host, at the bottom: 910d04.
[screenshot]

The associated alloc ID:
[screenshot]

And the relevant alloc logs from that client after the restart.
Note that it marked the alloc for GC after the failed restore, but the container keeps running, which prevents the cpuset cgroup from being removed (device busy) and thus affects cgroup reconciliation.
[screenshot]


zizon commented Mar 16, 2022

#11477 is a similar issue.


tgross commented Mar 16, 2022

Below is a container whose restore failed after the client restart

This is with a build from main?


zizon commented Mar 16, 2022 via email


tgross commented Mar 16, 2022

Ok great! That case is covered with the new isRetryableClaimRPCError function. That code hasn't shipped in a release yet. It'll be in Nomad 1.3.0 and will get backported to 1.2.x when it does. Thanks!

@tgross tgross closed this as completed Mar 16, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022