
CSI Plugin Task Should restore first #12265

Closed
zizon opened this issue Mar 11, 2022 · 11 comments

Comments


zizon commented Mar 11, 2022

Nomad version

Nomad v1.2.6 (a6c6b47)

Operating system and Environment details

Ubuntu 16.04.6 LTS

Issue

  1. A Nomad client node running a task that uses CSI volumes is restarted.
  2. The CSI monolith plugin is itself deployed as a Nomad job, with its allocation on the same restarting client.
  3. An allocation that uses a CSI volume is restored first (before the CSI plugin allocation) and claims the volume it should attach.
  4. But the CSI plugin allocation has not been restored yet, or its ensureSupervisorLoop() is still in progress and has not yet triggered registration.
  5. The task in step 3 fails, and the container it was running is no longer managed by Nomad, so it becomes dangling.
  6. The resources held by such a container are never released yet are no longer tracked by Nomad, which misleads the scheduler when it makes allocation decisions.

Reproduction steps

Expected Result

The CSI plugin should be registered and ready before any other tasks/allocs are restored.

Actual Result

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@zizon zizon added the type/bug label Mar 11, 2022
@tgross tgross self-assigned this Mar 11, 2022

tgross commented Mar 14, 2022

Hi @zizon! Just for clarity, when you're talking about the client node restart, do you mean the Nomad client process or the entire Nomad client host?

In any case, we can't have the CSI plugin task get restored first without a fairly radical re-architecture of the client, but what we can do is make sure that the csi_hook that runs for the allocation that's claiming a volume can gracefully handle this case and retry prerun steps it needs. I think #12113 actually gets us what we need here (that's planned to ship in Nomad 1.3.0), but I can try to verify that it solves the problem you're talking about.
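
To sketch the direction (hypothetical names only; this is not the #12113 implementation): the prerun claim could keep retrying with backoff for as long as the only problem is that the plugin hasn't registered yet.

```go
import (
	"context"
	"errors"
	"time"
)

// errPluginNotReady stands in for "the CSI plugin has not registered yet".
var errPluginNotReady = errors.New("csi plugin not registered")

// claimWithBackoff retries claimFn with exponential backoff for as long as the
// only problem is a not-yet-ready plugin, and gives up on any other error.
func claimWithBackoff(ctx context.Context, claimFn func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = 30 * time.Second

	for {
		err := claimFn(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errPluginNotReady) {
			return err // a real failure should still abort prerun
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}
```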

  1. The task in step 3 fails, and the container it was running is no longer managed by Nomad, so it becomes dangling.
  2. The resources held by such a container are never released yet are no longer tracked by Nomad, which misleads the scheduler when it makes allocation decisions.

This is a little more concerning. When you say "dangling", the container is left running even though it failed to restore?


zizon commented Mar 15, 2022 via email


zizon commented Mar 15, 2022

Short of a re-architecture, delaying failed allocs and retrying them a second time might just work.

In https://github.com/hashicorp/nomad/blob/v1.2.6/client/client.go#L1118, could we collect the failed allocs and retry them a second time in a follow-up loop?
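
Something roughly like this (a hypothetical sketch; `restore` and the other names stand in for the existing per-alloc restore logic):

```go
// Hypothetical two-pass restore: allocs whose restore fails on the first pass
// (for example because their CSI plugin alloc isn't up yet) are collected and
// retried once the first pass has restored everything it could.
func restoreWithRetry(allocIDs []string, restore func(id string) error) {
	var failed []string
	for _, id := range allocIDs {
		if err := restore(id); err != nil {
			failed = append(failed, id) // defer instead of failing permanently
		}
	}

	// Second pass: by now the plugin alloc restored above has had a chance
	// to start and register.
	for _, id := range failed {
		if err := restore(id); err != nil {
			// Still failing: hand off to the existing failure handling.
		}
	}
}
```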


zizon commented Mar 15, 2022

There is another possible issue during client restart.

In claimWithRetry in the csi_hook (https://github.com/hashicorp/nomad/blob/main/client/allocrunner/csi_hook.go#L259)

It can fail to contact the servers if

  1. the client has not yet joined/discovered the servers, or
  2. there is a network issue,
    which then exhausts the retries and causes prerun to fail.


tgross commented Mar 15, 2022

  1. The restore fails because of an unsuccessful CSI hook (https://github.com/hashicorp/nomad/blob/v1.2.6/client/allocrunner/alloc_runner.go#L321).
  2. If I understand correctly, it thus skips running the tasks at line 333, which would otherwise do the bookkeeping.

If we return any error in prerun, we skip ahead to postrun, which is where we do all the cleanup bookkeeping. So we should be ok there. I'll verify that we handle that case correctly though, as it's possible that something in postrun ends up breaking in that code path.
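
In outline, the control flow being described looks roughly like this (hypothetical names, not the actual alloc_runner code):

```go
import "context"

// Shape of the control flow described above: a prerun failure skips running
// the tasks and goes straight to the postrun cleanup bookkeeping.
func runAlloc(ctx context.Context, prerun, runTasks, postrun func(context.Context) error) error {
	if err := prerun(ctx); err != nil {
		// prerun failed (e.g. the CSI claim exhausted its retries): don't run
		// the tasks, but still do the cleanup bookkeeping in postrun.
		_ = postrun(ctx)
		return err
	}

	if err := runTasks(ctx); err != nil {
		_ = postrun(ctx)
		return err
	}
	return postrun(ctx)
}
```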

In claimWithRetry in the csi_hook (https://github.com/hashicorp/nomad/blob/main/client/allocrunner/csi_hook.go#L259)

It can fail to contact the servers if

the client has not yet joined/discovered the servers
or there is a network issue
which then exhausts the retries and causes prerun to fail

claimWithRetry checks the isRetryableClaimRPCError function, which accounts for the case where there are no servers or no leader. That code hasn't shipped in a release yet. It'll be in Nomad 1.3.0.
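
As a simplified stand-in for that kind of check (not the actual isRetryableClaimRPCError implementation, and the error strings here are assumptions), the idea is that transient "can't reach the control plane" errors are classified as retryable so the claim keeps being retried instead of failing prerun outright:

```go
import (
	"context"
	"errors"
	"strings"
)

// isRetryable is a simplified, illustrative classifier: transient
// "can't reach the control plane" errors are treated as retryable.
func isRetryable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return true // the RPC timed out; worth retrying
	}
	msg := err.Error()
	return strings.Contains(msg, "no servers") ||
		strings.Contains(msg, "no cluster leader")
}
```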


zizon commented Mar 16, 2022

I kept some details from this case; I hope it helps.

Below is a container whose restore failed after the client restart.
Note the metrics port 8987.
[screenshot]

And this is the container still running on the host, at the bottom: 910d04.
[screenshot]

The associated alloc ID:
[screenshot]

And the relevant alloc logs from that client after the restart.
Note that it marked the alloc for GC after the failed restore, but the container keeps running, which prevents the cpuset cgroup from being removed (device busy) and thus affects cgroup reconciliation.
[screenshot]


zizon commented Mar 16, 2022

#11477 is a similar issue.


tgross commented Mar 16, 2022

Below is a container whose restore failed after the client restart

This is with a build from main?


zizon commented Mar 16, 2022 via email


tgross commented Mar 16, 2022

Ok great! That case is covered with the new isRetryableClaimRPCError function. That code hasn't shipped in a release yet. It'll be in Nomad 1.3.0 and will get backported to 1.2.x when it does. Thanks!

@tgross tgross closed this as completed Mar 16, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2022