CSI Plugin Task Should restore first #12265
Comments
Hi @zizon! Just for clarity, when you're talking about the client node restart, do you mean the Nomad client process or the entire Nomad client host? In any case, we can't have the CSI plugin task get restored first without a fairly radical re-architecture of the client, but what we can do is make sure that the
This is a little more concerning. When you say "dangling", the container is left running even though it failed to restore?
On Mon, Mar 14, 2022 at 11:28 PM Tim Gross ***@***.***> wrote:
Hi @zizon <https://github.com/zizon>! Just for clarity, when you're
talking about the client node restart, do you mean the Nomad client process
or the entire Nomad client host?
The client process, e.g. systemctl restart nomad
In any case, we can't have the CSI plugin task get restored first without
a fairly radical re-architecture of the client, but what we *can* do is
make sure that the csi_hook that runs for the allocation that's claiming
a volume can gracefully handle this case and retry prerun steps it needs. I
think #12113 actually
gets us what we need here (that's planned to ship in Nomad 1.3.0), but I
can try to verify that it solves the problem you're talking about.
1. the task in 3 will fail and the container it is running will not be
managed by Nomad anymore, thus becoming dangling.
2. Since the resources held by such a container are not released yet are
no longer tracked by Nomad, they will mislead the scheduler's allocation
decisions.
This is a little more concerning. When you say "dangling", the container
is left running even though it failed to restore?
Yep, the container is still running during the nomad client process restart.
1. The restore fails because of an unsuccessful CSI hook (https://github.com/hashicorp/nomad/blob/v1.2.6/client/allocrunner/alloc_runner.go#L321).
2. If I understand correctly, it thus skips runtasks at line 333, which would
otherwise do the bookkeeping.
Besides a re-architecture, a delayed second retry for failed allocs may just work. In https://github.com/hashicorp/nomad/blob/v1.2.6/client/client.go#L1118 , could the client collect the allocs that failed to restore and retry them a second time in a following for loop?
There is another possible issue during client restart. In claimWithRetry of the csiHook (https://github.com/hashicorp/nomad/blob/main/client/allocrunner/csi_hook.go#L259), it can fail to contact the servers if
If we return any error in prerun, we skip ahead to postrun, which is where we do all the cleanup bookkeeping. So we should be ok there. I'll verify that we handle that case correctly though, as it's possible that something in postrun ends up breaking in that code path.
The |
#11477 is a similar issue.
This is with a build from main?
no, it is from apt v1.2.6
…On Wed, Mar 16, 2022 at 9:13 PM Tim Gross ***@***.***> wrote:
below shows a failed restore container after client restart
This is with a build from main?
Ok great! That case is covered with the new isRetryableClaimRPCError function. That code hasn't shipped in a release yet. It'll be in Nomad 1.3.0 and will get backported to 1.2.x when it does. Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v1.2.6 (a6c6b47)
Operating system and Environment details
Ubuntu 16.04.6 LTS
Issue
Reproduction steps
Expected Result
The CSI plugin should be registered and ready before any other tasks/allocs are restored
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)