client: fix gc deadlock when ar.prerun errors #5861

schmichael · 2019-06-20T05:37:32Z

Repro by making ar.prerun return an error. Without this patch ar.killTasks (which is called by the client GC) blocks forever.

notnoop · 2019-06-27T04:36:23Z

I can take over this one! This is closely related to #5890. One issue I noticed and added a test for is that if ar.runTasks() isn't called, shutting down the client blocks forever in Shutdown(), as it waits on the taskRunner waitCh.

When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are:

* Lockup in GC process as reported in #5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time!

Here, we treat all tasks as failed if the alloc runner prestart hook fails. This fixes the lockups and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner in such failures, I opted to treat it as out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for #5840.

notnoop · 2019-06-29T15:19:19Z

I took an alternative approach to this problem in #5905. The issue has a more significant impact than perceived here.


client: fix gc deadlock when ar.prerun errors

661b759

notnoop mentioned this pull request Jun 29, 2019

Fail alloc if alloc runner prestart hooks fail #5905

Merged

schmichael closed this Jul 2, 2019

schmichael deleted the b-gc-deadlock branch January 25, 2023 01:15
