
Don't treat a failed recover + successful destroy as a successful recover #10849

Merged: 2 commits into hashicorp:main, Jul 19, 2021

Conversation

@benbuzbee (Contributor) commented on Jul 3, 2021

This code just seems incorrect. As it stands today, it reports a successful restore if RecoverTask fails and then DestroyTask succeeds.

This can result in a really annoying bug where it then calls RecoverTask again, which will probably get ErrTaskNotFound and call DestroyTask once more.

I think the only reason this has not been noticed so far is that most drivers, like Docker, will return success; Nomad will then call RecoverTask, get an error (not found), call DestroyTask again, and get an ErrTaskNotFound error.

@benbuzbee changed the title from "Don't treat a failed recover + successful destroy as a successful" to "Don't treat a failed recover + successful destroy as a successful recover" on Jul 3, 2021
@benbuzbee (Contributor, Author)

There is a comment in this file that makes me hesitate and think this is by design:
"// Try to cleanup any existing task state in the plugin before restarting"

But I do not understand how this can possibly work. It will just keep calling RecoverTask, which will fail, then DestroyTask, until it gets an error from DestroyTask. Why would you try to recover a task that you have successfully destroyed?

@tgross (Member) commented on Jul 6, 2021

@benbuzbee I'm not sure what you mean by "It's just going to keep calling RecoverTask".

The Restore call, which calls tr.restoreHandle, which in turn calls RecoverTask, happens only once, when the allocation runner is started back up after a client restart. It attempts to restore a handle to an already-running process so that when the Nomad client restarts it can rebuild its internal state with the running processes.

So we call RecoverTask, and if we get any error except "hey, the task isn't running!" we tell Restore that we didn't recover a task, so we're going to wait until we get an update from the server to see what to do next. That's what I'd expect to see (although why we return false twice instead of falling through and doing it once is probably just a style quirk of whoever wrote it originally).
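
To make that concrete, here is a toy model of the behavior I'd expect; this is purely illustrative and not the actual task runner code, and ErrTaskNotFound here stands in for drivers.ErrTaskNotFound:

package main

import (
	"errors"
	"fmt"
)

// Stand-in for drivers.ErrTaskNotFound in this toy model.
var ErrTaskNotFound = errors.New("task not found")

// restoreHandle models the decision described above: it runs once per task
// on client restart and should report success only when the handle was
// reattached or there is nothing left to recover.
func restoreHandle(recoverErr error) (success bool) {
	if recoverErr == nil {
		// Reattached to the already-running task.
		return true
	}
	if errors.Is(recoverErr, ErrTaskNotFound) {
		// The task isn't running; nothing to restore.
		return true
	}
	// Any other recovery error: report failure so the caller waits for an
	// update from the server before deciding what to do next.
	return false
}

func main() {
	fmt.Println(restoreHandle(nil))                          // true
	fmt.Println(restoreHandle(ErrTaskNotFound))              // true
	fmt.Println(restoreHandle(errors.New("recover failed"))) // false
}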

Can I ask what behavior you're seeing that you're trying to fix here?

@benbuzbee (Contributor, Author) commented on Jul 6, 2021

> Can I ask what behavior you're seeing that you're trying to fix here?

Here's a simple example: imagine an external driver plugin that always destroys tasks successfully but can never recover them.

// Example driver used to reproduce the bug: recovery always fails,
// destroy always succeeds.
func (d *Driver) RecoverTask(handle *drivers.TaskHandle) (err error) {
	return errors.New("Test - Recover Task Fails")
}

func (d *Driver) DestroyTask(taskID string, force bool) error {
	return nil
}

This emulates a real-world scenario where recovery cannot happen but destroying succeeds.

Run a single job with that plugin, then emulate an external plugin crash/panic:

sudo pkill -f "/plugins/" --signal SIGKILL

As a result, Nomad will try to recover/destroy in a loop, seemingly forever:

$ cat out.log | grep "error recovering task" | wc -l
1873
    2021-07-06T15:44:54.473Z [DEBUG] client.driver_mgr: plugin process exited: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber pid=360180 error="signal: killed"
    2021-07-06T15:44:54.473Z [ERROR] client.driver_mgr: error receiving stream from TaskStats driver RPC, closing stream: driver=cloudchamber error="rpc error: code = Unavailable desc = transport is closing"
    2021-07-06T15:44:54.474Z [WARN]  client.driver_mgr: failed to reattach to plugin, starting new instance: driver=cloudchamber err="singleton plugin exited"
    2021-07-06T15:44:54.474Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.474Z [WARN]  client.driver_mgr: received fingerprint error from driver: driver=cloudchamber error="plugin is shut down"
    2021-07-06T15:44:54.474Z [DEBUG] client.driver_mgr: starting plugin: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber args=[/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber]
    2021-07-06T15:44:54.475Z [DEBUG] client.driver_mgr: plugin started: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber pid=361504
    2021-07-06T15:44:54.475Z [DEBUG] client.driver_mgr: waiting for RPC address: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber
    2021-07-06T15:44:54.521Z [DEBUG] client.driver_mgr: using plugin: driver=cloudchamber version=2
    2021-07-06T15:44:54.521Z [DEBUG] client.driver_mgr.cloudchamber: plugin address: driver=cloudchamber address=/tmp/plugin664592638 network=unix timestamp=2021-07-06T15:44:54.521Z
    2021-07-06T15:44:54.549Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.550Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.550Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.555Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.556Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.556Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.561Z [INFO]  client.driver_mgr.cloudchamber: Recovering task: driver=cloudchamber version=2021.6.2-17-gfae93d4-dev driver_pid=361504 task_name=server2 driver_timestamp=2021-07-06T15:44:54.561Z module=cloudchamber build_time=2021-07-06-15:40UTC task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213 alloc_id=63139173-756d-808c-89ed-fb11c595ca76 timestamp=0001-01-01T00:00:00.000Z
    2021-07-06T15:44:54.561Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.561Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.562Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber

Two interesting things from the logs:

> driver plugin has shutdown; attempting to recover task

This isn't true; the driver does not shut down after the first kill. I also know this because our own logs include the process ID, and I can see that the process ID stays the same across each RecoverTask and DestroyTask call.

> 2021-07-06T15:44:54.561Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber

This makes no sense; recovery did not succeed, since it is hard-coded to fail.

> (although why we return false twice instead of falling through and doing it once is probably just a style quirk of whoever wrote it originally)

I think you might have misunderstood the PR: the original code does NOT return false twice; it returns TRUE where (I believe) it should not.

In pseudocode, the current code looks like this:

func RestoreHandle() bool {
   if RecoverTask fails {
       if DestroyTask fails {
          return err != drivers.ErrTaskNotFound
       }
       return true // <-- reports success even though RecoverTask failed
   }
   return true
}

What I think is the bug here is the "return true" that happens INSIDE "if RecoverTask fails". I'm guessing that's the root of the log line that didn't make sense to me, "task successfully recovered on driver".
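
In the same pseudocode, the change in this PR boils down to the following; this is a sketch of the intent rather than the literal diff, and the existing error handling and logging around DestroyTask stays as it is:

func RestoreHandle() bool {
   if RecoverTask fails {
       DestroyTask(...) // still attempt cleanup and log any errors as before
       return false     // but never report this as a successful recover
   }
   return true
}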

@tgross (Member) commented on Jul 6, 2021

> I think you might have misunderstood the PR: the original code does NOT return false twice; it returns TRUE where (I believe) it should not.

Oops, no, I got that... just a typo from rewriting my reply too many times 😀 Thank you for that detailed example!

I did a little exploring of the original code introduced in 740ca8e, and what's interesting there is that we have the comment "Ignore ErrTaskNotFound errors" but no means to actually do so, because the function had no return value at that time. The return value was introduced a few weeks later in 709b8e1, and it looks a heck of a lot like that return false was just misplaced and should only be on the "Ignore ErrTaskNotFound errors" path.

@schmichael I want to ping you on this too, just because you were involved in the original, but this PR LGTM.

@tgross requested a review from schmichael on Jul 6, 2021
@tgross added this to the 1.1.3 milestone on Jul 6, 2021
@notnoop (Contributor) left a review comment

The change LGTM as well, but I defer to @schmichael in case of missing context.

> driver plugin has shutdown; attempting to recover task

> This isn't true; the driver does not shut down after the first kill.

This is troubling and we should fix it. I wonder if it's related to #10494. What version are you running?

As a meta comment for the ongoing discussion: the success boolean in restoreHandle is either ill-defined or at least doesn't capture the information well. There are multiple outcomes: "successfully reattached", "failed to reattach because the process died and no longer exists", "failed to reattach but succeeded in destroying the process", and "failed to reattach, and the process got leaked and is no longer managed". These failures should be handled and messaged to users differently, and a boolean indicator isn't sufficient.
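
For illustration, the kind of shape I have in mind is something like the sketch below; this is not part of this PR, and all names here are hypothetical:

// RecoverOutcome is a hypothetical replacement for the success boolean,
// capturing the distinct results of trying to reattach after a restart.
type RecoverOutcome int

const (
	// Reattached: RecoverTask succeeded and the handle is valid again.
	Reattached RecoverOutcome = iota
	// NothingToRecover: the task no longer exists (ErrTaskNotFound),
	// which is expected when it already stopped.
	NothingToRecover
	// CleanedUp: RecoverTask failed but DestroyTask succeeded, so the
	// plugin no longer holds any state for the task.
	CleanedUp
	// Leaked: both RecoverTask and DestroyTask failed; the process may
	// still be running but is no longer managed.
	Leaked
)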

@schmichael (Member) left a review comment
https://github.com/hashicorp/nomad/pull/10849/files#diff-709b8e134a3b248f36e0ef29af11594669d901111c9e9340f867e4dbde1e8c9bR1115-R1117

is a similarly weird line of code. I'm unsure if it's never hit and therefore doesn't matter, or if it's actually doing the right thing, or if it's doing the wrong thing but in a way that is rare or hard to notice.

Regardless, @notnoop is absolutely right that (success bool) is at best a misleadingly named return value. It seems more likely there are a few states (restored, nothing to restore, something to restore but it's invalid) represented here, collapsed into two (true|false).

On agent startup this change will cause waitOnServers=true if RecoverTask fails for a non-terminal allocation, which I think is fine. I'm not sure it's necessary, but it feels like the conservative approach, which I'm all for given the complexity of this logic.

Blah blah blah, all that to say I'm +1 on this PR. Thanks for your patience and determination @benbuzbee! If we find further complications around this logic, it will be because this func's results were nonsensical and just happened to be working. I'd rather keep moving the code forward than reject the PR and force plugin authors to live with this incoherent behavior.

@schmichael (Member)

I pushed a changelog entry and will merge on Monday if there are no complaints. I tried to keep it succinct, which unfortunately makes it awfully vague, but the audience (task driver plugin authors) is relatively small and probably feels comfortable hopping over to this PR and joining in our shared 🤔

@schmichael merged commit 179e88c into hashicorp:main on Jul 19, 2021
@notnoop mentioned this pull request on Jul 28, 2021
notnoop pushed a commit that referenced this pull request Jul 28, 2021
Don't treat a failed recover + successful destroy as a successful recover
notnoop pushed a commit that referenced this pull request Jul 29, 2021
Don't treat a failed recover + successful destroy as a successful recover
benbuzbee pushed a commit to benbuzbee/nomad that referenced this pull request Oct 6, 2021
Don't treat a failed recover + successful destroy as a successful recover
@github-actions (bot)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Oct 20, 2022