
Don't treat a failed recover + successful destroy as a successful recover #10849

Merged: 2 commits into hashicorp:main, Jul 19, 2021

Conversation

@benbuzbee (Contributor) commented on Jul 3, 2021

This code just seems incorrect. As it stands today, it reports a successful restore if RecoverTask fails and then DestroyTask succeeds.

This can result in a really annoying bug where it then calls RecoverTask again, which will probably get ErrTaskNotFound and call DestroyTask once more.

I think the only reason this has not been noticed so far is that most drivers, like Docker, will return success; Nomad will then call RecoverTask, get an error (not found), call DestroyTask again, and get an ErrTaskNotFound error.

@benbuzbee changed the title from "Don't treat a failed recover + successful destroy as a successful" to "Don't treat a failed recover + successful destroy as a successful recover" on Jul 3, 2021
@benbuzbee (Contributor, Author)

There is a comment in this file that makes me hesitate and think this is by design:
"// Try to cleanup any existing task state in the plugin before restarting"

But I do not understand how this can possibly work. It will just keep calling RecoverTask, which will fail, then DestroyTask, until it gets an error from DestroyTask. Why would you try to recover a task that you have successfully destroyed?

@tgross (Member) commented on Jul 6, 2021

@benbuzbee I'm not sure what you mean by "It's just going to keep calling RecoverTask".

The Restore call, which calls tr.restoreHandle, which in turn calls RecoverTask, happens only once, when the allocation runner is started back up after a client restart. It attempts to restore a handle to an already-running process so that when the Nomad client restarts it can rebuild its internal state with the running processes.

So we call RecoverTask, and if we get any error except "hey, the task isn't running!" we tell Restore that we didn't recover a task, so we're going to wait until we get an update from the server to see what to do next. That's what I'd expect to see (although why we return false twice instead of falling through and doing it once is probably just a style quirk of whoever wrote it originally).
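
To make that concrete, here is a toy model of the behavior I'd expect; this is purely illustrative and not the actual task runner code, and ErrTaskNotFound here stands in for drivers.ErrTaskNotFound:

package main

import (
	"errors"
	"fmt"
)

// Stand-in for drivers.ErrTaskNotFound in this toy model.
var ErrTaskNotFound = errors.New("task not found")

// restoreHandle models the decision described above: it runs once per task
// on client restart and should report success only when the handle was
// reattached or there is nothing left to recover.
func restoreHandle(recoverErr error) (success bool) {
	if recoverErr == nil {
		// Reattached to the already-running task.
		return true
	}
	if errors.Is(recoverErr, ErrTaskNotFound) {
		// The task isn't running; nothing to restore.
		return true
	}
	// Any other recovery error: report failure so the caller waits for an
	// update from the server before deciding what to do next.
	return false
}

func main() {
	fmt.Println(restoreHandle(nil))                          // true
	fmt.Println(restoreHandle(ErrTaskNotFound))              // true
	fmt.Println(restoreHandle(errors.New("recover failed"))) // false
}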

Can I ask what behavior you're seeing that you're trying to fix here?

@benbuzbee (Contributor, Author) commented on Jul 6, 2021

> Can I ask what behavior you're seeing that you're trying to fix here?

Here's a simple example: imagine an external driver plugin that always destroys tasks successfully but can never recover them.

// Example driver used to reproduce the bug: recovery always fails,
// destroy always succeeds.
func (d *Driver) RecoverTask(handle *drivers.TaskHandle) (err error) {
	return errors.New("Test - Recover Task Fails")
}

func (d *Driver) DestroyTask(taskID string, force bool) error {
	return nil
}

This emulates a real-world scenario where recovery cannot happen but destroying succeeds.

Run a single job with that plugin, then emulate an external plugin crash/panic:

sudo pkill -f "/plugins/" --signal SIGKILL

As a result, Nomad will try to recover/destroy in a loop, seemingly forever:

$ cat out.log | grep "error recovering task" | wc -l
1873
    2021-07-06T15:44:54.473Z [DEBUG] client.driver_mgr: plugin process exited: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber pid=360180 error="signal: killed"
    2021-07-06T15:44:54.473Z [ERROR] client.driver_mgr: error receiving stream from TaskStats driver RPC, closing stream: driver=cloudchamber error="rpc error: code = Unavailable desc = transport is closing"
    2021-07-06T15:44:54.474Z [WARN]  client.driver_mgr: failed to reattach to plugin, starting new instance: driver=cloudchamber err="singleton plugin exited"
    2021-07-06T15:44:54.474Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.474Z [WARN]  client.driver_mgr: received fingerprint error from driver: driver=cloudchamber error="plugin is shut down"
    2021-07-06T15:44:54.474Z [DEBUG] client.driver_mgr: starting plugin: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber args=[/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber]
    2021-07-06T15:44:54.475Z [DEBUG] client.driver_mgr: plugin started: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber pid=361504
    2021-07-06T15:44:54.475Z [DEBUG] client.driver_mgr: waiting for RPC address: driver=cloudchamber path=/home/vagrant/core/src/third_party/cloudchamber-nomad-driver/plugins/cloudchamber
    2021-07-06T15:44:54.521Z [DEBUG] client.driver_mgr: using plugin: driver=cloudchamber version=2
    2021-07-06T15:44:54.521Z [DEBUG] client.driver_mgr.cloudchamber: plugin address: driver=cloudchamber address=/tmp/plugin664592638 network=unix timestamp=2021-07-06T15:44:54.521Z
    2021-07-06T15:44:54.549Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.550Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.550Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.555Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.556Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.556Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.561Z [INFO]  client.driver_mgr.cloudchamber: Recovering task: driver=cloudchamber version=2021.6.2-17-gfae93d4-dev driver_pid=361504 task_name=server2 driver_timestamp=2021-07-06T15:44:54.561Z module=cloudchamber build_time=2021-07-06-15:40UTC task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213 alloc_id=63139173-756d-808c-89ed-fb11c595ca76 timestamp=0001-01-01T00:00:00.000Z
    2021-07-06T15:44:54.561Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 error="rpc error: code = Unknown desc = Test - Recover Task Fails" task_id=63139173-756d-808c-89ed-fb11c595ca76/server2/d21d0213
    2021-07-06T15:44:54.561Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber
    2021-07-06T15:44:54.562Z [DEBUG] client.alloc_runner.task_runner: driver plugin has shutdown; attempting to recover task: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber

Two interesting things from the logs:

> driver plugin has shutdown; attempting to recover task

This isn't true; the driver does not shut down after the first kill. I also know this because our own logs include the process ID, and I can see that the process ID stays the same across each RecoverTask and DestroyTask call.

> 2021-07-06T15:44:54.561Z [DEBUG] client.alloc_runner.task_runner: task successfully recovered on driver: alloc_id=63139173-756d-808c-89ed-fb11c595ca76 task=server2 driver=cloudchamber

This makes no sense; recovery did not succeed, since it is hard-coded to fail.

> (although why we return false twice instead of falling through and doing it once is probably just a style quirk of whoever wrote it originally)

I think you might have misunderstood the PR: the original code does NOT return false twice; it returns TRUE where (I believe) it should not.

In pseudocode, the current code looks like this:

func RestoreHandle() bool {
   if RecoverTask fails {
       if DestroyTask fails {
          return err != drivers.ErrTaskNotFound
       }
       return true // <-- reports success even though RecoverTask failed
   }
   return true
}

What I think is the bug here is the "return true" that happens INSIDE "if RecoverTask fails". I'm guessing that's the root of the log line that didn't make sense to me, "task successfully recovered on driver".
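
In the same pseudocode, the change in this PR boils down to the following; this is a sketch of the intent rather than the literal diff, and the existing error handling and logging around DestroyTask stays as it is:

func RestoreHandle() bool {
   if RecoverTask fails {
       DestroyTask(...) // still attempt cleanup and log any errors as before
       return false     // but never report this as a successful recover
   }
   return true
}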

@tgross (Member) commented on Jul 6, 2021

> I think you might have misunderstood the PR: the original code does NOT return false twice; it returns TRUE where (I believe) it should not.

Oops, no, I got that... just a typo from rewriting my reply too many times 😀 Thank you for that detailed example!

I did a little exploring of the original code introduced in 740ca8e, and what's interesting there is that we have the comment "Ignore ErrTaskNotFound errors" but no means to actually do so, because the function had no return value at that time. The return value was introduced a few weeks later in 709b8e1, and it looks a heck of a lot like that return false was just misplaced and should only be on the "Ignore ErrTaskNotFound errors" path.

@schmichael I want to ping you on this too, just because you were involved in the original, but this PR LGTM.

@tgross requested a review from schmichael on Jul 6, 2021
@tgross added this to the 1.1.3 milestone on Jul 6, 2021
@notnoop (Contributor) left a review comment

The change LGTM as well, but I defer to @schmichael in case of missing context.

> driver plugin has shutdown; attempting to recover task

> This isn't true; the driver does not shut down after the first kill.

This is troubling and we should fix it. I wonder if it's related to #10494. What version are you running?

As a meta comment for the ongoing discussion: the success boolean in restoreHandle is either ill-defined or at least doesn't capture the information well. There are multiple outcomes: "successfully reattached", "failed to reattach because the process died and no longer exists", "failed to reattach but succeeded in destroying the process", and "failed to reattach, and the process got leaked and is no longer managed". These failures should be handled and messaged to users differently, and a boolean indicator isn't sufficient.
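
For illustration, the kind of shape I have in mind is something like the sketch below; this is not part of this PR, and all names here are hypothetical:

// RecoverOutcome is a hypothetical replacement for the success boolean,
// capturing the distinct results of trying to reattach after a restart.
type RecoverOutcome int

const (
	// Reattached: RecoverTask succeeded and the handle is valid again.
	Reattached RecoverOutcome = iota
	// NothingToRecover: the task no longer exists (ErrTaskNotFound),
	// which is expected when it already stopped.
	NothingToRecover
	// CleanedUp: RecoverTask failed but DestroyTask succeeded, so the
	// plugin no longer holds any state for the task.
	CleanedUp
	// Leaked: both RecoverTask and DestroyTask failed; the process may
	// still be running but is no longer managed.
	Leaked
)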

@schmichael (Member) left a review comment
https://github.com/hashicorp/nomad/pull/10849/files#diff-709b8e134a3b248f36e0ef29af11594669d901111c9e9340f867e4dbde1e8c9bR1115-R1117

is a similarly weird line of code. I'm unsure if it's never hit and therefore doesn't matter, or if it's actually doing the right thing, or if it's doing the wrong thing but in a way that is rare or hard to notice.

Regardless, @notnoop is absolutely right that (success bool) is at best a misleadingly named return value. It seems more likely there are a few states (restored, nothing to restore, something to restore but it's invalid) represented here, collapsed into two (true|false).

On agent startup this change will cause waitOnServers=true if RecoverTask fails for a non-terminal allocation, which I think is fine. I'm not sure it's necessary, but it feels like the conservative approach, which I'm all for given the complexity of this logic.

Blah blah blah, all that to say I'm +1 on this PR. Thanks for your patience and determination @benbuzbee! If we find further complications around this logic, it will be because this func's results were nonsensical and just happened to be working. I'd rather keep moving the code forward than reject the PR and force plugin authors to live with this incoherent behavior.

@schmichael (Member)

I pushed a changelog entry and will merge on Monday if there are no complaints. I tried to keep it succinct, which unfortunately makes it awfully vague, but the audience (task driver plugin authors) is relatively small and probably feels comfortable hopping over to this PR and joining in our shared 🤔

@schmichael merged commit 179e88c into hashicorp:main on Jul 19, 2021
@notnoop mentioned this pull request on Jul 28, 2021
notnoop pushed a commit that referenced this pull request Jul 28, 2021
Don't treat a failed recover + successful destroy as a successful recover
notnoop pushed a commit that referenced this pull request Jul 29, 2021
Don't treat a failed recover + successful destroy as a successful recover
benbuzbee pushed a commit to benbuzbee/nomad that referenced this pull request Oct 6, 2021
Don't treat a failed recover + successful destroy as a successful recover
@github-actions (bot)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Oct 20, 2022