task runner to avoid running task if terminal #5890

notnoop · 2019-06-26T15:50:01Z

This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having the task runner use
allocRunner.shouldRun() instead of only checking the server-updated
alloc.

We preserve most of the invariants, such that tr.Run() is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes #5883

notnoop · 2019-06-26T15:53:31Z

client/allocrunner/taskrunner/task_runner.go

+	dead := tr.state.State == structs.TaskStateDead
+	tr.stateLock.RUnlock()
+
+	if dead {


Here I only check whether the task itself is dead. I suspect we should be checking whether the restored alloc had a terminated alloc state. Would an alloc whose tasks have mixed statuses cause some complications?

Actually, this is the right behavior. An alloc is still considered running when one task completes, and all allocs will be killed if the leader task dies or a task fails enough times. Until that happens, we should treat the other tasks as running.
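To make the per-task check concrete, here is a minimal sketch modeled on the diff fragments in this PR. It is not Nomad's real TaskRunner: the `taskRunner` type, the `TaskState` constants, and the `started`/`stopped` flags are hypothetical stand-ins, and setting `stopped` stands in for the `tr.stop()` call. The point is only that `Run()` is still invoked for every restored task, and a task restored as dead returns early while its siblings run normally.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskState is a hypothetical stand-in for Nomad's task state values.
type TaskState string

const (
	TaskStateRunning TaskState = "running"
	TaskStateDead    TaskState = "dead"
)

// taskRunner is a simplified model for illustration, not Nomad's TaskRunner.
type taskRunner struct {
	stateLock sync.RWMutex
	state     TaskState
	started   bool // set when the task body would actually be launched
	stopped   bool // set when cleanup ran for an already-dead task
}

// Run is always called, but returns early when the restored task is dead.
func (tr *taskRunner) Run() {
	// Read the restored state under the read lock, then release it
	// before doing any further work.
	tr.stateLock.RLock()
	dead := tr.state == TaskStateDead
	tr.stateLock.RUnlock()

	if dead {
		tr.stopped = true // stands in for tr.stop() in the diff
		return
	}
	tr.started = true
}

func main() {
	// One alloc with mixed task states: only the live task is started.
	restored := []*taskRunner{{state: TaskStateDead}, {state: TaskStateRunning}}
	for i, tr := range restored {
		tr.Run()
		fmt.Printf("task %d: started=%v stopped=%v\n", i, tr.started, tr.stopped)
	}
}
```

Checking each task on its own, rather than the alloc as a whole, matches the reasoning above: the other tasks keep running until the alloc runner decides to kill them.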

schmichael

Great catch!

schmichael · 2019-07-01T16:03:19Z

client/allocrunner/alloc_runner_unix_test.go

+
+// TestAllocRunner_Restore_Completed asserts that restoring a completed
+// batch alloc doesn't run it again
+func TestAllocRunner_Restore_CompletedBatch(t *testing.T) {


Name/comment mismatch

Test looks good, but just to verify:

Does it fail without your fixes?

Does it pass with -race?

Yes, it passes with -race and was failing before. Here is a sample build failure [1] from adding the test alone [2]. The failure snippet is:

goroutine 87 [chan receive, 14 minutes]:
github.com/hashicorp/nomad/client/allocrunner.destroy(0xc000342780)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_test.go:27 +0x54
runtime.Goexit()
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/runtime/panic.go:406 +0xed
testing.(*common).FailNow(0xc000449b00)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:609 +0x39
github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require.Fail(0x18348e0, 0xc000449b00, 0x15fc0e0, 0x1a, 0x0, 0x0, 0x0)
	/home/travis/gopath/src/github.com/hashicorp/nomad/vendor/github.com/stretchr/testify/require/require.go:285 +0xf0
github.com/hashicorp/nomad/client/allocrunner.TestAllocRunner_Restore_CompletedBatch(0xc000449b00)
	/home/travis/gopath/src/github.com/hashicorp/nomad/client/allocrunner/alloc_runner_unix_test.go:204 +0xb22
testing.tRunner(0xc000449b00, 0x1639ae0)
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/home/travis/.gimme/versions/go1.12.6.linux.amd64/src/testing/testing.go:916 +0x35a

As seen in the stack trace, the test fails at line 204 [3] because AR.wait() times out, and then it hangs again in the deferred destroy call.

I'll follow up in another PR to change the destroy defer call so that it errors rather than blocks indefinitely on failures to make tracking these errors better.

[1] https://travis-ci.org/hashicorp/nomad/jobs/553113545
[2] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1
[3] https://github.com/hashicorp/nomad/compare/b-dont-start-completed-allocs-2-test-only?expand=1#diff-41decefd2f35059b5c0b95166e275653R204

schmichael · 2019-07-01T22:35:53Z

client/allocrunner/taskrunner/task_runner.go

+		if err := tr.stop(); err != nil {
+			tr.logger.Error("stop failed on terminal task", "error", err)
+		}
+		return


I think we may also want to call tr.TaskStateUpdated() since task states are persisted before the AR is notified. Therefore I think the following could happen:

1. 2 tasks in an alloc start: a leader service, and a sidecar
2. Leader task exits, persists TaskStateDead
3. agent crashes before TaskStateUpdated is called
4. agent restarts, returns here due to TaskStateDead

At this point I do not think anything will have told the sidecar service to exit despite its leader dying. If you call TaskStateUpdated here, then all of the leader died detection logic in AR will be run: https://github.com/hashicorp/nomad/blob/master/client/allocrunner/alloc_runner.go#L415-L438

This could be done in a followup PR as well since I think your changes improve the situation.
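The scenario above can be sketched with a small, single-goroutine model. Everything here is hypothetical and only mirrors the names used in this review: `allocRunner`, `task`, `taskStateUpdated`, and `restoreTask` are illustrative, and the real leader detection in the linked alloc_runner.go may differ in detail. The sketch only shows why notifying the alloc runner from the early-return path matters.

```go
package main

import "fmt"

// task is a simplified model of one task in an alloc.
type task struct {
	name   string
	leader bool
	dead   bool
	killed bool // set when the alloc runner killed the task
}

// allocRunner is a hypothetical model, not Nomad's real alloc runner.
type allocRunner struct {
	tasks []*task
}

// taskStateUpdated models the leader check: if a leader task is dead,
// every task that is still alive gets killed.
func (ar *allocRunner) taskStateUpdated() {
	leaderDead := false
	for _, t := range ar.tasks {
		if t.leader && t.dead {
			leaderDead = true
		}
	}
	if !leaderDead {
		return
	}
	for _, t := range ar.tasks {
		if !t.dead {
			t.killed = true
			t.dead = true
		}
	}
}

// restoreTask models the early return for a task restored as dead.
// notify controls whether the alloc runner is told about the state.
func (ar *allocRunner) restoreTask(t *task, notify bool) {
	if t.dead {
		if notify {
			ar.taskStateUpdated()
		}
		return
	}
}

func main() {
	for _, notify := range []bool{false, true} {
		leader := &task{name: "leader", leader: true, dead: true}
		sidecar := &task{name: "sidecar"}
		ar := &allocRunner{tasks: []*task{leader, sidecar}}

		ar.restoreTask(leader, notify)
		fmt.Printf("notify=%v sidecar killed=%v\n", notify, sidecar.killed)
	}
}
```

Without the notification the sidecar is left running after its leader died before the crash; with it, the alloc runner's leader check gets a chance to kill the sidecar.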

This fixes a bug where allocs that have been GCed get re-run after the client is restarted. A heavily used client may launch thousands of allocs on startup and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the client's alloc runner set. These alloc runners are periodically persisted until the alloc is GCed by the server. During that time, the client DB contains the alloc but neither its individual task statuses nor its completed state. On client restart, the client assumes that the alloc is in the pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB.

This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information non-transactionally and non-atomically, concurrently with the alloc runner running and potentially changing state, is a recipe for bugs.

Fixes #5984
Related to #5890

github-actions · 2023-02-07T02:15:42Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

notnoop force-pushed the b-dont-start-completed-allocs-2 branch from 8603028 to f42aa1e Compare June 26, 2019 15:51

notnoop force-pushed the b-dont-start-completed-allocs-2 branch from f42aa1e to f3c944a Compare June 27, 2019 03:27

notnoop mentioned this pull request Jun 27, 2019

client: fix gc deadlock when ar.prerun errors #5861

Closed

schmichael approved these changes Jul 1, 2019

notnoop pushed a commit that referenced this pull request Jul 2, 2019

Tests only of GH-5890

e9d2fc1

address review comments

009f186

notnoop merged commit bd7d60e into master Jul 2, 2019

notnoop deleted the b-dont-start-completed-allocs-2 branch July 2, 2019 07:31

This was referenced Jul 9, 2019

Restarting agent reruns successfully completed allocations #5945

Closed

Nomad allows two copies of itself on one machine #5942

Closed

chrisboulton mentioned this pull request Jul 15, 2019

Crash on restart with 0.9.1 #5840

Closed

langmartin mentioned this pull request Jul 19, 2019

Runaway nomad process after Nomad client reboot #5984

Closed

notnoop mentioned this pull request Aug 25, 2019

Don't persist allocs of destroyed alloc runners #6207

Merged

github-actions bot locked as resolved and limited conversation to collaborators Feb 7, 2023
