
periodic: always reset periodic children status #10145

Merged 3 commits into main, Mar 26, 2021

Conversation

@notnoop (Contributor) commented Mar 9, 2021

Fixes a bug where Nomad reports negative or incorrect running children
counts for periodic jobs.

The periodic dispatcher derives a child job without resetting the status.
If the periodic job has a running status, the derived job starts with
running status and then transitions to pending. Since this is an
unexpected transition, the counting in StateStore.setJobSummary gets out
of sync and results in negative/incorrect values.

Note that this only affects periodic jobs after a leader transition.
During the first job registration, the job is added with pending or
"" status. However, after a leader transition, the new leader
repopulates the dispatcher heap with "running" status and triggers the
bug.
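
The core of the fix is to clear the status fields at derivation time. Below is a minimal, self-contained Go sketch of the idea; the Job type, Copy helper, and deriveChildJob function are simplified stand-ins for Nomad's internals, not the actual diff:

package main

import (
	"fmt"
	"time"
)

// Job is a minimal stand-in for nomad/structs.Job, holding only the
// fields relevant to this bug.
type Job struct {
	ID                string
	ParentID          string
	Status            string
	StatusDescription string
}

// Copy returns a shallow copy of the job.
func (j *Job) Copy() *Job {
	c := *j
	return &c
}

// deriveChildJob sketches what the periodic dispatcher does: copy the
// parent job and, crucially, reset the status fields.
func deriveChildJob(parent *Job, t time.Time) *Job {
	child := parent.Copy()
	child.ParentID = parent.ID
	child.ID = fmt.Sprintf("%s/periodic-%d", parent.ID, t.Unix())

	// Without this reset, a parent that is "running" (as after a leader
	// transition) yields a child that starts as "running" and later moves
	// to "pending" -- the unexpected transition that desyncs
	// StateStore.setJobSummary.
	child.Status = ""
	child.StatusDescription = ""
	return child
}

func main() {
	parent := &Job{ID: "horizontal_hey", Status: "running"}
	child := deriveChildJob(parent, time.Now())
	fmt.Printf("child %s starts with status %q\n", child.ID, child.Status)
}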

Alternative implementation

I considered updating the FSM handler so that the job registration event always resets the job status. I'm nervous about a change like that: it would re-interpret already committed log entries, which could cause unintended changes or discrepancies in server state during upgrades that depend on the state of Raft logs and snapshots.

Debugging notes

An internal test cluster encountered this bug. The job horizontal_hey started reporting negative running children counts at index 1609. The following output highlights the issue, and also demonstrates that index 1609 is the first job registration after a leader election (term 3), carrying a "running" status, unlike the empty status in term 2.

$ IDX=1608; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 0,
  "Running": 0,
  "Dead": 12
}
$ IDX=1609; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 1,
  "Running": -1,
  "Dead": 12
}
$ nomad operator raft _logs . \
>   | jq -r '.[] | select(.CommandType == "JobRegisterRequestType")
>                | select(.Body.Job.ParentID == "horizontal_hey")
>                | [.Term, .Index, .Body.Job.Status]
>                | @csv' \
>   | head -n30
2,121,""
2,248,""
2,351,""
2,460,""
2,573,""
2,850,""
2,955,""
2,1063,""
2,1173,""
2,1281,""
2,1395,""
2,1504,""
3,1609,"running"
3,1668,"running"
3,1747,"running"
3,1829,"running"
3,1917,"running"
3,2002,"running"
3,2084,"running"
3,2170,"running"
3,2248,"running"
3,2334,"running"
3,2435,"running"
3,2518,"running"
3,2609,"running"
3,2698,"running"
3,2784,"running"
3,2871,"running"
3,2959,"running"
3,3045,"running"
$ IDX=3045; nomad operator raft _state --last-index $IDX . | jq '.JobSummaries[] | select(.JobID=="horizontal_hey") | .Children'
{
  "Pending": 1,
  "Running": -18,
  "Dead": 29
}
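
The arithmetic behind those negative values is mechanical. The following simplified Go illustration (not Nomad's actual setJobSummary code) shows how transition-based bookkeeping goes negative: a child registered as "running" was never counted in the Running bucket, so its running-to-pending transition decrements a counter that was never incremented, matching the Pending: +1 / Running: -1 delta between indexes 1608 and 1609.

package main

import "fmt"

// Children mirrors the counters in the JobSummaries output above.
type Children struct {
	Pending, Running, Dead int64
}

// transition decrements the bucket for the status a child leaves and
// increments the bucket for the status it enters; "" maps to no bucket.
func (c *Children) transition(from, to string) {
	buckets := map[string]*int64{
		"pending": &c.Pending,
		"running": &c.Running,
		"dead":    &c.Dead,
	}
	if b, ok := buckets[from]; ok {
		*b--
	}
	if b, ok := buckets[to]; ok {
		*b++
	}
}

func main() {
	c := &Children{Dead: 12} // state as of index 1608

	// The buggy child: registered with status "running" (so it was never
	// counted anywhere), then transitioned to "pending".
	c.transition("running", "pending")
	fmt.Printf("%+v\n", *c) // {Pending:1 Running:-1 Dead:12}
}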

@tgross (Member) left a comment

LGTM.

I took a pass through the rest of structs.Job to see if there was anything else we needed to hit. Looks like ModifyIndex and CreateIndex will get handled when we upsert into raft. (Multiregion also gave me pause, but a job that's been registered will already have the "multiregion interpolated" version).


dispatched := m.dispatchedJobs(job)
require.NotEmpty(t, dispatched)
require.Empty(t, dispatched[0].Status)
@tgross (Member) commented on this test:

Up to you, but would verifying that the status eventually does get set correctly be a lot of lift for this test? Just to make sure we're not relying on a side-effect of the bug.

@notnoop (Contributor, Author) replied:

I tried to write an e2e or higher-level integration test, but I fear it would be flaky and slow: it depends on a periodic job triggering and running after a leadership transition.

Alternatively, we could force the conditions by manipulating the job store directly: force the inserted job into the running state, then force a job update. Even with that scaffolding, it's not obvious to me that the resulting test would protect against a regression any better than this one, and it would be slower.
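
For reference, a rough sketch of what that forced-condition test could look like, reusing the m / dispatchedJobs harness from the snippet quoted earlier; p is a hypothetical handle to the periodic dispatcher, and this is illustrative rather than the actual test code:

// Force the condition a new leader sees: the periodic job already has
// status "running" before the dispatcher picks it up.
job.Status = structs.JobStatusRunning
require.NoError(t, p.Add(job))

// Trigger a dispatch and assert the child starts from a clean status
// rather than inheriting "running" from its parent.
dispatched := m.dispatchedJobs(job)
require.NotEmpty(t, dispatched)
require.Empty(t, dispatched[0].Status)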

@tgross (Member) commented Mar 9, 2021

If you rebase on main you'll get the fix for lint-go @notnoop

@jrasell (Member) left a comment

LGTM, thanks for the great info and the quick work!

@tgross (Member) commented Mar 24, 2021

Potentially related issue: #10222

Mahmood Ali added 2 commits March 25, 2021 11:27
@notnoop (Contributor, Author) commented Mar 25, 2021

I've updated the PR to reset the status for dispatch jobs as well. It's another manifestation of a 1.0.3 regression bug, with a failing test added in 032945b; the failure can be seen in https://app.circleci.com/pipelines/github/hashicorp/nomad/15268/workflows/a938267c-e127-49f9-9cef-707ae434d486/jobs/144218 .

Apparently, in some cases we did reset the job status in the setJobStatus handler before, but the logic changed in https://github.com/hashicorp/nomad/pull/9768/files#diff-53c7d524897a192ece819816a9dcf006dee46747fbb853e7f638601ea556a72cL4430-L4432 . Sadly, I'm failing to get the tests to pass if I unilaterally set the job status to "", so I will dig in more.

@notnoop requested a review from tgross, March 25, 2021 23:18
@tgross (Member) left a comment

LGTM. Hitting this bug with dispatch jobs really helps, because we don't need the non-determinism of a leader election to reproduce it.

@@ -4,10 +4,12 @@ job "periodic" {

   constraint {
     attribute = "${attr.kernel.name}"
-    value     = "linux"
+    operator  = "set_contains_any"
+    value     = "darwin,linux"
@tgross (Member) commented on the constraint change:

That's handy, good catch. There's probably a bunch of tests where this is safe to do. Something for a separate PR though.

@notnoop (Contributor, Author) replied:

Interestingly, set_contains_any isn't documented in https://www.nomadproject.io/docs/job-specification/constraint - it's only documented for affinities, so we should update that too.

Also, for future readers: I considered basing the constraint on the Docker OS attribute, but chose not to. Windows is a bit slow, and I didn't want to run these tests on LCOW Windows clients if we add some.

@notnoop merged commit 274e795 into main Mar 26, 2021
@notnoop deleted the b-periodic-init-status branch March 26, 2021 13:19
tgross added a commit that referenced this pull request Mar 30, 2021
* Added missing changelog entry for GH-10145
* Fixed misclassified entry for GH-10158
tgross added a commit that referenced this pull request Mar 30, 2021
* Fixed order
* Added missing changelog entry for GH-10145
* Fixed misclassified entry for GH-10158
@tgross added this to the 1.1.0 milestone Apr 28, 2021
schmichael pushed a commit that referenced this pull request May 14, 2021
periodic: always reset periodic children status
@lgfa29 mentioned this pull request Sep 2, 2021
@github-actions (bot) commented:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022