Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add guards around subtracting summary count #4949

Merged
merged 1 commit into from
Dec 4, 2018
Merged

Conversation

preetapan
Copy link
Member

This PR adds a check such that job summary counts won't be set to negative numbers.

@notnoop
Copy link
Contributor

notnoop commented Dec 3, 2018

if we go with this approach, i'd suggest logging so we can be in the look out and have more debugging information if it happens again.

@schmichael
Copy link
Member

schmichael commented Dec 3, 2018

Would detecting values <0 and doing a full recount of alloc statuses be too expensive? It seems likely that if we can go below 0 we may be inflating counts as well. Recalculating might also make for useful debugging information:

if anyNegative {
  newSummary := s.reconcileSummary(jobSummary)
  s.logger.Warn("invalid job summary counts; recalculated", "invalid", jobSummary, "recalculated", newSummary)
  jobSummary = newSummary
}

(If we suspect bugs in this code and recalculating is cheap we could even recalculate randomly -- say 1 out of every 1000 updateSummary calls -- and log discrepancies.)

@jippi
Copy link
Contributor

jippi commented Dec 4, 2018

in my experience, the job summary count is pretty rarely accurate without triggering a reconcileSummary across all versions of nomad

@preetapan preetapan merged commit e5d125e into master Dec 4, 2018
@tantra35
Copy link
Contributor

tantra35 commented Dec 11, 2018

IMHO this fix is incorrect, and doesn't solve problems when job summary disappear when server leader moves to another node, see last comments in issue #4731

@tantra35
Copy link
Contributor

@preetapan your colleague @schmichael absolutely right, this PR solve issue only on half, and as i mention before doesn't solve problems when leader server moves to another node. Please take account on this

@preetapan
Copy link
Member Author

@tantra35 yes we are aware that this fix was only fixing negative summaries in the CLI display and we are tracking finding the root cause in our internal bug tracking system.

#4731 is about unnecessary periodic jobs and is related but a different issue. This PR was addressing reports of negative summaries in batch jobs.

@notnoop notnoop deleted the b-neg-running-summary branch January 19, 2019 02:06
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants