
[15090] Ensure no leakage of evaluations for batch jobs. #15097

Merged · 12 commits merged into hashicorp:main on Jan 31, 2023

Conversation

stswidwinski
Contributor

@stswidwinski stswidwinski commented Nov 1, 2022

Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case the comparison trivially always results in false (a job's index always equals itself), so re-parenting of allocs is never detected and the evaluations are never garbage collected. This leads to the memory leakage reported here:

#15090
#14842
#4532

The comments within the code and tests are self-contradictory: some state that evals of batch jobs should never be GCed, while others claim they should be GCed once a new job version is created (as determined by comparing the indexes). I believe the latter is the intended behavior.

Thus, we compare the creation index of the allocation with the modification index of the job to determine whether an alloc belongs to the current job version.
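To make that comparison concrete, here is a minimal, self-contained sketch of the check as I understand it. The field names mirror Nomad's structs (CreateIndex on the alloc, JobModifyIndex on the job), but the types are stand-ins and this is an illustration, not the PR's actual code:

```go
package main

import "fmt"

// Minimal stand-ins for the relevant Nomad struct fields (illustrative only).
type Job struct {
	JobModifyIndex uint64 // Raft index of the job's most recent modification
}

type Allocation struct {
	CreateIndex uint64 // Raft index at which the alloc was created
}

// allocIsOldVersion sketches the fixed check: an alloc created before the
// job's latest modification belongs to an older job version, so its eval is
// eligible for GC. The broken code compared a job index to itself, which can
// never differ, so this could never become true.
func allocIsOldVersion(alloc *Allocation, job *Job) bool {
	return alloc.CreateIndex < job.JobModifyIndex
}

func main() {
	job := &Job{JobModifyIndex: 120}
	oldAlloc := &Allocation{CreateIndex: 100}
	curAlloc := &Allocation{CreateIndex: 120}
	fmt.Println(allocIsOldVersion(oldAlloc, job)) // true: predates the latest job version
	fmt.Println(allocIsOldVersion(curAlloc, job)) // false: belongs to the current version
}
```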

This pull request specifically:

  1. Removes references to alloc.Job, which are unsafe because allocs deserialized from MemDB may contain invalid pointers (see the sketch after this list)
  2. Fixes the logical inconsistency described above (compare the alloc's creation index with the job's modification index, not creation with creation)
  3. Fixes the test (the test breaks the assumption of alloc re-parenting, thus testing the mock rather than production code)
  4. Adds more tests
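For item 1, the safe pattern is to resolve the job through the state store by ID rather than trusting the embedded pointer. A toy sketch with simplified, assumed names (not Nomad's actual state store API):

```go
package main

import "fmt"

type Job struct{ ID string }

// StateStore stands in for Nomad's MemDB-backed state snapshot.
type StateStore struct{ jobs map[string]*Job }

func (s *StateStore) JobByID(id string) *Job { return s.jobs[id] }

type Allocation struct {
	JobID string
	Job   *Job // may be an invalid pointer after de/serialization; do not dereference
}

func main() {
	store := &StateStore{jobs: map[string]*Job{"batch-1": {ID: "batch-1"}}}
	alloc := &Allocation{JobID: "batch-1"} // alloc.Job left unset, as it cannot be trusted

	// Safe: look the job up by ID instead of reading alloc.Job.
	if job := store.JobByID(alloc.JobID); job != nil {
		fmt.Println("resolved job:", job.ID)
	}
}
```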

Any and all feedback welcome.

For more details, please see the issues linked. Specifically #15090 describes the issue in detail.

EDIT:
I am assuming here that an update to the job results in an eval, and consequently in allocs for the job. If this is not correct, then we must take into account the ordering of evals/allocs, which has not been done before.

@hashicorp-cla

hashicorp-cla commented Nov 1, 2022

CLA assistant check
All committers have signed the CLA.

@stswidwinski
Contributor Author

cc: @jrasell

@stswidwinski stswidwinski marked this pull request as draft November 2, 2022 18:33
@stswidwinski
Contributor Author

stswidwinski commented Nov 2, 2022

I need to test this a little bit more, but I believe that the current code actually references effectively random memory as a result of dereferencing pointers that are serialized and de-serialized while the underlying object is non-static (that is: relocatable).

In particular, the de-serialization proceeds by using the eval ID as an index (https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L294). The GC logic is otherwise quite careful not to dereference pointers within structs; this is the one exception.

EDIT: It seems that my hypothesis was correct. I have deployed the change to a test cluster and observed that the correct GC timeouts on evals are now applied to batch jobs, and we do not start allocations when we shouldn't. I will modify the fix here and open it up for review tomorrow.

… of allocations that belong to batch jobs. Add Eval GC to batch job GC. Add tests.
@stswidwinski stswidwinski marked this pull request as ready for review November 3, 2022 14:21
@D4GGe

D4GGe commented Nov 29, 2022

Any progress on merging this? We're having huge problems with this in our environment!

@stswidwinski
Contributor Author

stswidwinski commented Nov 29, 2022 via email

@lgfa29 lgfa29 added this to the 1.5.0 milestone Dec 1, 2022
…ows us to control the pace of GC of Batch Evals independent of other evals.
@shantanugadgil
Contributor

Any chance of this making it into the 1.4.x series as well, in addition to the 1.5.0 release?

@tgross
Member

tgross commented Dec 8, 2022

Sorry for the delay on this... Luiz is on a well-deserved holiday so I'm going to pick up the review from here with Seth. We're juggling a couple other things at the moment but this is definitely important so expect we'll follow through soonish. Thanks for your patience.

@shantanugadgil this is a bug fix, so it'll get backported to two major versions (i.e. if we release it in 1.5.0 it'll get backported to 1.4.x and 1.3.x).

@shantanugadgil
Contributor

@tgross Thanks for the update. 🙏 Looking forward to the backported releases of 1.4.x. 👍

Member

@tgross tgross left a comment


Hi @stswidwinski! Thanks for your patience on this. I've given it a review and I've left a few comments to tighten up the code that'll hopefully make it easier for the next folks who read this 😁

But I feel pretty good about this approach overall. I'm getting towards the end of my week here so I'm going to give this one more pass early next week and hopefully we can wrap it up then. Thanks!

Review threads (resolved): nomad/core_sched.go · command/agent/config.go · nomad/core_sched_test.go
Comment on lines 563 to 565
// An EvalGC should reap allocations from jobs with a newer modify index and reap the eval itself
// if all allocs are reaped.
func TestCoreScheduler_EvalGC_Batch_OldVersionReapsEval(t *testing.T) {
Member


These tests have a lot of setup code (as I'm sure you noticed!) and the assertion is only subtly different from the previous one. It might make the whole test more understandable if we collapsed these into a single test and had this one be the final assertion that the eval is GC'd once the conditions are right.

Contributor Author


Well, I have actually spent quite a bit of time unravelling these test structures. The reason making this into a single test is non-trivial is the way the time table is used, and the fact that there is no clock whose reading we can set to a particular time.

In essence, time.Now() is used throughout, which always gives the current time with no ability to set the clock. When we insert time events into the time table, we must do so from oldest to newest for the time table to let us select older events (the lookup logic is: "give me the last event that occurred at or before time X").

As such, I resorted to forcefully resetting the time table for each logical test sharing the setup.
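To illustrate the constraint, here is a toy model of the time table (not Nomad's implementation): a "last event at or before time X" lookup only behaves if events are witnessed oldest to newest, which is why tests sharing setup must reset the table between cases:

```go
package main

import (
	"fmt"
	"time"
)

type event struct {
	index uint64
	when  time.Time
}

type timeTable struct{ events []event }

// witness records an event; correctness below assumes oldest-to-newest order.
func (tt *timeTable) witness(index uint64, when time.Time) {
	tt.events = append(tt.events, event{index, when})
}

// nearestIndex returns the index of the last event at or before t.
func (tt *timeTable) nearestIndex(t time.Time) uint64 {
	var idx uint64
	for _, e := range tt.events {
		if e.when.After(t) {
			break // relies on chronological insertion order
		}
		idx = e.index
	}
	return idx
}

func main() {
	tt := &timeTable{}
	now := time.Now()
	tt.witness(1000, now.Add(-2*time.Hour))
	tt.witness(2000, now.Add(-1*time.Hour))
	fmt.Println(tt.nearestIndex(now.Add(-90 * time.Minute))) // 1000
}
```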

Please see the changes.

@stswidwinski
Contributor Author

stswidwinski commented Dec 9, 2022 via email

@stswidwinski
Contributor Author

I've actually found a bug in the logic: we would apply the regular GC timeout first and thus fail to GC evaluations that should be GCed when BatchEvalGCPeriod < EvalGCPeriod. Fixed. I'll add all of the changes soon.
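A hedged sketch of that ordering issue (hypothetical names; the real logic lives in nomad/core_sched.go): the job-type-specific threshold must be selected before the age comparison, otherwise a shorter batch threshold is shadowed by the regular one:

```go
package main

import (
	"fmt"
	"time"
)

// gcEligible picks the GC threshold by job type *before* testing eval age.
// Applying the regular threshold first would wrongly keep batch evals alive
// whenever BatchEvalGCPeriod < EvalGCPeriod.
func gcEligible(evalType string, age, evalGC, batchEvalGC time.Duration) bool {
	threshold := evalGC
	if evalType == "batch" {
		threshold = batchEvalGC
	}
	return age > threshold
}

func main() {
	// With a 1h batch threshold and a 24h regular threshold, a 2h-old batch
	// eval must be GC-eligible; checking the 24h threshold first would miss it.
	fmt.Println(gcEligible("batch", 2*time.Hour, 24*time.Hour, 1*time.Hour)) // true
}
```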


@stswidwinski
Contributor Author

> I tested the suggestions I made and they do seem to fix the problem. The changes I made are in 0d80be0. Feel free to cherry-pick them if they look correct, but also feel free to fix the test with a different approach 🙂

This sounds great. I cherry-picked the changes on top.

@tgross tgross added the backport/1.2.x, backport/1.3.x, and backport/1.4.x labels Jan 30, 2023
@stswidwinski
Contributor Author

Note: I'll add the changelog entries soon :-)

@stswidwinski stswidwinski requested review from lgfa29 and removed request for tgross January 30, 2023 18:45
Member

@tgross tgross left a comment


I've pulled down this branch and re-ran the bench testing I did previously. LGTM! Thanks for your patience in seeing this one through @stswidwinski!

@stswidwinski
Contributor Author

🎆 Thank you for bearing with me

Contributor

@lgfa29 lgfa29 left a comment


Thanks for this work @stswidwinski!

I pushed a commit to expand the changelog entry a bit more since this will be a breaking change.

For the upgrade note, I will add it in a separate PR since this one will be backported.

@lgfa29
Contributor

lgfa29 commented Jan 31, 2023

Sorry, just noticed a mistake in my changes 😅

I pushed a commit to fix it.

@tgross tgross merged commit 2285432 into hashicorp:main Jan 31, 2023
tgross pushed a commit that referenced this pull request Jan 31, 2023
Prior to 2409f72 the code compared the modification index of a job to
itself. Afterwards, the code compared the creation index of the job to
itself. In either case the comparison trivially always results in false, so
re-parenting of allocs is never detected and the evaluations are never
garbage collected, which leads to unreclaimable memory.

Prior to this change, allocations and evaluations for batch jobs were never
garbage collected until the batch job was explicitly stopped. The new
`batch_eval_gc_threshold` server configuration controls how old they must be
before they become eligible for collection. The default threshold is `24h`.
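Based on that description, the knob would be set in the agent's server block; a sketch (the option name and default come from the commit message above, the surrounding layout is assumed):

```hcl
server {
  enabled = true

  # Minimum age before evals/allocs of batch jobs become eligible for GC.
  # "24h" matches the default stated above.
  batch_eval_gc_threshold = "24h"
}
```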
tgross pushed three further commits (backports) that referenced this pull request Jan 31, 2023, each with the same message as above (two of them co-authored by stswidwinski <stan.swidwinski@gmail.com>).
evenius pushed a commit to onmo-games/nomad that referenced this pull request Apr 25, 2023
Labels
backport/1.2.x · backport/1.3.x · backport/1.4.x

Successfully merging this pull request may close: Nomad server nodes using up all available host memory