
[15090] Ensure no leakage of evaluations for batch jobs. #15097

Merged · 12 commits merged into hashicorp:main on Jan 31, 2023

Conversation

stswidwinski
Contributor

@stswidwinski stswidwinski commented Nov 1, 2022

Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case the comparison trivially always results in false (a job's index always equals itself), so re-parenting of allocs is never detected and the evaluations are never garbage collected. This leads to the memory leakage reported here:

#15090
#14842
#4532

The comments within the code and tests are self-contradictory: some state that evals of batch jobs should never be GCed, while others claim they should be GCed once a new job version is created (as determined by comparing the indexes). I believe the latter is the intended behavior.

Thus, we compare the creation index of the allocation with the modification index of the job to determine whether an alloc belongs to the current job version.
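To make that comparison concrete, here is a minimal, self-contained sketch of the check as I understand it. The field names mirror Nomad's structs (CreateIndex on the alloc, JobModifyIndex on the job), but the types are stand-ins and this is an illustration, not the PR's actual code:

```go
package main

import "fmt"

// Minimal stand-ins for the relevant Nomad struct fields (illustrative only).
type Job struct {
	JobModifyIndex uint64 // Raft index of the job's most recent modification
}

type Allocation struct {
	CreateIndex uint64 // Raft index at which the alloc was created
}

// allocIsOldVersion sketches the fixed check: an alloc created before the
// job's latest modification belongs to an older job version, so its eval is
// eligible for GC. The broken code compared a job index to itself, which can
// never differ, so this could never become true.
func allocIsOldVersion(alloc *Allocation, job *Job) bool {
	return alloc.CreateIndex < job.JobModifyIndex
}

func main() {
	job := &Job{JobModifyIndex: 120}
	oldAlloc := &Allocation{CreateIndex: 100}
	curAlloc := &Allocation{CreateIndex: 120}
	fmt.Println(allocIsOldVersion(oldAlloc, job)) // true: predates the latest job version
	fmt.Println(allocIsOldVersion(curAlloc, job)) // false: belongs to the current version
}
```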

This pull request specifically:

  1. Removes references to alloc.Job, which are unsafe because allocs deserialized from MemDB may contain invalid pointers (see the sketch after this list)
  2. Fixes the logical inconsistency described above (compare the alloc's creation index with the job's modification index, not creation with creation)
  3. Fixes the test (the test breaks the assumption of alloc re-parenting, thus testing the mock rather than production code)
  4. Adds more tests
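For item 1, the safe pattern is to resolve the job through the state store by ID rather than trusting the embedded pointer. A toy sketch with simplified, assumed names (not Nomad's actual state store API):

```go
package main

import "fmt"

type Job struct{ ID string }

// StateStore stands in for Nomad's MemDB-backed state snapshot.
type StateStore struct{ jobs map[string]*Job }

func (s *StateStore) JobByID(id string) *Job { return s.jobs[id] }

type Allocation struct {
	JobID string
	Job   *Job // may be an invalid pointer after de/serialization; do not dereference
}

func main() {
	store := &StateStore{jobs: map[string]*Job{"batch-1": {ID: "batch-1"}}}
	alloc := &Allocation{JobID: "batch-1"} // alloc.Job left unset, as it cannot be trusted

	// Safe: look the job up by ID instead of reading alloc.Job.
	if job := store.JobByID(alloc.JobID); job != nil {
		fmt.Println("resolved job:", job.ID)
	}
}
```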

Any and all feedback welcome.

For more details, please see the issues linked. Specifically #15090 describes the issue in detail.

EDIT:
I am assuming here that an update to the job results in an eval, and consequently in allocs for the job. If this is not correct, then we must take into account the ordering of evals/allocs, which has not been done before.

@hashicorp-cla

hashicorp-cla commented Nov 1, 2022

CLA assistant check
All committers have signed the CLA.

@stswidwinski
Contributor Author

cc: @jrasell

@stswidwinski stswidwinski marked this pull request as draft November 2, 2022 18:33
@stswidwinski
Contributor Author

stswidwinski commented Nov 2, 2022

I need to test this a little bit more, but I believe that the current code actually references effectively random memory as a result of dereferencing pointers that are serialized and de-serialized while the underlying object is non-static (that is: relocatable).

In particular, the de-serialization proceeds by using the eval ID as an index (https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L294). The GC logic is otherwise quite careful not to dereference pointers within structs; this is the one exception.

EDIT: It seems that my hypothesis was correct. I have deployed the change to a test cluster and observed that the correct GC timeouts on evals are now applied to batch jobs, and we do not start allocations when we shouldn't. I will modify the fix here and open it up for review tomorrow.

… of allocations that belong to batch jobs. Add Eval GC to batch job GC. Add tests.
@stswidwinski stswidwinski marked this pull request as ready for review November 3, 2022 14:21
@D4GGe

D4GGe commented Nov 29, 2022

Any progress on merging this? We're having huge problems with this in our environment!

@stswidwinski
Contributor Author

stswidwinski commented Nov 29, 2022 via email

@lgfa29 lgfa29 added this to the 1.5.0 milestone Dec 1, 2022
…ows us to control the pace of GC of Batch Evals independent of other evals.
@shantanugadgil
Contributor

Any chance of this making it into the 1.4.x series as well, in addition to the 1.5.0 release?

@tgross
Member

tgross commented Dec 8, 2022

Sorry for the delay on this... Luiz is on a well-deserved holiday so I'm going to pick up the review from here with Seth. We're juggling a couple other things at the moment but this is definitely important so expect we'll follow through soonish. Thanks for your patience.

@shantanugadgil this is a bug fix, so it'll get backported to two major versions (i.e. if we release it in 1.5.0 it'll get backported to 1.4.x and 1.3.x).

@shantanugadgil
Contributor

@tgross Thanks for the update. 🙏 Looking forward to the backported releases of 1.4.x. 👍

Member

@tgross tgross left a comment


Hi @stswidwinski! Thanks for your patience on this. I've given it a review and I've left a few comments to tighten up the code that'll hopefully make it easier for the next folks who read this 😁

But I feel pretty good about this approach overall. I'm getting towards the end of my week here so I'm going to give this one more pass early next week and hopefully we can wrap it up then. Thanks!

Review threads (resolved): nomad/core_sched.go · command/agent/config.go · nomad/core_sched_test.go
Comment on lines 563 to 565
// An EvalGC should reap allocations from jobs with a newer modify index and reap the eval itself
// if all allocs are reaped.
func TestCoreScheduler_EvalGC_Batch_OldVersionReapsEval(t *testing.T) {
Member


These tests have a lot of setup code (as I'm sure you noticed!) and the assertion is only subtly different from the previous one. It might make the whole test more understandable if we collapsed these into a single test and had this one be the final assertion that the eval is GC'd once the conditions are right.

Contributor Author


Well, I have actually spent quite a bit of time unravelling these test structures. The reason making this into a single test is non-trivial is the way the time table is used, and the fact that there is no clock whose reading we can set to a particular time.

In essence, time.Now() is used throughout, which always gives the current time with no ability to set the clock. When we insert time events into the time table, we must do so from oldest to newest for the time table to let us select older events (the lookup logic is: "give me the last event that occurred at or before time X").

As such, I resorted to forcefully resetting the time table for each logical test sharing the setup.
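To illustrate the constraint, here is a toy model of the time table (not Nomad's implementation): a "last event at or before time X" lookup only behaves if events are witnessed oldest to newest, which is why tests sharing setup must reset the table between cases:

```go
package main

import (
	"fmt"
	"time"
)

type event struct {
	index uint64
	when  time.Time
}

type timeTable struct{ events []event }

// witness records an event; correctness below assumes oldest-to-newest order.
func (tt *timeTable) witness(index uint64, when time.Time) {
	tt.events = append(tt.events, event{index, when})
}

// nearestIndex returns the index of the last event at or before t.
func (tt *timeTable) nearestIndex(t time.Time) uint64 {
	var idx uint64
	for _, e := range tt.events {
		if e.when.After(t) {
			break // relies on chronological insertion order
		}
		idx = e.index
	}
	return idx
}

func main() {
	tt := &timeTable{}
	now := time.Now()
	tt.witness(1000, now.Add(-2*time.Hour))
	tt.witness(2000, now.Add(-1*time.Hour))
	fmt.Println(tt.nearestIndex(now.Add(-90 * time.Minute))) // 1000
}
```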

Please see the changes.

@stswidwinski
Contributor Author

stswidwinski commented Dec 9, 2022 via email

@stswidwinski
Contributor Author

I've actually found a bug in the logic: we would apply the regular GC timeout first and thus fail to GC evaluations that should be GCed when BatchEvalGCPeriod < EvalGCPeriod. Fixed. I'll add all of the changes soon.
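A hedged sketch of that ordering issue (hypothetical names; the real logic lives in nomad/core_sched.go): the job-type-specific threshold must be selected before the age comparison, otherwise a shorter batch threshold is shadowed by the regular one:

```go
package main

import (
	"fmt"
	"time"
)

// gcEligible picks the GC threshold by job type *before* testing eval age.
// Applying the regular threshold first would wrongly keep batch evals alive
// whenever BatchEvalGCPeriod < EvalGCPeriod.
func gcEligible(evalType string, age, evalGC, batchEvalGC time.Duration) bool {
	threshold := evalGC
	if evalType == "batch" {
		threshold = batchEvalGC
	}
	return age > threshold
}

func main() {
	// With a 1h batch threshold and a 24h regular threshold, a 2h-old batch
	// eval must be GC-eligible; checking the 24h threshold first would miss it.
	fmt.Println(gcEligible("batch", 2*time.Hour, 24*time.Hour, 1*time.Hour)) // true
}
```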


@stswidwinski
Contributor Author

> I tested the suggestions I made and they do seem to fix the problem. The changes I made are in 0d80be0. Feel free to cherry-pick them if they look correct, but also feel free to fix the test with a different approach 🙂

This sounds great. I cherry-picked the changes on top.

@tgross tgross added the backport/1.2.x, backport/1.3.x, and backport/1.4.x labels Jan 30, 2023
@stswidwinski
Contributor Author

Note: I'll add the changelog entries soon :-)

@stswidwinski stswidwinski requested review from lgfa29 and removed request for tgross January 30, 2023 18:45
Member

@tgross tgross left a comment


I've pulled down this branch and re-ran the bench testing I did previously. LGTM! Thanks for your patience in seeing this one through @stswidwinski!

@stswidwinski
Contributor Author

🎆 Thank you for bearing with me

Contributor

@lgfa29 lgfa29 left a comment


Thanks for this work @stswidwinski!

I pushed a commit to expand the changelog entry a bit more since this will be a breaking change.

For the upgrade note, I will add it in a separate PR since this one will be backported.

@lgfa29
Contributor

lgfa29 commented Jan 31, 2023

Sorry, just noticed a mistake in my changes 😅

I pushed a commit to fix it.

@tgross tgross merged commit 2285432 into hashicorp:main Jan 31, 2023
tgross pushed a commit that referenced this pull request Jan 31, 2023
Prior to 2409f72 the code compared the modification index of a job to
itself. Afterwards, the code compared the creation index of the job to
itself. In either case the comparison trivially always results in false, so
re-parenting of allocs is never detected and the evaluations are never
garbage collected, which leads to unreclaimable memory.

Prior to this change, allocations and evaluations for batch jobs were never
garbage collected until the batch job was explicitly stopped. The new
`batch_eval_gc_threshold` server configuration controls how old they must be
before they become eligible for collection. The default threshold is `24h`.
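Based on that description, the knob would be set in the agent's server block; a sketch (the option name and default come from the commit message above, the surrounding layout is assumed):

```hcl
server {
  enabled = true

  # Minimum age before evals/allocs of batch jobs become eligible for GC.
  # "24h" matches the default stated above.
  batch_eval_gc_threshold = "24h"
}
```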
tgross pushed three further commits (backports) that referenced this pull request Jan 31, 2023, each with the same message as above (two of them co-authored by stswidwinski <stan.swidwinski@gmail.com>).
evenius pushed a commit to onmo-games/nomad that referenced this pull request Apr 25, 2023
Labels
backport/1.2.x · backport/1.3.x · backport/1.4.x

Successfully merging this pull request may close: Nomad server nodes using up all available host memory