
Infinite memory growth in Nomad Agent and Nomad Client when using batch jobs. #15090

Closed
stswidwinski opened this issue Oct 31, 2022 · 12 comments


stswidwinski commented Oct 31, 2022

Nomad version

Nomad v1.3.1 (2b054e3)
But this is broken on v.1.4.2 as well.

Operating system and Environment details

Unix.

Issue

Evaluations and Allocations for Batch jobs are not garbage collected as long as the batch job is not stopped. The growth continues ad infinitum, causing three side effects:

  1. The RSS of the Scheduler grows linearly in the number of evals and allocs
  2. The RSS of the Agent grows linearly in the number of allocs
  3. The Agent's goroutine count grows linearly in the number of allocs.

There are only two ways to remove this memory growth:

  1. Run nomad system gc (which removes any history and data)
  2. Stop the job and start a new one

Both of these put the burden of cleanup on the end user. Now, let me point to the exact lines of code responsible for this behavior and go through the logic in detail. First, the GC of evaluations and allocations for batch jobs is somewhat special-cased:

https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L303-L328

The evaluation is never garbage collected unless the corresponding job is fully stopped. allowBatch is somewhat moot because it is never true unless we force gc. Now, infinite growth of allocations would be bad, but would not cause the infinite memory growth on clients. I believe that this is the reasoning behind line 325 which attempts to GC old allocations:

https://github.com/hashicorp/nomad/blob/main/nomad/core_sched.go#L347-L357

However, please notice that an allocation is never "re-parented" to a different job, and the .Job field is a pointer. We therefore always end up with alloc.Job == job, which yields alloc.Job.CreateIndex == job.CreateIndex, so the allocation is never garbage collected. Note that this is because CreateIndex is set only once, during the first raft-level serialization of the object:

job.CreateIndex = existing.(*structs.Job).CreateIndex

The Job pointer of an allocation is similarly only ever set when the allocation is upserted for the first time:

func (s *StateStore) upsertAllocsImpl(index uint64, allocs []*structs.Allocation, txn *txn) error {
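
To make the pointer argument concrete, here is a minimal sketch (simplified types, not the actual Nomad structs) of why the CreateIndex comparison in the GC path can never fire:

package main

import "fmt"

// Simplified stand-ins for structs.Job and structs.Allocation -- sketch only.
type Job struct{ CreateIndex uint64 }

type Allocation struct{ Job *Job }

// olderVersion mirrors the shape of the check in core_sched.go: GC the alloc
// only if its job copy is strictly older than the current job.
func olderVersion(alloc *Allocation, job *Job) bool {
	return alloc.Job != nil && alloc.Job.CreateIndex < job.CreateIndex
}

func main() {
	job := &Job{CreateIndex: 42}
	// The allocation keeps the job pointer it was created with, and
	// CreateIndex is only ever written on the first upsert, so both sides of
	// the comparison always carry the same value.
	alloc := &Allocation{Job: job}
	fmt.Println(olderVersion(alloc, job)) // always false
}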

History.

Please note that this logic has been broken for quite some time. In 2018 the logic shifted from using ModifyIndex to CreateIndex:

2409f72

But the original logic still wouldn't have worked due to Job being a pointer rather than a value. Similar bugs have been reported in the past:

#4532
#14842

Fix.

The fix here is to consider not the entire history of allocations for a batch job, but only the most recent one (or even 10 most recent ones or some other small n).
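
As an illustration of that idea, here is a rough, hypothetical helper (not part of Nomad) that keeps only the n most recent terminal allocations of a batch job and treats everything older as a GC candidate:

package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for structs.Allocation -- sketch only.
type Allocation struct {
	ID           string
	CreateIndex  uint64
	ClientStatus string
}

func (a *Allocation) TerminalStatus() bool {
	return a.ClientStatus == "complete" || a.ClientStatus == "failed"
}

// gcCandidates is a hypothetical helper: keep the n most recent terminal
// allocations and return the older ones as candidates for garbage collection.
func gcCandidates(allocs []*Allocation, n int) []*Allocation {
	terminal := make([]*Allocation, 0, len(allocs))
	for _, a := range allocs {
		if a.TerminalStatus() {
			terminal = append(terminal, a)
		}
	}
	// Newest first by raft CreateIndex.
	sort.Slice(terminal, func(i, j int) bool {
		return terminal[i].CreateIndex > terminal[j].CreateIndex
	})
	if len(terminal) <= n {
		return nil
	}
	return terminal[n:]
}

func main() {
	allocs := []*Allocation{
		{ID: "a1", CreateIndex: 10, ClientStatus: "complete"},
		{ID: "a2", CreateIndex: 20, ClientStatus: "complete"},
		{ID: "a3", CreateIndex: 30, ClientStatus: "complete"},
	}
	for _, a := range gcCandidates(allocs, 1) {
		fmt.Println("would GC", a.ID) // a2, then a1
	}
}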

Reproduction steps

Create a batch job with an arbitrary command; assume /usr/bin/true for this repro. What is important is that the type of the job remains batch. Then proceed to create some number of job updates with trivial changes to the job spec which do not result in functional changes.

This is to ensure that we trigger a new evaluation and a new job version -- the simplest way to achieve this is to add any random nonce metadata token which changes with every update. Observe infinite memory growth.

Expected Result

Allocations should be garbage collected with time and the infinite history should not persist.

Actual Result

The opposite.



stswidwinski commented Nov 1, 2022

Having thought about this more and looked into more of the code, I think the fix might be as simple as comparing alloc.CreateIndex with alloc.Job.ModifyIndex, which is a combination of the code before 2409f72 and after.
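
Expressed as code, the suggested condition would look roughly like this (simplified types, a sketch rather than a patch against core_sched.go):

package main

import "fmt"

// Simplified stand-ins -- sketch only.
type Job struct{ CreateIndex, ModifyIndex uint64 }

type Allocation struct {
	CreateIndex uint64
	Job         *Job
	Terminal    bool
}

// olderVersionTerminal reports whether the alloc was created before the job's
// most recent modification, i.e. it belongs to an older version of the job.
func olderVersionTerminal(alloc *Allocation, job *Job) bool {
	return alloc.Job != nil && alloc.CreateIndex < job.ModifyIndex && alloc.Terminal
}

func main() {
	job := &Job{CreateIndex: 10, ModifyIndex: 50} // job has been updated since index 10
	oldAlloc := &Allocation{CreateIndex: 12, Job: job, Terminal: true}
	newAlloc := &Allocation{CreateIndex: 50, Job: job, Terminal: true}
	fmt.Println(olderVersionTerminal(oldAlloc, job)) // true: eligible for GC
	fmt.Println(olderVersionTerminal(newAlloc, job)) // false: belongs to the current version
}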

That said, the code is internally inconsistent. This test seems to assert that we should never GC allocs or evals in between runs:

nomad/nomad/core_sched_test.go

Lines 1298 to 1312 in 2b054e3

// This test ensures that batch jobs are GC'd in one shot, meaning it all
// allocs/evals and job or nothing
func TestCoreScheduler_JobGC_OneShot(t *testing.T) {
ci.Parallel(t)
s1, cleanupS1 := TestServer(t, nil)
defer cleanupS1()
testutil.WaitForLeader(t, s1.RPC)
// COMPAT Remove in 0.6: Reset the FSM time table since we reconcile which sets index 0
s1.fsm.timetable.table = make([]TimeTableEntry, 1, 10)
// Insert job.
store := s1.fsm.State()
job := mock.Job()

However, this code seems to indicate something very different:

oldAllocs := olderVersionTerminalAllocs(allocs, job)


lgfa29 commented Nov 8, 2022

Thanks for the report and the PR @stswidwinski 😄

I've been looking into this but I'm not 100% sure of the details yet. At first sight your analysis makes a lot of sense, in

if alloc.Job != nil && alloc.Job.CreateIndex < job.CreateIndex && alloc.TerminalStatus() {
  // ...
}

alloc.Job.CreateIndex < job.CreateIndex seems like it should always be false as you mentioned, but when I tried to reproduce the problem, the GC worked and removed the batch job allocs and evals.

Any additional details about your test environment that you could provide? Like how many servers and clients are you using?

@lgfa29 lgfa29 self-assigned this Nov 8, 2022
@lgfa29 lgfa29 added this to Needs Triage in Nomad - Community Issues Triage via automation Nov 8, 2022
@lgfa29 lgfa29 moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Nov 8, 2022

stswidwinski commented Nov 8, 2022 via email

@stswidwinski

@lgfa29, I think I know what problem you were seeing. When the job specification does not change between updates, the former allocation is reused to reduce churn within the fleet, whereas any change to the job spec will induce a new allocation. Even without changes to the spec, however, one can easily see evaluations accumulate in perpetuity.

Here is a repro starting with installation of the nomad RPM and going all the way to tracking evaluations and allocations in time. Setup:

# Local installation
$ yumdownloader nomad-1.4.2
$ rpm2cpio nomad-1.4.2-1.x86_64.rpm | cpio -idmv --no-absolute-filenames

# Local Nomad configuration
## Note that it's very basic with aggressive GC settings just to make sure that 
## we don't have to wait for GC
$ cat nomad.hcl
{
  "datacenter": "test_datacenter",
  "region": "test_region",
  "data_dir": "/tmp/data",
  "addresses": {
    "http": "localhost",
    "rpc": "localhost",
    "serf": "localhost"
  },
  "leave_on_interrupt": true,
  "leave_on_terminate": true,
  "disable_update_check": true,
  "log_level": "DEBUG",
  "log_file": "/tmp/nomad_logs",
  "server": {
    "enabled": true,
    "heartbeat_grace": "10s",
    "bootstrap_expect": 1,
    "node_gc_threshold": "1m",
    "job_gc_interval": "1m",
    "job_gc_threshold": "1m",
    "eval_gc_threshold": "1m",
    "deployment_gc_threshold": "1m"
  },
  "client": {
    "enabled": true,
    "servers": [
      "localhost"
    ],
    "gc_interval": "1m",
    "gc_disk_usage_threshold": 80,
    "gc_inode_usage_threshold": 70,
    "gc_max_allocs": 10
  }
}

# The job spec.
## NOTE: We will be modulating the "CHANGE_ME" string in the test script.
$ cat job.hcl
job "test_job" {
  region = "test_region"
  datacenters = ["test_datacenter"]
  type = "batch"

  group "one_and_only" {
    # Specify the number of these tasks we want.
    count = 1

    task "test" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/echo"
        args = ["CHANGE_ME"]
      }
    }
  }
}

Test script:

# A bunch of bash to make the repro simpler
$ cat run_test.sh
#!/bin/bash

set -euo pipefail

SUMMARY_FILE="/tmp/summary"

echo "TIME EVAL_COUNT ALLOC_COUNT" > $SUMMARY_FILE
for value in {1..300}
do
  # Prepare the job spec
  cp job.hcl job_tmp.hcl
  sed -i job_tmp.hcl -e "s/CHANGE_ME/$value/"

  # Run the job
  ./nomad job run job_tmp.hcl

  # Collect eval and alloc stats.
  EVAL_COUNT=$(curl http://localhost:4646/v1/job/test_job/evaluations?namespace=default | jq '. | length')
  ALLOC_COUNT=$(curl http://localhost:4646/v1/job/test_job/allocations?all=true\&namespace=default | jq '. | length')
  echo "'$(date)' $EVAL_COUNT $ALLOC_COUNT" >> $SUMMARY_FILE
done

sleep 120
EVAL_COUNT=$(curl http://localhost:4646/v1/job/test_job/evaluations?namespace=default | jq '. | length')
ALLOC_COUNT=$(curl http://localhost:4646/v1/job/test_job/allocations?all=true\&namespace=default | jq '. | length')
echo "'$(date)' $EVAL_COUNT $ALLOC_COUNT" >> $SUMMARY_FILE

Test run:

# One terminal, run nomad in dev mode
$ ./nomad agent -dev -config ./nomad.hcl

# Another terminal run the script
$ ./run_test.sh

Observe the output (I truncated the middle 90 rows since they aren't particularly interesting; they're exactly as one would expect):

TIME EVAL_COUNT ALLOC_COUNT
'Tue Nov  8 08:08:35 2022' 1 1
'Tue Nov  8 08:08:36 2022' 2 2
'Tue Nov  8 08:08:38 2022' 3 3
'Tue Nov  8 08:08:39 2022' 4 4
'Tue Nov  8 08:08:40 2022' 5 5
...
'Tue Nov  8 08:10:11 2022' 95 95
'Tue Nov  8 08:10:13 2022' 96 96
'Tue Nov  8 08:10:14 2022' 97 97
'Tue Nov  8 08:10:14 2022' 98 98
'Tue Nov  8 08:10:15 2022' 99 99
'Tue Nov  8 08:10:16 2022' 100 100
'Tue Nov  8 08:12:16 2022' 100 100

Please notice that:

  1. Evals are never GCed
  2. Allocs are never GCed

I have previously let this sit around for 30 more minutes and the outcome is the same. I have run this for over a week on a tiny deployment with the same results (except with much larger numbers). The cluster eventually runs out of memory -- both client and scheduler.

The GC limits are such that both evals and allocs should have been GCed (except for the most recent one, which is necessary to avoid rescheduling the tasks).

Here is the output of the nomad CLI at the end of the test for a summary of the job information in case it is helpful:

$ ./nomad job status test_job
ID            = test_job
Name          = test_job
Submit Date   = 2022-11-08T08:10:15-05:00
Type          = batch
Priority      = 50
Datacenters   = test_datacenter
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group    Queued  Starting  Running  Failed  Complete  Lost  Unknown
one_and_only  0       0         0        0       100       0     0

Allocations
ID        Node ID   Task Group    Version  Desired  Status    Created    Modified
c8b8c1be  16a7c8a5  one_and_only  99       run      complete  3m28s ago  3m27s ago
b99128cb  16a7c8a5  one_and_only  98       run      complete  3m29s ago  3m29s ago
8cd6c875  16a7c8a5  one_and_only  97       stop     complete  3m29s ago  3m29s ago
b161a2fd  16a7c8a5  one_and_only  96       run      complete  3m30s ago  3m30s ago
... <a lot more rows as you'd expect>

@stswidwinski

@lgfa29 / @jrasell -- did you have a chance to take a look at this again?


lgfa29 commented Nov 17, 2022

Hi @stswidwinski,

apologies for the delay here. I was going through internal documents and conversations to understand why the change in 2409f72 was needed, and I think it relates to a specific issue in large clusters, but I haven't been able to find much information yet.

On a first pass I think your code changes make sense, but I want to make sure we're not causing some edge case regression that was supposed to be fixed by 2409f72. I will try to get back to it next week.

@stswidwinski

@lgfa29 any luck? :)


lgfa29 commented Dec 1, 2022

Hi @stswidwinski,

After checking with the team, we think it may be better to leave this part of the logic as-is and introduce a new configuration value (something like batch_eval_gc_threshold) to force the deletion of old allocs and evals regardless of job status, indexes etc.

To reduce the effects on existing deployments, this new configuration value should be set to something high, like 168h (7 days), and we should note this as a breaking change to give folks time to adjust it before upgrading if necessary.

The reason for this approach is that it's hard for us to tell what people expect out of the GC for batch jobs. Some cases may benefit from having a long history, while others may prefer to reap them quickly. An explicit configuration value makes the behaviour clearer to users and allows them to fine-tune it for their needs; it also helps us as developers, since the logic becomes more straightforward.

In terms of code changes, in addition to creating and plumbing the configuration all the way to the server, I think we would only need a new if check here: https://github.com/hashicorp/nomad/blob/v1.4.3/nomad/core_sched.go#L301-L330. We will probably need to pass another threshold as well.
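
As a very rough sketch of what such a check could look like (hypothetical names and shape, not the final implementation):

package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for structs.Evaluation -- sketch only.
type Evaluation struct {
	ID         string
	JobType    string
	ModifyTime int64 // unix nanoseconds
}

// olderThanBatchThreshold is a hypothetical guard: a batch job's eval becomes
// a GC candidate once it is older than the proposed batch_eval_gc_threshold,
// even if the job itself has not been stopped. The real check would live in
// core_sched.go and use the server's configured threshold.
func olderThanBatchThreshold(eval *Evaluation, now time.Time, threshold time.Duration) bool {
	return eval.JobType == "batch" && now.Sub(time.Unix(0, eval.ModifyTime)) > threshold
}

func main() {
	threshold := 168 * time.Hour // the suggested 7-day default
	eval := &Evaluation{
		ID:         "e1",
		JobType:    "batch",
		ModifyTime: time.Now().Add(-200 * time.Hour).UnixNano(),
	}
	fmt.Println(olderThanBatchThreshold(eval, time.Now(), threshold)) // true
}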

What do you think of this approach?


stswidwinski commented Dec 2, 2022

Hey @lgfa29,

Thank you for the response and the thought process. This makes sense to me. I'll make the changes and update the PR accordingly. I hope to get to this today, but if not... I'm highly likely to do this by EOW next week.

I believe that there are still some (perhaps tangential) issues with the existing logic for pruning evaluations (a reference to evaluation.Job after deserialization, without adjusting the pointer, may cause segfaults or random values). Let's discuss them directly on the updated PR to make it concrete.

Thanks again!


lgfa29 commented Dec 2, 2022

Sounds good! Thanks for all the work you've done 🙂

stswidwinski added a commit to stswidwinski/nomad that referenced this issue Dec 5, 2022
…ows us to control the pace of GC of Batch Evals independent of other evals.
@stswidwinski

Updated the PR, added tests and pulled the configuration value from the command through the scheduler. I'm not 100% confident that this is enough given that I am relatively new to the entire codebase (awaiting test chain). Please let me know if I misunderstood your comment above and/or if there are other changes you would like to see in the PR.


tgross commented Jan 31, 2023

Fixed in #15097 which will ship in Nomad 1.5.0 (with backports)
