Post Stop tasks are never killed on the client node (also at GC time) resulting in task events loss. #17971

Closed
stswidwinski opened this issue Jul 18, 2023 · 3 comments · Fixed by #18005

Comments

@stswidwinski
Contributor

stswidwinski commented Jul 18, 2023

Nomad version

Tip.

Operating system and Environment details

Linux.

Issue

When the Nomad client node garbage collects an allocation, any post-stop lifecycle tasks:

  1. Are not killed,
  2. Lose their task state events.

While one may argue that #1 is intended, #2 most likely is not and appears to be a bug.

Reproduction steps

First, start up a single server and a single client. It is important that the client has a low gc_max_allocs so that we can easily trigger garbage collection.

# Start up a server
$ echo 'data_dir = "/tmp/server"                                                                                                                                                                                                                                                                     
log_level = "TRACE"                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                      
advertise {                                                                                                                                                                                                                                                                                           
  http = "127.0.0.1"                                                                                                                                                                                                                                                                                  
  rpc = "127.0.0.1"                                                                                                                                                                                                                                                                                   
  serf = "127.0.0.1"                                                                                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
server {                                                                                                                                                                                                                                                                                              
  enabled = true                                                                                                                                                                                                                                                                                      
  bootstrap_expect = 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
}' > /tmp/server

$ nomad agent -config /tmp/server
 
# Start up a client
echo 'data_dir = "/tmp/client-1"                                                                                                                                                                                                                                                                   
log_level = "debug"                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                      
advertise {                                                                                                                                                                                                                                                                                           
  http = "127.0.0.1"                                                                                                                                                                                                                                                                                  
  rpc = "127.0.0.1"                                                                                                                                                                                                                                                                                   
  serf = "127.0.0.1"                                                                                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
ports {                                                                                                                                                                                                                                                                                               
  http = "9876"                                                                                                                                                                                                                                                                                       
  rpc = "9875"                                                                                                                                                                                                                                                                                        
  serf = "9874"                                                                                                                                                                                                                                                                                       
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
client {                                                                                                                                                                                                                                                                                              
  enabled = true                                                                                                                                                                                                                                                                                      
  servers = ["127.0.0.1"]                                                                                                                                                                                                                                                                             
  gc_max_allocs = 1                                                                                                                                                                                                                                                                                   
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
plugin "raw_exec" {                                                                                                                                                                                                                                                                                   
  config {                                                                                                                                                                                                                                                                                            
    enabled = true                                                                                                                                                                                                                                                                                    
  }                                                                                                                                                                                                                                                                                                   
}'  > /tmp/client

$ nomad agent -config /tmp/client

Then start a simple job that has a post-stop task configured.

$ cat job_spec.hcl
job "example" {
  datacenters = ["dc1"]
  type = "batch"
  
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
   
      config {
        command = "echo"
        args = [ "foo" ]
      }
    }
 
    task "poststop" {
      driver = "raw_exec"
 
      lifecycle {
        hook = "poststop"
        sidecar = false
      }
   
      config {
        command = "echo"
        args = [ "bar" ]
      }
    }
  }
}

$ nomad job run job_spec.hcl

If we now let this run, we can observe the correct job state output. In particular, we will notice that the task events are populated and give us plenty of information:

$ ./nomad alloc status -verbose 3c584bf6
ID                  = 3c584bf6-98fc-ccab-bf42-f7a8bfded3be
Eval ID             = bc5a4cc7-d36d-f5dc-0f68-865f1c2b73f5
Name                = example.test-group[0]
Node ID             = 4e77df47-432b-c730-c553-abfa12ab84bc
Node Name           = node
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2023-07-17T16:28:14-04:00
Modified            = 2023-07-17T16:28:14-04:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.471µs
Failures            = 0
 
Task "poststop" (poststop) is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = 2023-07-17T20:28:14Z
Finished At    = 2023-07-17T20:28:14Z
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time                       Type        Description
2023-07-17T16:28:14-04:00  Terminated  Exit Code: 0
2023-07-17T16:28:14-04:00  Started     Task started by client
2023-07-17T16:28:14-04:00  Task Setup  Building Task Directory
2023-07-17T16:28:14-04:00  Received    Task received by client
 
Task "test-task" is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = 2023-07-17T20:28:14Z
Finished At    = 2023-07-17T20:28:14Z
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time                       Type        Description
2023-07-17T16:28:14-04:00  Terminated  Exit Code: 0
2023-07-17T16:28:14-04:00  Started     Task started by client
2023-07-17T16:28:14-04:00  Task Setup  Building Task Directory
2023-07-17T16:28:14-04:00  Received    Task received by client
 
Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
4e77df47-432b-c730-c553-abfa12ab84bc  0.00497  0                  0              0                        0.00497

Now, let's start anything else. It doesn't really matter what the second job is -- all we care about is that the allocation on the client gets garbage collected, which is guaranteed by the gc_max_allocs = 1 setting; any trivial second batch job will do (see the sketch below). Once the new job has run, the exact same query to the endpoint above lacks metadata, as shown in the output that follows.
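For example, a second copy of a trivial batch job is enough to push the client over gc_max_allocs (the file, job, group, and task names here are arbitrary):

$ cat second_job.hcl
job "example2" {
  datacenters = ["dc1"]
  type = "batch"

  group "other-group" {
    task "other-task" {
      driver = "raw_exec"

      config {
        command = "echo"
        args = [ "baz" ]
      }
    }
  }
}

$ nomad job run second_job.hcl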

# Look at the original information
$ ./nomad alloc status -verbose 3c584bf6
ID                  = 3c584bf6-98fc-ccab-bf42-f7a8bfded3be
Eval ID             = bc5a4cc7-d36d-f5dc-0f68-865f1c2b73f5
Name                = example.test-group[0]
Node ID             = 4e77df47-432b-c730-c553-abfa12ab84bc
Node Name           = node
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2023-07-17T16:28:14-04:00
Modified            = 2023-07-17T16:29:08-04:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.471µs
Failures            = 0
 
Task "poststop" (poststop) is ""
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = N/A
Finished At    = 2023-07-17T16:29:08-04:00
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time  Type  Description
 
[...]

Or in JSON form:

$ ./nomad alloc status -json -verbose 3c584bf6 | jq .TaskStates.poststop
{
  "Events": null,
  "Failed": false,
  "FinishedAt": "2023-07-17T16:29:08.347491978-04:00",
  "LastRestart": null,
  "Restarts": 0,
  "StartedAt": null,
  "State": "",
  "TaskHandle": null
}

This output contains illegal, uninitialized values.

Expected Result

The post stop task is stopped and the metadata is preserved.

Actual Result

It is not stopped and the metadata is not preserved.

Root cause

The client GC logic is quite simple: it looks through allocs that may be GCed, picks one, and then runs the Destroy logic on it:

https://github.com/hashicorp/nomad/blob/v1.5.5/client/gc.go#L183

This destroy logic is also simple:

  1. It kills and awaits the leader tasks
  2. It kills and awaits non-sidecar and non-poststop tasks
  3. It kills and awaits sidecar tasks
  4. The comments claim that post-stop tasks eventually get killed and awaited, except... they don't.

https://github.com/hashicorp/nomad/blob/v1.5.5/client/allocrunner/alloc_runner.go#L661-L736

This function returns the task states with the added kill states. However, since poststop, prestart, and poststart tasks never got proper treatment, their states are missing entirely and the existing task events get overwritten, which causes them to disappear. I think the fix for this would be to either:

  1. Explicitly add the states for post-stop tasks to the map (that is, initialize it as a copy of the existing task states), or
  2. Flip this single condition on ts.IsPoststopTask so that they are killed alongside sidecar tasks, which is what the comments claim was the intended behavior (https://github.com/hashicorp/nomad/blob/v1.5.5/client/allocrunner/alloc_runner.go#L714).

I believe the first treatment would be less invasive and less "different" from what happens today. The second treatment seems to be the more correct one, since right now these tasks simply do not get killed, but it assumes that we want to split the regular stop behavior from the behavior that occurs at GC time.

@schmichael
Member

Is it ok if we ignore the "1. Is not killed" aspect? As far as I can tell that's working as intended: if I change the post stop task's command to sleep 120, then starting a second job will not kill the alloc still running its post stop task. This is by design, and we log the temporary bypass of gc_max_allocs:

    2023-07-20T16:49:49.099-0700 [WARN]  client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (2) is over the limit (1)"

However, if I use your jobspec so that the poststop task, and therefore the whole alloc, is terminal ("dead" and "complete" respectively), I can reproduce your corrupted task events state. So there's definitely a bug here; I'm just not sure it's related to killing poststop tasks.

For anyone lazy like me: if you create a file maxallocs.hcl containing:

client {
  gc_max_allocs = 1
}

You can use that with nomad agent -dev -config maxallocs.hcl to reproduce this issue.
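Put together, the dev-mode reproduction then looks roughly like this (file names assumed from the snippets above; the alloc ID placeholder is whatever the first run produces):

$ nomad agent -dev -config maxallocs.hcl

# in another terminal: run the jobspec from the issue body,
# then any second job to push the client past gc_max_allocs
$ nomad job run job_spec.hcl
$ nomad job run second_job.hcl

# the first allocation's poststop task events are now gone
$ nomad alloc status -verbose <alloc-id-of-first-job>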

@stswidwinski
Contributor Author

Thank you for taking a look!

Is it ok if we ignore the 1. Is not killed aspect because as far as I can tell that's working as intended: if I change the post stop task's command to sleep 120, then starting a second job will not kill the alloc still running its post stop task. This is by design and we log the temporary bypass of gc_max_allocs: [...]

I agree that this isn't the core of the problem and I think it is quite OK for us to ignore it as a stand-alone problem. The primary reason I mention the killing is that I believe it to be the root of the issue within the implementation -- I hope that the description and PR make it somewhat clearer, but please do let me know if you think this is wrong or otherwise confusing.

Philosophically speaking, unbounded run-times of post-stop tasks are somewhat bothersome in that the task spec itself may specify values which imply that they are meant to be stopped (for instance, kill_timeout), whereas in reality they will never be stopped without a "special" event (such as an OOM kill).

This may of course be circumvented by using /usr/bin/timeout or similar, but evoking the same behavior of SIGINT/SIGTERM followed by a later SIGKILL (consistent with the normal task treatment) is somewhat of an exercise in re-implementing the existing functionality.
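For illustration only, wrapping a poststop command in coreutils timeout might look like the following (the cleanup script path is a made-up placeholder); it hand-rolls the TERM-then-KILL escalation that kill_timeout would otherwise provide:

    task "poststop" {
      driver = "raw_exec"

      lifecycle {
        hook = "poststop"
      }

      config {
        # /usr/local/bin/cleanup.sh is a placeholder for the real poststop work;
        # send SIGTERM after 60s, escalate to SIGKILL 30s later
        command = "/usr/bin/timeout"
        args    = ["--signal=TERM", "--kill-after=30s", "60s", "/usr/local/bin/cleanup.sh"]
      }
    }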

However if I use your jobspec so that the poststop task, and therefore the whole alloc, is terminal ("dead" and "complete" respectively), I can reproduce your corrupted task events state. So there's definitely a bug here, I'm just not sure it's related to killing poststop tasks. [...]

The primary reason for this assertion is dlv, which shows me that this is the place where the first illegal mutation occurs. With the PR submitted I can also no longer reproduce the issue, which further reinforces my belief. Please let me know if this makes sense / is unclear!

@schmichael
Member

schmichael commented Jul 21, 2023

whereas in reality they will never be stopped without a "special" event (such as an OOM kill).

This is a great feature request and similar to this issue despite being for batch jobs: #1782. I think making timeouts at the task level makes the most sense precisely for things like post-stop tasks.

Another way to frame this is that time is an unconstrained resource in Nomad and we should offer to constrain it like cpu or memory or filesystems.
