Post Stop tasks are never killed on the client node (also at GC time) resulting in task events loss. #17971

Closed
stswidwinski opened this issue Jul 18, 2023 · 3 comments · Fixed by #18005

Comments

@stswidwinski
Contributor

stswidwinski commented Jul 18, 2023

Nomad version

Tip.

Operating system and Environment details

Linux.

Issue

When the Nomad client node garbage collects an allocation, any post-stop lifecycle tasks:

  1. Are not killed,
  2. Lose their task state events.

While one may argue that #1 is intended, #2 most likely is not and appears to be a bug.

Reproduction steps

First, start up a single server and a single client. It is important that the client has a low gc_max_allocs so that we can easily trigger garbage collection.

# Start up a server
$ echo 'data_dir = "/tmp/server"                                                                                                                                                                                                                                                                     
log_level = "TRACE"                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                      
advertise {                                                                                                                                                                                                                                                                                           
  http = "127.0.0.1"                                                                                                                                                                                                                                                                                  
  rpc = "127.0.0.1"                                                                                                                                                                                                                                                                                   
  serf = "127.0.0.1"                                                                                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
server {                                                                                                                                                                                                                                                                                              
  enabled = true                                                                                                                                                                                                                                                                                      
  bootstrap_expect = 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
}' > /tmp/server

$ nomad agent -config /tmp/server
 
# Start up a client
echo 'data_dir = "/tmp/client-1"                                                                                                                                                                                                                                                                   
log_level = "debug"                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                      
advertise {                                                                                                                                                                                                                                                                                           
  http = "127.0.0.1"                                                                                                                                                                                                                                                                                  
  rpc = "127.0.0.1"                                                                                                                                                                                                                                                                                   
  serf = "127.0.0.1"                                                                                                                                                                                                                                                                                  
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
ports {                                                                                                                                                                                                                                                                                               
  http = "9876"                                                                                                                                                                                                                                                                                       
  rpc = "9875"                                                                                                                                                                                                                                                                                        
  serf = "9874"                                                                                                                                                                                                                                                                                       
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
client {                                                                                                                                                                                                                                                                                              
  enabled = true                                                                                                                                                                                                                                                                                      
  servers = ["127.0.0.1"]                                                                                                                                                                                                                                                                             
  gc_max_allocs = 1                                                                                                                                                                                                                                                                                   
}                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                      
plugin "raw_exec" {                                                                                                                                                                                                                                                                                   
  config {                                                                                                                                                                                                                                                                                            
    enabled = true                                                                                                                                                                                                                                                                                    
  }                                                                                                                                                                                                                                                                                                   
}'  > /tmp/client

$ nomad agent -config /tmp/client

Then start a simple job that has a post-stop task configured.

$ cat job_spec.hcl
job "example" {
  datacenters = ["dc1"]
  type = "batch"
  
  group "test-group" {
    task "test-task" {
      driver = "raw_exec"
   
      config {
        command = "echo"
        args = [ "foo" ]
      }
    }
 
    task "poststop" {
      driver = "raw_exec"
 
      lifecycle {
        hook = "poststop"
        sidecar = false
      }
   
      config {
        command = "echo"
        args = [ "bar" ]
      }
    }
  }
}

$ nomad job run job_spec.hcl

If we now let this run, we can observe the correct job state output. In particular, we will notice that the task events are populated and give us plenty of information:

$ ./nomad alloc status -verbose 3c584bf6
ID                  = 3c584bf6-98fc-ccab-bf42-f7a8bfded3be
Eval ID             = bc5a4cc7-d36d-f5dc-0f68-865f1c2b73f5
Name                = example.test-group[0]
Node ID             = 4e77df47-432b-c730-c553-abfa12ab84bc
Node Name           = node
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2023-07-17T16:28:14-04:00
Modified            = 2023-07-17T16:28:14-04:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.471µs
Failures            = 0
 
Task "poststop" (poststop) is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = 2023-07-17T20:28:14Z
Finished At    = 2023-07-17T20:28:14Z
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time                       Type        Description
2023-07-17T16:28:14-04:00  Terminated  Exit Code: 0
2023-07-17T16:28:14-04:00  Started     Task started by client
2023-07-17T16:28:14-04:00  Task Setup  Building Task Directory
2023-07-17T16:28:14-04:00  Received    Task received by client
 
Task "test-task" is "dead"
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = 2023-07-17T20:28:14Z
Finished At    = 2023-07-17T20:28:14Z
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time                       Type        Description
2023-07-17T16:28:14-04:00  Terminated  Exit Code: 0
2023-07-17T16:28:14-04:00  Started     Task started by client
2023-07-17T16:28:14-04:00  Task Setup  Building Task Directory
2023-07-17T16:28:14-04:00  Received    Task received by client
 
Placement Metrics
Node                                  binpack  job-anti-affinity  node-affinity  node-reschedule-penalty  final score
4e77df47-432b-c730-c553-abfa12ab84bc  0.00497  0                  0              0                        0.00497

Now, let's start anything else. It doesn't really matter what the second job is -- all we care about is that the allocation on the client gets garbage collected, which is guaranteed by the gc_max_allocs = 1 setting; any trivial second batch job will do (see the sketch below). Once the new job has run, the exact same query to the endpoint above lacks metadata, as shown in the output that follows.
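For example, a second copy of a trivial batch job is enough to push the client over gc_max_allocs (the file, job, group, and task names here are arbitrary):

$ cat second_job.hcl
job "example2" {
  datacenters = ["dc1"]
  type = "batch"

  group "other-group" {
    task "other-task" {
      driver = "raw_exec"

      config {
        command = "echo"
        args = [ "baz" ]
      }
    }
  }
}

$ nomad job run second_job.hcl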

# Look at the original information
$ ./nomad alloc status -verbose 3c584bf6
ID                  = 3c584bf6-98fc-ccab-bf42-f7a8bfded3be
Eval ID             = bc5a4cc7-d36d-f5dc-0f68-865f1c2b73f5
Name                = example.test-group[0]
Node ID             = 4e77df47-432b-c730-c553-abfa12ab84bc
Node Name           = node
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2023-07-17T16:28:14-04:00
Modified            = 2023-07-17T16:29:08-04:00
Evaluated Nodes     = 1
Filtered Nodes      = 0
Exhausted Nodes     = 0
Allocation Time     = 63.471µs
Failures            = 0
 
Task "poststop" (poststop) is ""
Task Resources:
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB 
 
Task Events:
Started At     = N/A
Finished At    = 2023-07-17T16:29:08-04:00
Total Restarts = 0
Last Restart   = N/A
 
Recent Events:
Time  Type  Description
 
[...]

Or in JSON form:

$ ./nomad alloc status -json -verbose 3c584bf6 | jq .TaskStates.poststop
{
  "Events": null,
  "Failed": false,
  "FinishedAt": "2023-07-17T16:29:08.347491978-04:00",
  "LastRestart": null,
  "Restarts": 0,
  "StartedAt": null,
  "State": "",
  "TaskHandle": null
}

This output contains illegal, uninitialized values.

Expected Result

The post stop task is stopped and the metadata is preserved.

Actual Result

It is not stopped and the metadata is not preserved.

Root cause

The client GC logic is quite simple: it looks through allocs that may be GCed, picks one, and then runs the Destroy logic on it:

https://github.com/hashicorp/nomad/blob/v1.5.5/client/gc.go#L183

This destroy logic is also simple:

  1. It kills and awaits the leader tasks
  2. It kills and awaits non-sidecar and non-poststop tasks
  3. It kills and awaits sidecar tasks
  4. The comments claim that post-stop tasks eventually get killed and awaited, except... they don't.

https://github.com/hashicorp/nomad/blob/v1.5.5/client/allocrunner/alloc_runner.go#L661-L736

This function returns the task states with the added kill states. However, since poststop, prestart, and poststart tasks never got proper treatment, their states are missing entirely and the existing task events get overwritten, which causes them to disappear. I think the fix for this would be to either:

  1. Explicitly add the states for post-stop tasks to the map (that is, initialize it as a copy of the existing task states), or
  2. Flip this single condition on ts.IsPoststopTask so that they are killed alongside sidecar tasks, which is what the comments claim was the intended behavior (https://github.com/hashicorp/nomad/blob/v1.5.5/client/allocrunner/alloc_runner.go#L714).

I believe the first treatment would be less invasive and less "different" from what happens today. The second treatment seems to be the more correct one, since right now these tasks simply do not get killed, but it assumes that we want to split the regular stop behavior from the behavior that occurs at GC time.

@schmichael
Member

Is it ok if we ignore the "1. Is not killed" aspect? As far as I can tell that's working as intended: if I change the post stop task's command to sleep 120, then starting a second job will not kill the alloc still running its post stop task. This is by design, and we log the temporary bypass of gc_max_allocs:

    2023-07-20T16:49:49.099-0700 [WARN]  client.gc: garbage collection skipped because no terminal allocations: reason="number of allocations (2) is over the limit (1)"

However, if I use your jobspec so that the poststop task, and therefore the whole alloc, is terminal ("dead" and "complete" respectively), I can reproduce your corrupted task events state. So there's definitely a bug here; I'm just not sure it's related to killing poststop tasks.

For anyone lazy like me: if you create a file maxallocs.hcl containing:

client {
  gc_max_allocs = 1
}

You can use that with nomad agent -dev -config maxallocs.hcl to reproduce this issue.
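Put together, the dev-mode reproduction then looks roughly like this (file names assumed from the snippets above; the alloc ID placeholder is whatever the first run produces):

$ nomad agent -dev -config maxallocs.hcl

# in another terminal: run the jobspec from the issue body,
# then any second job to push the client past gc_max_allocs
$ nomad job run job_spec.hcl
$ nomad job run second_job.hcl

# the first allocation's poststop task events are now gone
$ nomad alloc status -verbose <alloc-id-of-first-job>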

@stswidwinski
Contributor Author

Thank you for taking a look!

Is it ok if we ignore the 1. Is not killed aspect because as far as I can tell that's working as intended: if I change the post stop task's command to sleep 120, then starting a second job will not kill the alloc still running its post stop task. This is by design and we log the temporary bypass of gc_max_allocs: [...]

I agree that this isn't the core of the problem and I think it is quite OK for us to ignore it as a stand-alone problem. The primary reason I mention the killing is that I believe it to be the root of the issue within the implementation -- I hope that the description and PR make it somewhat clearer, but please do let me know if you think this is wrong or otherwise confusing.

Philosophically speaking, unbounded run-times of post-stop tasks are somewhat bothersome in that the task spec itself may specify values which imply that they are meant to be stopped (for instance, kill_timeout), whereas in reality they will never be stopped without a "special" event (such as an OOM kill).

This may of course be circumvented by using /usr/bin/timeout or similar, but evoking the same behavior of SIGINT/SIGTERM followed by a later SIGKILL (consistent with the normal task treatment) is somewhat of an exercise in re-implementing the existing functionality.
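For illustration only, wrapping a poststop command in coreutils timeout might look like the following (the cleanup script path is a made-up placeholder); it hand-rolls the TERM-then-KILL escalation that kill_timeout would otherwise provide:

    task "poststop" {
      driver = "raw_exec"

      lifecycle {
        hook = "poststop"
      }

      config {
        # /usr/local/bin/cleanup.sh is a placeholder for the real poststop work;
        # send SIGTERM after 60s, escalate to SIGKILL 30s later
        command = "/usr/bin/timeout"
        args    = ["--signal=TERM", "--kill-after=30s", "60s", "/usr/local/bin/cleanup.sh"]
      }
    }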

However if I use your jobspec so that the poststop task, and therefore the whole alloc, is terminal ("dead" and "complete" respectively), I can reproduce your corrupted task events state. So there's definitely a bug here, I'm just not sure it's related to killing poststop tasks. [...]

The primary reason for this assertion is dlv, which shows me that this is the place where the first illegal mutation occurs. With the PR submitted I can also no longer reproduce the issue, which further reinforces my belief. Please let me know if this makes sense / is unclear!

@schmichael
Member

schmichael commented Jul 21, 2023

whereas in reality they will never be stopped without a "special" event (such as an OOM kill).

This is a great feature request and similar to this issue despite being for batch jobs: #1782. I think making timeouts at the task level makes the most sense precisely for things like post-stop tasks.

Another way to frame this is that time is an unconstrained resource in Nomad and we should offer to constrain it like cpu or memory or filesystems.
