
Executor (e.g. raw_exec driver) WaitTask returns an error and incorrect exit code #10430

Closed
benbuzbee opened this issue Apr 22, 2021 · 6 comments
Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), type/bug

Comments

benbuzbee (Contributor) commented Apr 22, 2021

Nomad version

Dev build

$ ./bin/nomad version
Nomad v1.1.0-dev (f7437800eb6e676653fb73f2f946a8e75bf6a79d+CHANGES)

Issue

The executor that most drivers, including exec and raw_exec, use to wait on child processes to exit does not wait correctly.

Reproduction steps

  1. sudo ./bin/nomad agent -dev
  2. create bug.hcl, which runs a simple inline bash script that exits with code 123 when it receives SIGTERM or SIGINT - see the "Job file" section for a copy
  3. ./bin/nomad job run bug.hcl
  4. ./bin/nomad job stop bug
  5. Inspect status of alloc
  6. Note that the exit code is incorrect and the exit message shows an error from the executor wait function

Expected Result

Exit Code: 123, with a non-error exit message

Actual Result

Note: the exit code is incorrect, and the message shows an error from the executor:

$ ./bin/nomad status 3089
ID                  = 30892551-9007-3030-192b-9daaa26765ea
Eval ID             = b0b9c5e1
Name                = bug.bug[0]
Node ID             = c15bbff6
Node Name           = browser-devbox
Job ID              = bug
Job Version         = 18
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 20s ago
Modified            = 6s ago

Task "bug" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  33 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2021-04-23T17:49:00Z
Finished At    = 2021-04-23T17:49:13Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-04-23T17:49:13Z  Killed      Task successfully killed
2021-04-23T17:49:13Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"
2021-04-23T17:49:05Z  Killing     Sent interrupt. Waiting 1m0s before force killing
2021-04-23T17:49:00Z  Started     Task started by client
2021-04-23T17:49:00Z  Task Setup  Building Task Directory
2021-04-23T17:49:00Z  Received    Task received by client
vagrant@browser-devbox:~/nomad$ ./bin/nomad logs 3089
Hello world
Sleeping now!
Goodbye world

Job file (if appropriate)

job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {

    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      kill_timeout = "1m"
      config {
        command = "/bin/bash"
        args = [
          "${NOMAD_TASK_DIR}/bug.sh",
        ]
      }

      template {
        data = <<EOH
#!/bin/bash
echo "Hello world"
# On exit (for any reason), print a farewell and exit with code 123.
trap "echo Goodbye world && exit 123" EXIT
# On SIGINT/SIGTERM, simulate slow shutdown work; the EXIT trap then fires.
trap "echo Sleeping now! && sleep 10" SIGINT SIGTERM
# Sleep in the background and wait on it so bash can run the signal traps
# promptly; a foreground sleep would delay trap handling until it exits.
sleep 1000 &
wait $!

EOH

        destination = "local/bug.sh"
      }
    }
  }
}
prestonp commented Apr 23, 2021

It looks like the job is trying to do `go run` as root, in which case you probably don't have the GOPATH and GOROOT environment variables set.

If you build the binary and run that, it should work:

go build -o /path/to/bug bug.go
job "bug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "bug" {

    task "bug" {
      driver      = "raw_exec"
      kill_signal = "SIGTERM"
      config {
        command = "/path/to/bug"
      }
    }
  }
}
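
For reference, the bug.go this assumes isn't shown anywhere in the thread; a hypothetical reconstruction that matches the logs below (block on the kill signal, then exit 123) might look like:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	fmt.Println("Hello world!")

	// Block until Nomad delivers the kill signal ("nomad job stop").
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
	<-sigCh

	// Do some shutdown work, then exit with the code the task events
	// should report.
	fmt.Println("Goodbye world!")
	os.Exit(123)
}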

Then the logs are what we expect after nomad job stop:

$ nomad logs ab0e149b
Hello world!
Goodbye world!

However, the exit code is still incorrectly reported as 0. This appears to be a valid bug:

2021-04-23T16:33:57Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"

benbuzbee changed the title from "Executor (e.g. raw_exec driver) does not wait on process correctly" to "Executor (e.g. raw_exec driver) WaitTask returns an error and incorrect exit code" Apr 23, 2021
notnoop (Contributor) commented Apr 25, 2021

Thanks @benbuzbee! Good point - this is confusing indeed. Looking at your case and the code, I believe this is localized to allocs that were explicitly stopped by Nomad (due to a user request or a job upgrade), probably with the belief that the exit code doesn't reflect the task state properly in that case.

If an alloc is stopped, the Exit Message should be meaningful to users: e.g. "Alloc was stopped due to {job update|user request}", rather than the cryptic message.

Would love your suggestions on how to handle the exit code: what do you think about reporting a 0 exit code (or removing the field) and including the true exit code in the message? Reporting non-zero may signal to downstream monitoring services that the task failed and needs attention - operators may need to update their monitoring dashboards to account for job upgrades and other reasons for stopping. There are also backward compatibility concerns about impacting users unexpectedly.

notnoop self-assigned this Apr 25, 2021
notnoop added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) Apr 25, 2021
notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 25, 2021
notnoop moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Apr 25, 2021
benbuzbee (Contributor, Author) commented

An example where the exit code is relevant even after stop: on shutdown, a service does some cleanup - it deletes entries in a database, or flushes state to a file. If that fails, someone might want to know from the exit code.

So for me the simplest thing would be for it to be non-zero so we can check that, but that's because I have no sympathy for backwards compatibility :) If there were a status message that indicated "Exit code 123 after stopped", I suppose that would be a good compromise. I do think it might be confusing, though; e.g. if I were writing some automation to check the exit code the API returns, I would not expect this behavior.
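
For illustration, automation like that against the Nomad Go API might look roughly like the sketch below (the alloc ID is the one from the repro above, and error handling is minimal; a sketch, not a definitive implementation):

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Alloc ID taken from the repro above; substitute your own.
	alloc, _, err := client.Allocations().Info("30892551-9007-3030-192b-9daaa26765ea", nil)
	if err != nil {
		log.Fatal(err)
	}

	for task, state := range alloc.TaskStates {
		for _, ev := range state.Events {
			// "Terminated" is the event type shown in the task events above.
			if ev.Type == "Terminated" && ev.ExitCode != 0 {
				// e.g. alert someone: shutdown cleanup failed
				fmt.Printf("task %q exited non-zero: %d\n", task, ev.ExitCode)
			}
		}
	}
}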

notnoop (Contributor) commented May 3, 2021

Thanks @benbuzbee for the explanation. I'm convinced :). Upon investigation, I noticed that the current behavior depends on the driver (docker seemed to report the proper exit code regardless) and on some raciness in the order of operations. It's indeed best to report the exit code the app returns, without re-interpretation by Nomad.

I have opened a PR to address the issue in #10494. I'll discuss back-porting it to a 1.0.x release with the team. Let us know if you have any questions!

notnoop pushed a commit that referenced this issue May 4, 2021
This commit ensures Nomad captures the task exit code more reliably, even when the task is killed. This issue affects the `raw_exec` driver, as noted in #10430.

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask`, which should return the task exit code on completion. However, it can also return a "context canceled" error if the agent/executor is shut down.

Previously, when a task was to be stopped, the killTask path made two WaitTask calls, and the second occasionally returned "context canceled" because of a race between the task shutting down and, depending on the driver, how fast the executor shuts down after the task completes.

By making a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shut down or the context expires.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring third-party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may keep retrying to kill an already-dead task for up to 5 minutes before giving up.
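
To make the shape of the fix concrete, here is a minimal runnable Go sketch of the single-WaitTask pattern described above (all names are hypothetical stand-ins, not Nomad's actual TaskRunner or plugin interfaces):

package main

import "fmt"

// ExitResult stands in for the driver's exit result type (hypothetical).
type ExitResult struct {
	ExitCode int
}

type taskRunner struct {
	waitCh chan *ExitResult // fed by the one and only WaitTask call
	doneCh chan struct{}    // closed once the result has been captured
	result *ExitResult      // written exactly once, before doneCh closes
}

// run captures the result of the single wait exactly once, so every later
// reader sees the real exit code instead of racing executor shutdown.
func (tr *taskRunner) run() {
	go func() {
		tr.result = <-tr.waitCh // capture the exit code while the executor is alive
		close(tr.doneCh)
	}()
}

// killTask signals the task, then waits on the cached result instead of
// issuing a second wait call that could return "context canceled".
func (tr *taskRunner) killTask(kill func()) *ExitResult {
	kill()
	<-tr.doneCh
	return tr.result
}

func main() {
	tr := &taskRunner{
		waitCh: make(chan *ExitResult, 1),
		doneCh: make(chan struct{}),
	}
	tr.run()
	// Simulate the task dying with code 123 in response to the kill.
	res := tr.killTask(func() { tr.waitCh <- &ExitResult{ExitCode: 123} })
	fmt.Println("exit code:", res.ExitCode) // prints: exit code: 123
}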
tgross (Member) commented May 17, 2021

Shipped in 1.1.0-rc1

tgross closed this as completed May 17, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done May 17, 2021
tgross pushed a commit that referenced this issue May 17, 2021 (same commit message as above)
github-actions (bot) commented

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited the conversation to collaborators Oct 19, 2022