
Nomad 0.9: raw_exec task processes leaking from allocations stopped after agent restart #5593

Closed
cheeseprocedure opened this issue Apr 22, 2019 · 3 comments · Fixed by #5598

Comments

@cheeseprocedure
Contributor

cheeseprocedure commented Apr 22, 2019

Nomad version

Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

Operating system and Environment details

  • Ubuntu 16.04 LTS (our prod environment)
  • macOS Mojave (reproduction case)

Issue

tl;dr: processes launched by raw_exec tasks are orphaned if the Nomad agent restarts and a subsequent job stop/update stops allocations on the host that were already running before the agent restarted.

We use Nomad to run a simple guest VM supervisor process via raw_exec. After deploying Nomad 0.9 to a limited set of datacenters, a monitor started firing intermittently, indicating that an instance of our supervisor was running without a Nomad executor parent process.

After some digging, we found this was happening to allocations started prior to a restart of the local Nomad agent. A while back, we tweaked our configuration management tooling as a workaround for #4413 so Nomad agents running in client mode would simply restart (regardless of whether TLS key material had been updated). Because of this behaviour, we often have allocations running that are older than the current Nomad agent process.

Allocations impacted by this issue have an event log much like this:

Recent Events:
Time                  Type        Description
2019-04-18T18:51:17Z  Killed      Task successfully killed
2019-04-18T18:51:17Z  Terminated  Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /tmp/plugin415396184: connect: connection refused\""
2019-04-18T18:51:12Z  Killing     Sent interrupt
2019-04-17T18:42:49Z  Started     Task started by client
2019-04-17T18:42:48Z  Task Setup  Building Task Directory
2019-04-17T18:42:48Z  Received    Task received by client

Despite Nomad reporting Task successfully killed, the process launched by the raw_exec driver is still running at this point.

When Nomad reports Sent interrupt, the allocation's process does receive the appropriate signal; our tooling kicks off graceful shutdown/eventual forced termination of our guest VMs, so the orphaned process doesn't stick around for long. (This is why our orphaned process monitor was flapping.)

While submitting this issue, I realized the same failure mode occurs after restarting the Nomad agent and calling nomad stop [job_name] to stop that job's allocations on the agent's host. That's the case documented below.

Reproduction steps

Configs/resources

/tmp/nomad_config.json:

{
    "datacenter": "macbook",
    "data_dir": "/tmp/nomad_data",
    "disable_update_check": true,
    "enable_syslog": true,
    "leave_on_terminate": true,
    "syslog_facility": "LOCAL0",
    "log_level": "DEBUG",
    "server": {
        "bootstrap_expect": 1,
        "enabled": true
    },
    "client": {
        "enabled": true,
        "options": {
            "driver.whitelist": "raw_exec",
            "driver.raw_exec.enable": "1"
        }
    }
}

/tmp/testjob.sh:

#!/bin/bash

# Log a timestamped line tagged with the task group, the short alloc ID, and this script's PID.
function log {
    echo "$(date) - testjob - ${NOMAD_GROUP_NAME} - $(echo ${NOMAD_ALLOC_ID} | cut -f1 -d-) - PID $$ - ${1}" \
        | tee -a /tmp/testjob.log
}

function handleint {
    log "received interrupt but ignoring it"
}
trap handleint INT

SLEEPTIME="5"
while true; do
    log "sleeping for ${SLEEPTIME} seconds..."
    sleep "${SLEEPTIME}"
done

Procedure

  1. Run the Nomad agent in its own terminal session: nomad agent -config /tmp/nomad_config.json
  2. Run the simple job definition below: nomad run /tmp/testjob.hcl
  3. tail -f /tmp/testjob.log to confirm the allocation is live and generating output
  4. In the Nomad agent's terminal session, hit Ctrl-C to exit gracefully; once the agent has exited, run it again.
  5. Stop the Nomad job: nomad stop testjob
  6. /tmp/testjob.log should immediately include a received interrupt but ignoring it line from the current allocation, and nomad status [alloc_id] should include a Sent interrupt event.
  7. Wait a few seconds (if necessary), then look at the allocation status again. It should include a Task successfully killed event following an RPC error containing all SubConns are in TransientFailure, and /tmp/testjob.log should continue showing output from the still-running bash script (now with a parent PID of 1) launched by the supposedly-complete allocation:

2019-04-22T14:08:28-07:00 Terminated Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin954529704: connect: connection refused\""

Job file (if appropriate)

/tmp/testjob.hcl:

job "testjob" {
  datacenters = ["macbook"]
  type = "service"
  update {
    stagger      = "3s"
    max_parallel = 1
  }
  group "group1" {
    task "task1" {
      kill_timeout = "30s"
      driver = "raw_exec"
      config {
        command = "/tmp/testjob.sh"
      }
      resources {
        cpu = 50
      }
    }
  }
}

Nomad Client logs (if appropriate)

Nomad agent logs after a Nomad client restart and nomad stop testjob:

    2019-04-22T14:43:50.921-0700 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=testjob duration=1.000755ms
    2019-04-22T14:43:50.924-0700 [DEBUG] http: request complete: method=GET path=/v1/job/testjob duration=446.263µs
    2019-04-22T14:43:50.928-0700 [DEBUG] worker: dequeued evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.928-0700 [DEBUG] http: request complete: method=DELETE path=/v1/job/testjob?purge=false duration=2.704206ms
    2019-04-22T14:43:50.928-0700 [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 1)
Deployment Update for ID "9d3f5652-5912-9e71-32b0-ad1499e8155f": Status "cancelled"; Description "Cancelled because job is stopped"
Desired Changes for "group1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "group2": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
    2019-04-22T14:43:50.930-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=289.389µs
    2019-04-22T14:43:50.931-0700 [DEBUG] worker: submitted plan for evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.931-0700 [DEBUG] worker.service_sched: setting eval status: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c job_id=testjob namespace=default status=complete
    2019-04-22T14:43:50.931-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=169.652µs
    2019-04-22T14:43:50.933-0700 [DEBUG] worker: updated evaluation: eval="<Eval "8ca24f07-4e90-a5d5-00ea-47ebfd2be47c" JobID: "testjob" Namespace: "default">"
    2019-04-22T14:43:50.933-0700 [DEBUG] worker: ack evaluation: eval_id=8ca24f07-4e90-a5d5-00ea-47ebfd2be47c
    2019-04-22T14:43:50.933-0700 [DEBUG] client: updated allocations: index=1460 total=5 pulled=5 filtered=0
    2019-04-22T14:43:50.933-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=5 ignored=0
    2019-04-22T14:43:50.937-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=5 ignored=0 errors=0
    2019-04-22T14:43:50.939-0700 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1 error="rpc error: code = Unavailable desc = transport is closing"
    2019-04-22T14:43:50.940-0700 [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 backoff=5s error="executor Shutdown failed: rpc error: code = Unavailable desc = transport is closing"
    2019-04-22T14:43:50.940-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.stats_hook: error fetching stats of task: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1 error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /var/folders/f1/004zzwdd3ml7x9j1q8thh3b00000gn/T/plugin630219183: connect: connection refused""
    2019-04-22T14:43:51.185-0700 [DEBUG] client: updated allocations: index=1462 total=5 pulled=4 filtered=1
    2019-04-22T14:43:51.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:51.188-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
    2019-04-22T14:43:51.926-0700 [DEBUG] client.driver_mgr.raw_exec.executor: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 driver=raw_exec task_name=task1
    2019-04-22T14:43:51.936-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c duration=230.683µs
    2019-04-22T14:43:51.937-0700 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8ca24f07-4e90-a5d5-00ea-47ebfd2be47c/allocations duration=92.024µs
    2019-04-22T14:43:55.944-0700 [INFO ] client.gc: marking allocation for GC: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6
    2019-04-22T14:43:56.186-0700 [DEBUG] client: updated allocations: index=1463 total=5 pulled=4 filtered=1
    2019-04-22T14:43:56.186-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:56.189-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
    2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: reattached plugin process exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:58.931-0700 [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:58.932-0700 [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=65a11666-b62d-f489-4bcd-9b81ea8384f6 task=task1
    2019-04-22T14:43:59.187-0700 [DEBUG] client: updated allocations: index=1464 total=5 pulled=4 filtered=1
    2019-04-22T14:43:59.187-0700 [DEBUG] client: allocation updates: added=0 removed=0 updated=4 ignored=1
    2019-04-22T14:43:59.190-0700 [DEBUG] client: allocation updates applied: added=0 removed=0 updated=4 ignored=1 errors=0
notnoop pushed a commit that referenced this issue Apr 23, 2019
Fixes #5593

The executor seems to die unexpectedly after the nomad agent dies or is
restarted.  The crash seems to occur at the first log message after
the nomad agent dies.

To ease debugging, we forward executor log messages to executor.log as
well as to Stderr.  `go-plugin` sets up plugins with Stderr pointing to
a pipe read by the plugin client, the nomad agent in our case [1].
When the nomad agent dies, the pipe is closed, and any subsequent
executor log writes fail with ErrClosedPipe and a SIGPIPE signal.
SIGPIPE results in the executor process dying.

I considered adding a handler to ignore SIGPIPE, but the hc-log library
currently panics when a logging write operation fails [2].

Thus we opt to revert to the v0.8 behavior of exclusively writing logs
to executor.log while we investigate alternative options.

[1] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-plugin/client.go#L528-L535
[2] https://github.com/hashicorp/nomad/blob/v0.9.0/vendor/github.com/hashicorp/go-hclog/int.go#L320-L323
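
The commit message above pins the crash on writes to a closed stderr pipe: go-plugin gives the executor a Stderr that is a pipe read by the agent, and once that reader is gone, the Go runtime turns the failed write on fd 2 into a fatal SIGPIPE. The following is a minimal, standalone Go sketch of that mechanism only; it is not Nomad or go-plugin code, and the file name plus the parent/child roles are illustrative stand-ins for the agent and executor.

// sigpipe_demo.go (hypothetical file name) - Unix-only sketch of the failure mode.
package main

import (
    "fmt"
    "os"
    "os/exec"
    "time"
)

func main() {
    if len(os.Args) > 1 && os.Args[1] == "child" {
        // "Executor" role: keep logging to stderr. Stderr here is a pipe whose
        // read end is held by the parent; once that end is closed, the Go
        // runtime raises SIGPIPE for the failed write on fd 2 and the process dies.
        for i := 0; ; i++ {
            fmt.Fprintf(os.Stderr, "executor log line %d\n", i)
            time.Sleep(500 * time.Millisecond)
        }
    }

    // "Agent" role: wire the child's stderr to a pipe we read, roughly what a
    // plugin client does for its plugin processes.
    cmd := exec.Command(os.Args[0], "child")
    stderr, err := cmd.StderrPipe()
    if err != nil {
        panic(err)
    }
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    buf := make([]byte, 256)
    n, _ := stderr.Read(buf)
    fmt.Printf("parent read: %s", buf[:n])

    // Closing the read end stands in for the agent process going away.
    stderr.Close()

    // The child's next log write now hits a broken pipe; the default SIGPIPE
    // action terminates it.
    time.Sleep(2 * time.Second)
    err = cmd.Wait()
    fmt.Printf("child exited: %v\n", err) // expect: signal: broken pipe
}

Running it (go run sigpipe_demo.go) should show the child being killed by a broken pipe shortly after the parent closes its end of the stderr pipe. The alternative the commit mentions, ignoring SIGPIPE (e.g. signal.Ignore(syscall.SIGPIPE) in the child), would turn the fatal signal into an EPIPE write error instead; per the linked hclog code, that error currently becomes a panic, which is why the fix reverts to writing executor logs only to executor.log.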
@preetapan
Contributor

@cheeseprocedure - We just released 0.9.1-rc1, which resolves this issue; it is available at https://releases.hashicorp.com/nomad/0.9.1-rc1/. It would be great if you could try it out and let us know whether it fixes what you saw.

@cheeseprocedure
Contributor Author

Thanks @preetapan - I've been unable to reproduce these failures on 0.9.1-rc1!

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
