
job still running though received kill signal #5395

Closed
FlankMiao opened this issue Mar 6, 2019 · 26 comments · Fixed by #5975

Comments

@FlankMiao

FlankMiao commented Mar 6, 2019

Nomad version

0.9.0-beta3

Operating system and Environment details

OS: CentOS
OS version: 7.6.1810

Issue

Driver: exec
The job jobA runs one instance per node.
I clicked the stop button in the nomad-ui. Some allocations are [completed] immediately, but others are still [running] even though the job status is dead, like this:

[screenshot]

Here are the events of one allocation that is still running:

[screenshot]

Log from the Nomad client:
2019-03-06T08:56:52.296+0800 [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=ce203f54-78cd-fa32-aff0-391c20e219ba driver=exec task_name=artifact error="rpc error: code = Unavailable desc = transport is closing"

When I kill the process manually, it's still shown as running.

When I click the start button, the node that still has jobA running gets a pending jobA, like this:
[screenshot]

The /system/gc and client/allocation/:alloc_id/gc APIs don't work.
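For reference, a minimal sketch of how those GC endpoints can be hit with curl (assuming the agent's default HTTP address; the alloc ID is just an example taken from the log above):

```sh
# Force a cluster-wide garbage collection.
curl -X PUT http://127.0.0.1:4646/v1/system/gc

# Ask the client to GC one allocation (ID is a placeholder).
curl http://127.0.0.1:4646/v1/client/allocation/ce203f54-78cd-fa32-aff0-391c20e219ba/gc
```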

When I remove the node that still has jobA running and run a GC, somehow it all works fine.

@preetapan
Contributor

@FlankMiao can you share a jobspec so we can try to reproduce it on our end?

@FlankMiao
Author

FlankMiao commented Mar 7, 2019

@preetapan
The jobspec is below. Note that this doesn't happen on every node.

```hcl
job "artifact-agent" {
  datacenters = ["dc1"]
  type        = "system"
  namespace   = "default"

  update {
    max_parallel     = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    auto_revert      = true
    canary           = 0
  }

  group "artifact" {
    restart {
      attempts = 10
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }

    ephemeral_disk {
      size = 3000
    }

    task "artifact" {
      driver = "exec"

      config {
        command = "artifact"
      }

      artifact {
        source = "....artifact.tar.gz"
      }

      service {
        name         = "artifact-agent"
        tags         = ["a:b"]
        address_mode = "host"
        port         = "http"

        check {
          port           = "http"
          type           = "http"
          name           = "check_at-http_health"
          path           = "/ping"
          interval       = "10s"
          timeout        = "5s"
          initial_status = "warning"

          check_restart {
            limit           = 3
            grace           = "20s"
            ignore_warnings = false
          }
        }
      }

      logs {
        max_files     = 10
        max_file_size = 15
      }

      resources {
        cpu    = 500
        memory = 40

        network {
          port "http" {
            static = "8080"
          }
        }
      }
    }
  }
}
```
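In case anyone wants to try reproducing this, the spec can be checked and submitted with the usual CLI (the file name here is hypothetical):

```sh
nomad job validate artifact-agent.nomad   # sanity-check the spec
nomad job run artifact-agent.nomad        # submit it
```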

@prologic

Running into a similar issue as well;

$ nomad -v
Nomad v0.9.0-beta2 (43aa7b138b4900a8726369520e22a9ddcefc5988+CHANGES)

See my comments on #5363

@preetapan
Contributor

preetapan commented Mar 11, 2019

@FlankMiao I ran your job spec (but used my own executable for the artifact) and noticed that the container would get killed after it started, but not always. It seems to be related to the memory (40); after I increased it, things seemed to stabilize. Could you see if increasing memory in the resource stanza to something like 75 makes this behavior go away? That'll help us determine next steps.
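If it helps, one quick way to confirm whether the kernel OOM killer is involved (a sketch; exact message wording varies by kernel version):

```sh
# Look for OOM-killer activity around the time the task died.
dmesg -T | grep -iE 'out of memory|oom|killed process'
```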

@chenjpu

chenjpu commented Mar 11, 2019

I also reproduced this anomaly. When I restart the machine, Nomad doesn't seem to be able to kill the old allocation normally, and this stale allocation appears as running.

[screenshot]

@FlankMiao
Author

@preetapan I have tried that, but this still happens.

@preetapan
Contributor

Can you share all the logs from the client on and after the time you stopped the job? It's been hard to reproduce this (a couple of us on the team tried).
Also, can you confirm that when you manually (i.e., outside of Nomad) send a SIGINT to the executable you are running, it is able to terminate?
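For the manual check, something like the following should do (a sketch; `artifact` is the command name from the jobspec above):

```sh
# Find the task's process and send it SIGINT outside of Nomad...
pid="$(pgrep -f artifact | head -n 1)"
kill -INT "$pid"
sleep 2
# ...then confirm it actually exited.
kill -0 "$pid" 2>/dev/null && echo "still running" || echo "terminated"
```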

@FlankMiao
Author

FlankMiao commented Mar 11, 2019

@preetapan It's hard to reproduce, and this can happen with different jobs; this time it's the promtail job.
When I click the stop button on the nomad-ui:

Here is the log from the Nomad client:

2019-03-11T12:17:06.863+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=6d3679d1-f40e-f2ce-080c-3829874ee7ac driver=raw_exec task_name=promtail error="rpc error: code = Unavailable desc = transport is closing"
2019-03-11T12:17:06.960+0800 [INFO ] client.gc: marking allocation for GC: alloc_id=6d3679d1-f40e-f2ce-080c-3829874ee7ac
2019/03/11 12:17:06.967491 [INFO] (runner) stopping
2019/03/11 12:17:06.967606 [INFO] (runner) received finish

Here is the log from the Nomad server:

2019-03-11T12:17:06.785+0800 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=3228fc8d-dff6-7517-88f4-be81dbb47cdb driver=raw_exec task_name=promtail error="rpc error: code = Unavailable desc = transport is closing"
2019-03-11T12:17:06.791+0800 [INFO ] client.gc: marking allocation for GC: alloc_id=3228fc8d-dff6-7517-88f4-be81dbb47cdb
2019/03/11 12:17:06.798793 [INFO] (runner) stopping
2019/03/11 12:17:06.798926 [INFO] (runner) received finish

Here is the state from the nomad-ui:
[screenshot]
[screenshot]

Here is the alloc dir:
[screenshot]

The job with alloc ID 08011041 doesn't exist in it.

When I click the start button:

Logs from the Nomad server:

2019/03/11 12:31:56.629972 [INFO] (runner) creating new runner (dry: false, once: false)
2019/03/11 12:31:56.630079 [INFO] (runner) creating watcher
2019/03/11 12:31:56.630197 [INFO] (runner) starting
2019/03/11 12:31:56.632365 [INFO] (runner) rendered "(dynamic)" =>"/var/lib/mos/nomad/alloc/6e879dbd-59c9-c9b1-ca76-a2923e9c396a/promtail/local/promtail.yml"
2019-03-11T12:31:56.633+0800 [INFO ] client.driver_mgr.raw_exec: starting task: driver=raw_exec

Now there are two running on the node: one is 08011xx, the other is the newly started one, like this:
[screenshot]

I killed the process outside Nomad with the kill command, and it still shows in Nomad.

@prologic

> I killed the process outside Nomad with the kill command, and it still shows in Nomad.

See the separate issue I created; I've been able to reproduce this exact behaviour. NB: I created a separate issue to capture more precisely what I was observing (Nomad not observing OOM kills by the OOM killer, or SIGKILL in general).

@prologic

In our specific case we are getting bitten by the kernel OOM killer, as our memory reservations (using the exec driver) are too low and we have resource-intensive Java microservices. Our workaround (for Nomad's lack of observing OOM kills or SIGKILL) is to fine-tune our memory reservations and Java memory usage controls (heap size, etc.) to more reasonable values.
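As a rough sketch of the sizing rule we settled on (the numbers and jar name are illustrative, not from this thread): keep the JVM's total footprint, not just the heap, comfortably below the Nomad memory reservation.

```sh
# Illustrative only: with e.g. resources { memory = 1280 } in the jobspec,
# cap the heap well below that to leave headroom for metaspace, thread
# stacks, and other off-heap allocations.
java -Xms950m -Xmx950m -Xss512k -XX:MaxMetaspaceSize=128m -jar app.jar
```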

In general: don't use `kill -KILL` outside of Nomad.

@urjitbhatia

urjitbhatia commented May 8, 2019

We're seeing this as well (when jobs are OOM-killed):

    2019/05/02 16:08:23.713514 [INFO] client: task "api_runner_hostnet" for alloc "c16f63c8-20e1-f9f5-edb1-17a8a13c73bc" failed: Wait returned exit code 137, signal 0, and error OOM Killed
    2019/05/02 16:08:23.713535 [INFO] client: Restarting task "api_runner_hostnet" for alloc "c16f63c8-20e1-f9f5-edb1-17a8a13c73bc" in 5.973486006s
2019-05-02T16:08:29.689Z [DEBUG] plugin: starting plugin: path=/usr/local/bin/nomad args="[/usr/local/bin/nomad executor {"LogFile":"/opt/nomad/data/alloc/c16f63c8-20e1-f9f5-edb1-17a8a13c73bc/api_runner_hostnet/executor.out","LogLevel":"INFO"}]"
2019-05-02T16:08:29.690Z [DEBUG] plugin: waiting for RPC address: path=/usr/local/bin/nomad
2019-05-02T16:08:29.720Z [DEBUG] plugin.nomad: plugin address: timestamp=2019-05-02T16:08:29.720Z address=/tmp/plugin869775604 network=unix
    2019/05/02 16:08:30.295859 [INFO] driver.docker: created container 599902680a37be03e7f9d9e93d9f3cebf2d3341c0151aeeb471d26bbbb7ef1ea
    2019/05/02 16:08:30.664965 [INFO] driver.docker: started container 599902680a37be03e7f9d9e93d9f3cebf2d3341c0151aeeb471d26bbbb7ef1ea
2019-05-02T16:08:45.045Z [DEBUG] plugin.nomad: 2019/05/02 16:08:45 [ERR] plugin: plugin server: accept unix /tmp/plugin424224757: use of closed network connection
2019-05-02T16:08:45.047Z [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
2019-05-02T16:08:45.063Z [DEBUG] plugin.nomad: 2019/05/02 16:08:45 [ERR] plugin: plugin server: accept unix /tmp/plugin633237035: use of closed network connection
2019-05-02T16:08:45.066Z [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
2019-05-02T16:08:45.068Z [DEBUG] plugin.nomad: 2019/05/02 16:08:45 [ERR] plugin: plugin server: accept unix /tmp/plugin610706192: use of closed network connection
2019-05-02T16:08:45.070Z [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
2019-05-02T16:08:45.139Z [DEBUG] plugin.nomad: 2019/05/02 16:08:45 [ERR] plugin: plugin server: accept unix /tmp/plugin187726123: use of closed network connection
2019-05-02T16:08:45.141Z [DEBUG] plugin: plugin process exited: path=/usr/local/bin/nomad
    2019/05/02 16:08:45.830264 [INFO] driver.docker: stopped container c163ced0427379d26fda9145ea5f9595a0a710116bd37fb9bdb8d05485d7cd31

So it seems like Nomad does have some info about the task getting OOM-killed. However, it is then restarted and then effectively abandoned. I wonder if this is causing more OOM kills on our system, because Nomad then thinks it can pack more tasks onto this instance?
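For what it's worth, the 137 in that first log line already encodes the signal: shells report 128 + signal number, so exit code 137 means the process died from signal 9 (SIGKILL), which is exactly what the OOM killer sends.

```sh
echo $((137 - 128))   # prints 9
kill -l 9             # prints KILL
```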

       Advertise Addrs: HTTP: <redacted>:4646
            Bind Addrs: HTTP: 0.0.0.0:4646
                Client: true
             Log Level: INFO
                Region: global (DC: dc1)
                Server: false
               Version: 0.8.6

@yahor-filipchyk

yahor-filipchyk commented May 8, 2019

We are running into an issue where Nomad can't stop an allocation. Attaching the agent logs. Note how it keeps retrying to kill allocations a0fe089e and a8b47cce. I killed a0fe089e manually, so you can see it is marked for GC in the end. This all happened when migrating allocations to a new node. For a0fe089e, the last status included:

Desired Status       = stop
Desired Description  = alloc is being updated due to job update
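(That status output presumably came from inspecting the allocation directly, along these lines:)

```sh
# Inspect the stuck allocation; the ID prefix is taken from the logs above.
nomad alloc status a0fe089e
```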

Nomad version: Nomad v0.9.1 (4b2bdbd9ab68a27b10c2ee781cceaaf62e114399)

> nomad agent-info
client
  heartbeat_ttl = 19.002270362s
  known_servers = 10.2.1.188:4647,10.2.2.12:4647,10.2.3.243:4647
  last_heartbeat = 13.185020525s
  node_id = be46fd28-4fe9-04b2-46d2-eaeb8ec69a81
  num_allocations = 20
runtime
  arch = amd64
  cpu_count = 4
  goroutines = 735
  kernel.name = linux
  max_procs = 4
  version = go1.11.6

Agent logs:

==> Loaded configuration from /opt/nomad/config/datadog.hcl, /opt/nomad/config/default.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.2.2.87:4646
            Bind Addrs: HTTP: 0.0.0.0:4646
                Client: true
             Log Level: INFO
                Region: us-east-1 (DC: us-east-1b)
                Server: false
               Version: 0.9.1

==> Nomad agent started! Log data will stream in below:

    2019-05-08T20:21:16.521Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2019-05-08T20:21:16.576Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2019-05-08T20:21:16.577Z [INFO ] client: using state directory: state_dir=/opt/nomad/data/client
    2019-05-08T20:21:16.587Z [INFO ] client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2019-05-08T20:21:16.597Z [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
    2019-05-08T20:21:16.628Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens5
    2019-05-08T20:21:16.640Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
    2019-05-08T20:21:16.640Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
    2019-05-08T20:21:16.640Z [ERROR] client: error discovering nomad servers: error="client.consul: unable to query Consul datacenters: Get http://127.0.0.1:8500/v1/catalog/datacenters: dial tcp 127.0.0.1:8500: connect: connection refused"
    2019-05-08T20:21:19.951Z [INFO ] client: started client: node_id=be46fd28-4fe9-04b2-46d2-eaeb8ec69a81
    2019-05-08T20:21:19.951Z [WARN ] client.server_mgr: no servers available
    2019-05-08T20:21:19.951Z [WARN ] client.server_mgr: no servers available
    2019-05-08T20:21:19.957Z [INFO ] client.consul: discovered following servers: servers=10.2.2.12:4647,10.2.1.188:4647,10.2.3.243:4647
    2019-05-08T20:21:19.977Z [INFO ] client: node registration complete
    2019-05-08T20:21:26.225Z [INFO ] client: node registration complete
    2019-05-08T20:21:26.305Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c7cf566f-cb80-bef1-84ae-4eb900b4fc9a task=nginx @module=logmon path=/opt/nomad/data/alloc/c7cf566f-cb80-bef1-84ae-4eb900b4fc9a/alloc/logs/.nginx.stdout.fifo timestamp=2019-05-08T20:21:26.304Z
    2019-05-08T20:21:26.305Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=c7cf566f-cb80-bef1-84ae-4eb900b4fc9a task=nginx @module=logmon path=/opt/nomad/data/alloc/c7cf566f-cb80-bef1-84ae-4eb900b4fc9a/alloc/logs/.nginx.stderr.fifo timestamp=2019-05-08T20:21:26.305Z
    2019-05-08T20:21:26.317Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy @module=logmon path=/opt/nomad/data/alloc/ceb8becf-9023-2ed0-9994-043480092659/alloc/logs/.proxy.stdout.fifo timestamp=2019-05-08T20:21:26.317Z
    2019-05-08T20:21:26.317Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy @module=logmon path=/opt/nomad/data/alloc/ceb8becf-9023-2ed0-9994-043480092659/alloc/logs/.proxy.stderr.fifo timestamp=2019-05-08T20:21:26.317Z
    2019/05/08 20:21:27.546231 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:21:27.577170 [INFO] (runner) creating watcher
    2019/05/08 20:21:27.577294 [INFO] (runner) starting
    2019-05-08T20:21:27.631Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy error="failed to decode driver config: [pos 40]: readContainerLen: Unrecognized descriptor byte: hex: d4, decimal: 212"
    2019-05-08T20:21:27.674Z [INFO ] client.alloc_runner.task_runner: not restarting task: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy reason="Error was unrecoverable"
    2019-05-08T20:21:27.726Z [INFO ] client.gc: marking allocation for GC: alloc_id=ceb8becf-9023-2ed0-9994-043480092659
    2019/05/08 20:21:29.183155 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/c7cf566f-cb80-bef1-84ae-4eb900b4fc9a/nginx/nginx/nginx.conf"
    2019-05-08T20:21:31.715Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
    2019-05-08T20:21:31.744Z [WARN ] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy @module=logmon timestamp=2019-05-08T20:21:31.727Z
    2019-05-08T20:21:31.749Z [WARN ] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy @module=logmon timestamp=2019-05-08T20:21:31.727Z
    2019-05-08T20:21:32.582Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 task=proxy error="read |0: file already closed"
    2019-05-08T20:21:40.984Z [INFO ] client: node registration complete
    2019-05-08T20:21:54.351Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=2cf3c19e189f8b82af2e6c255a2e1ca48d83df9b2af7b218e734ad3fc9b52197
    2019-05-08T20:21:57.388Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=2cf3c19e189f8b82af2e6c255a2e1ca48d83df9b2af7b218e734ad3fc9b52197
    2019-05-08T20:22:02.866Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a2de0c6d-d859-1954-66e9-742271507859 task=proxy @module=logmon path=/opt/nomad/data/alloc/a2de0c6d-d859-1954-66e9-742271507859/alloc/logs/.proxy.stdout.fifo timestamp=2019-05-08T20:22:02.864Z
    2019-05-08T20:22:02.870Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a2de0c6d-d859-1954-66e9-742271507859 task=proxy @module=logmon path=/opt/nomad/data/alloc/a2de0c6d-d859-1954-66e9-742271507859/alloc/logs/.proxy.stderr.fifo timestamp=2019-05-08T20:22:02.868Z
    2019-05-08T20:22:55.174Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=64cad3d0-d929-e819-2481-4938985c63af task=akka @module=logmon path=/opt/nomad/data/alloc/64cad3d0-d929-e819-2481-4938985c63af/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.173Z
    2019-05-08T20:22:55.174Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=64cad3d0-d929-e819-2481-4938985c63af task=akka @module=logmon path=/opt/nomad/data/alloc/64cad3d0-d929-e819-2481-4938985c63af/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.173Z
    2019-05-08T20:22:55.298Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 task=akka @module=logmon path=/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.298Z
    2019-05-08T20:22:55.298Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 task=akka @module=logmon path=/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.298Z
    2019-05-08T20:22:55.338Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833 task=akka @module=logmon path=/opt/nomad/data/alloc/a94a2301-1393-da6d-9007-0e592ec9d833/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.337Z
    2019-05-08T20:22:55.338Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833 task=akka @module=logmon path=/opt/nomad/data/alloc/a94a2301-1393-da6d-9007-0e592ec9d833/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.337Z
    2019-05-08T20:22:55.369Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253 task=akka @module=logmon path=/opt/nomad/data/alloc/d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.369Z
    2019-05-08T20:22:55.369Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=01276ea3-b622-c867-063c-01c642173f12 task=akka @module=logmon path=/opt/nomad/data/alloc/01276ea3-b622-c867-063c-01c642173f12/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.369Z
    2019-05-08T20:22:55.370Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=01276ea3-b622-c867-063c-01c642173f12 task=akka @module=logmon path=/opt/nomad/data/alloc/01276ea3-b622-c867-063c-01c642173f12/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.369Z
    2019-05-08T20:22:55.370Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253 task=akka @module=logmon path=/opt/nomad/data/alloc/d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.369Z
    2019-05-08T20:22:55.410Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ffdadeb5-6772-2bbe-8714-77ba1a655ea3 task=akka @module=logmon path=/opt/nomad/data/alloc/ffdadeb5-6772-2bbe-8714-77ba1a655ea3/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.410Z
    2019-05-08T20:22:55.411Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=ffdadeb5-6772-2bbe-8714-77ba1a655ea3 task=akka path=/opt/nomad/data/alloc/ffdadeb5-6772-2bbe-8714-77ba1a655ea3/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:22:55.410Z
    2019-05-08T20:22:55.430Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka @module=logmon path=/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.429Z
    2019-05-08T20:22:55.430Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka @module=logmon path=/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.429Z
    2019-05-08T20:22:55.446Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e task=akka @module=logmon path=/opt/nomad/data/alloc/2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.444Z
    2019-05-08T20:22:55.446Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e task=akka @module=logmon path=/opt/nomad/data/alloc/2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.444Z
    2019-05-08T20:22:55.449Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2641cdf4-3d08-a7e8-6ce8-8e0302ff3556 task=akka @module=logmon path=/opt/nomad/data/alloc/2641cdf4-3d08-a7e8-6ce8-8e0302ff3556/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.449Z
    2019-05-08T20:22:55.450Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2641cdf4-3d08-a7e8-6ce8-8e0302ff3556 task=akka path=/opt/nomad/data/alloc/2641cdf4-3d08-a7e8-6ce8-8e0302ff3556/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:22:55.449Z
    2019-05-08T20:22:55.458Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=011a28e1-be2f-9b2d-e423-a2db30f3dc02 task=akka @module=logmon path=/opt/nomad/data/alloc/011a28e1-be2f-9b2d-e423-a2db30f3dc02/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:22:55.458Z
    2019-05-08T20:22:55.459Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=011a28e1-be2f-9b2d-e423-a2db30f3dc02 task=akka @module=logmon path=/opt/nomad/data/alloc/011a28e1-be2f-9b2d-e423-a2db30f3dc02/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.459Z
    2019-05-08T20:22:55.499Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=85a4f7db-6b75-724d-44df-e46ffbc71ec2 task=akka path=/opt/nomad/data/alloc/85a4f7db-6b75-724d-44df-e46ffbc71ec2/alloc/logs/.akka.stdout.fifo @module=logmon timestamp=2019-05-08T20:22:55.498Z
    2019-05-08T20:22:55.499Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=85a4f7db-6b75-724d-44df-e46ffbc71ec2 task=akka @module=logmon path=/opt/nomad/data/alloc/85a4f7db-6b75-724d-44df-e46ffbc71ec2/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:22:55.498Z
    2019-05-08T20:22:57.729Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=5a021549d8124c85eadae5a81a70babd26134109e01ced486e07d04557c1af2c
    2019-05-08T20:22:59.033Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=5a021549d8124c85eadae5a81a70babd26134109e01ced486e07d04557c1af2c
    2019/05/08 20:23:05.902219 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:05.902957 [INFO] (runner) creating watcher
    2019/05/08 20:23:05.903557 [INFO] (runner) starting
    2019/05/08 20:23:05.904080 [INFO] (runner) rendered "/opt/nomad/data/alloc/ffdadeb5-6772-2bbe-8714-77ba1a655ea3/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/ffdadeb5-6772-2bbe-8714-77ba1a655ea3/akka/secrets/vars.env"
    2019-05-08T20:23:05.907Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/preference-1.0-SNAPSHOT/bin/preference Args:[-Dconfig.resource=preference-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:06.080067 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.080589 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.080976 [INFO] (runner) starting
    2019/05/08 20:23:06.084214 [INFO] (runner) rendered "/opt/nomad/data/alloc/2641cdf4-3d08-a7e8-6ce8-8e0302ff3556/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/2641cdf4-3d08-a7e8-6ce8-8e0302ff3556/akka/secrets/vars.env"
    2019-05-08T20:23:06.117Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/library-1.0-SNAPSHOT/bin/library Args:[-Dconfig.resource=library-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:06.119513 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.120093 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.120534 [INFO] (runner) starting
    2019/05/08 20:23:06.121373 [INFO] (runner) rendered "/opt/nomad/data/alloc/011a28e1-be2f-9b2d-e423-a2db30f3dc02/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/011a28e1-be2f-9b2d-e423-a2db30f3dc02/akka/secrets/vars.env"
    2019/05/08 20:23:06.154832 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.154974 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.155125 [INFO] (runner) starting
    2019/05/08 20:23:06.155376 [INFO] (runner) rendered "/opt/nomad/data/alloc/64cad3d0-d929-e819-2481-4938985c63af/akka/local/config/config-staging.conf" => "/opt/nomad/data/alloc/64cad3d0-d929-e819-2481-4938985c63af/akka/secrets/vars.env"
    2019/05/08 20:23:06.160644 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.160786 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.160966 [INFO] (runner) starting
    2019/05/08 20:23:06.161185 [INFO] (runner) rendered "/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/akka/secrets/vars.env"
    2019-05-08T20:23:06.163Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/comments-1.0-SNAPSHOT/bin/comments Args:[-Dconfig.resource=comments-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:23:06.179Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/library-1.0-SNAPSHOT/bin/library Args:[-Dconfig.resource=library-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:23:06.206Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/notification-1.0-SNAPSHOT/bin/notification Args:[-Dconfig.resource=notification-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:06.284298 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.284454 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.284628 [INFO] (runner) starting
    2019/05/08 20:23:06.285020 [INFO] (runner) rendered "/opt/nomad/data/alloc/a94a2301-1393-da6d-9007-0e592ec9d833/akka/local/config/config-hotfix.conf" => "/opt/nomad/data/alloc/a94a2301-1393-da6d-9007-0e592ec9d833/akka/secrets/vars.env"
    2019-05-08T20:23:06.313Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/challenges-1.0-SNAPSHOT/bin/challenges Args:[-Dconfig.resource=challenges-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:06.344494 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.344648 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.354038 [INFO] (runner) starting
    2019/05/08 20:23:06.354331 [INFO] (runner) rendered "/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/akka/secrets/vars.env"
    2019/05/08 20:23:06.354969 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:06.355110 [INFO] (runner) creating watcher
    2019/05/08 20:23:06.359951 [INFO] (runner) starting
    2019/05/08 20:23:06.380609 [INFO] (runner) rendered "/opt/nomad/data/alloc/85a4f7db-6b75-724d-44df-e46ffbc71ec2/akka/local/config/config-staging.conf" => "/opt/nomad/data/alloc/85a4f7db-6b75-724d-44df-e46ffbc71ec2/akka/secrets/vars.env"
    2019-05-08T20:23:06.387Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/framework-1.0-SNAPSHOT/bin/framework Args:[-Dconfig.resource=framework-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:23:06.397Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/framework-1.0-SNAPSHOT/bin/framework Args:[-Dconfig.resource=framework-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:07.039538 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:07.039683 [INFO] (runner) creating watcher
    2019/05/08 20:23:07.039840 [INFO] (runner) starting
    2019/05/08 20:23:07.040069 [INFO] (runner) rendered "/opt/nomad/data/alloc/01276ea3-b622-c867-063c-01c642173f12/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/01276ea3-b622-c867-063c-01c642173f12/akka/secrets/vars.env"
    2019-05-08T20:23:07.045Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/reflect-1.0-SNAPSHOT/bin/reflect Args:[-Dconfig.resource=reflect-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms950M -J-Xmx950M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:23:07.047070 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:07.047175 [INFO] (runner) creating watcher
    2019/05/08 20:23:07.047300 [INFO] (runner) starting
    2019/05/08 20:23:07.047526 [INFO] (runner) rendered "/opt/nomad/data/alloc/d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253/akka/local/config/config-staging.conf" => "/opt/nomad/data/alloc/d9f7fc0e-ca25-e57e-e7e9-d5a3a2daf253/akka/secrets/vars.env"
    2019/05/08 20:23:07.048515 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:23:07.048642 [INFO] (runner) creating watcher
    2019/05/08 20:23:07.048777 [INFO] (runner) starting
    2019/05/08 20:23:07.048978 [INFO] (runner) rendered "/opt/nomad/data/alloc/2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/2e0a65c4-d6a8-16e5-43b9-ae1cfd5e660e/akka/secrets/vars.env"
    2019-05-08T20:23:07.050Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/harmonizer-1.0-SNAPSHOT/bin/harmonizer Args:[-Dconfig.resource=harmonizer-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms800M -J-Xmx800M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:23:07.054Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/harmonizer-1.0-SNAPSHOT/bin/harmonizer Args:[-Dconfig.resource=harmonizer-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms800M -J-Xmx800M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:24:51.083Z [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be driver=exec task_name=akka error="rpc error: code = Unavailable desc = transport is closing"
    2019-05-08T20:24:51.083Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka backoff=5s error="executor Shutdown failed: rpc error: code = Unavailable desc = transport is closing"
    2019-05-08T20:24:51.085Z [ERROR] client.driver_mgr.exec.executor.nomad: reading plugin stderr: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be driver=exec task_name=akka error="read |0: file already closed"
    2019-05-08T20:24:51.085Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka reason="Restart within policy" delay=17.003319503s
    2019-05-08T20:24:56.085Z [WARN ] client.alloc_runner.task_runner: couldn't find task to kill: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka task_id=2974cadb-723a-2e10-0ea2-94d6a92705be/akka/7aeb3806
    2019-05-08T20:24:57.187Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=e0742c4a-bde7-28cd-b1d0-93e2879d2307 task=nginx @module=logmon path=/opt/nomad/data/alloc/e0742c4a-bde7-28cd-b1d0-93e2879d2307/alloc/logs/.nginx.stdout.fifo timestamp=2019-05-08T20:24:57.187Z
    2019-05-08T20:24:57.187Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=e0742c4a-bde7-28cd-b1d0-93e2879d2307 task=nginx @module=logmon path=/opt/nomad/data/alloc/e0742c4a-bde7-28cd-b1d0-93e2879d2307/alloc/logs/.nginx.stderr.fifo timestamp=2019-05-08T20:24:57.187Z
    2019/05/08 20:24:57.319364 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:24:57.319542 [INFO] (runner) creating watcher
    2019/05/08 20:24:57.319640 [INFO] (runner) starting
    2019/05/08 20:24:57.339829 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/e0742c4a-bde7-28cd-b1d0-93e2879d2307/nginx/nginx/nginx.conf"
    2019-05-08T20:24:57.461Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f06d5440-820c-efb5-bf31-3f196ca8d95f task=nginx @module=logmon path=/opt/nomad/data/alloc/f06d5440-820c-efb5-bf31-3f196ca8d95f/alloc/logs/.nginx.stdout.fifo timestamp=2019-05-08T20:24:57.461Z
    2019-05-08T20:24:57.461Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=f06d5440-820c-efb5-bf31-3f196ca8d95f task=nginx @module=logmon path=/opt/nomad/data/alloc/f06d5440-820c-efb5-bf31-3f196ca8d95f/alloc/logs/.nginx.stderr.fifo timestamp=2019-05-08T20:24:57.461Z
    2019/05/08 20:24:57.493891 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:24:57.494321 [INFO] (runner) creating watcher
    2019/05/08 20:24:57.494421 [INFO] (runner) starting
    2019-05-08T20:24:57.503Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=25c1469b-6c2d-a330-51c6-35ec0b0d02ff task=nginx @module=logmon path=/opt/nomad/data/alloc/25c1469b-6c2d-a330-51c6-35ec0b0d02ff/alloc/logs/.nginx.stdout.fifo timestamp=2019-05-08T20:24:57.503Z
    2019-05-08T20:24:57.503Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=25c1469b-6c2d-a330-51c6-35ec0b0d02ff task=nginx path=/opt/nomad/data/alloc/25c1469b-6c2d-a330-51c6-35ec0b0d02ff/alloc/logs/.nginx.stderr.fifo @module=logmon timestamp=2019-05-08T20:24:57.503Z
    2019/05/08 20:24:57.518003 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/f06d5440-820c-efb5-bf31-3f196ca8d95f/nginx/nginx/nginx.conf"
    2019/05/08 20:24:57.559240 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:24:57.559405 [INFO] (runner) creating watcher
    2019/05/08 20:24:57.559490 [INFO] (runner) starting
    2019/05/08 20:24:57.662595 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/25c1469b-6c2d-a330-51c6-35ec0b0d02ff/nginx/nginx/nginx.conf"
    2019-05-08T20:24:57.683Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=3f9892f3a41f00f3197744461a90bcc0292fc2ebeadc4a9fd0ee859af096a4be
    2019-05-08T20:24:57.843Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=40369af583f415894dd17ef6266191de11ac1469cf23760f1acc12ca16bb2e0d
    2019-05-08T20:24:57.930Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=4e594ec2c1e91201d4424c4b2f9a524867b810c881b3be781a3d2a99e97937ec
    2019-05-08T20:25:00.571Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=3f9892f3a41f00f3197744461a90bcc0292fc2ebeadc4a9fd0ee859af096a4be
    2019-05-08T20:25:00.674Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=40369af583f415894dd17ef6266191de11ac1469cf23760f1acc12ca16bb2e0d
    2019-05-08T20:25:00.676Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=4e594ec2c1e91201d4424c4b2f9a524867b810c881b3be781a3d2a99e97937ec
    2019-05-08T20:25:08.124Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka path=/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/alloc/logs/.akka.stdout.fifo @module=logmon timestamp=2019-05-08T20:25:08.123Z
    2019-05-08T20:25:08.124Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2974cadb-723a-2e10-0ea2-94d6a92705be task=akka @module=logmon path=/opt/nomad/data/alloc/2974cadb-723a-2e10-0ea2-94d6a92705be/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:25:08.123Z
    2019-05-08T20:25:08.159Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/framework-1.0-SNAPSHOT/bin/framework Args:[-Dconfig.resource=framework-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:25:08.609Z [ERROR] client.driver_mgr.exec.executor.nomad: reading plugin stderr: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 driver=exec task_name=akka error="read |0: file already closed"
    2019-05-08T20:25:08.609Z [INFO ] client.alloc_runner.task_runner: restarting task: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 task=akka reason="Restart within policy" delay=16.218444664s
    2019-05-08T20:25:26.453Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 task=akka @module=logmon path=/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:25:26.419Z
    2019-05-08T20:25:26.453Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=91697527-6222-2a91-d3dd-50e24c72c149 task=akka path=/opt/nomad/data/alloc/91697527-6222-2a91-d3dd-50e24c72c149/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:25:26.422Z
    2019-05-08T20:25:26.470Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/notification-1.0-SNAPSHOT/bin/notification Args:[-Dconfig.resource=notification-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:25:44.457Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play @module=logmon path=/opt/nomad/data/alloc/a0fe089e-06d1-bb50-2529-ce901c2483c5/alloc/logs/.play.stdout.fifo timestamp=2019-05-08T20:25:44.456Z
    2019-05-08T20:25:44.462Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play @module=logmon path=/opt/nomad/data/alloc/a0fe089e-06d1-bb50-2529-ce901c2483c5/alloc/logs/.play.stderr.fifo timestamp=2019-05-08T20:25:44.457Z
    2019-05-08T20:25:46.966Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play @module=logmon path=/opt/nomad/data/alloc/a8b47cce-a1c0-1c60-a186-001c1e1ed144/alloc/logs/.play.stdout.fifo timestamp=2019-05-08T20:25:46.966Z
    2019-05-08T20:25:46.966Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play @module=logmon path=/opt/nomad/data/alloc/a8b47cce-a1c0-1c60-a186-001c1e1ed144/alloc/logs/.play.stderr.fifo timestamp=2019-05-08T20:25:46.966Z
    2019-05-08T20:25:47.000Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=93c78408-2bdf-c492-a0eb-1e9fb5feb8d6 task=akka @module=logmon path=/opt/nomad/data/alloc/93c78408-2bdf-c492-a0eb-1e9fb5feb8d6/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:25:46.999Z
    2019-05-08T20:25:47.000Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=93c78408-2bdf-c492-a0eb-1e9fb5feb8d6 task=akka path=/opt/nomad/data/alloc/93c78408-2bdf-c492-a0eb-1e9fb5feb8d6/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:25:47.000Z
    2019-05-08T20:25:47.150Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=be028bb7-f528-003f-7db5-4a54644551f7 task=play @module=logmon path=/opt/nomad/data/alloc/be028bb7-f528-003f-7db5-4a54644551f7/alloc/logs/.play.stdout.fifo timestamp=2019-05-08T20:25:47.150Z
    2019-05-08T20:25:47.150Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=be028bb7-f528-003f-7db5-4a54644551f7 task=play @module=logmon path=/opt/nomad/data/alloc/be028bb7-f528-003f-7db5-4a54644551f7/alloc/logs/.play.stderr.fifo timestamp=2019-05-08T20:25:47.150Z
    2019-05-08T20:25:47.499Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=deb29634-351e-a008-754d-d62db4c2dbc0 task=akka @module=logmon path=/opt/nomad/data/alloc/deb29634-351e-a008-754d-d62db4c2dbc0/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:25:47.499Z
    2019-05-08T20:25:47.499Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=deb29634-351e-a008-754d-d62db4c2dbc0 task=akka path=/opt/nomad/data/alloc/deb29634-351e-a008-754d-d62db4c2dbc0/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:25:47.499Z
    2019/05/08 20:25:54.064444 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:25:54.064602 [INFO] (runner) creating watcher
    2019/05/08 20:25:54.064767 [INFO] (runner) starting
    2019/05/08 20:25:54.065011 [INFO] (runner) rendered "/opt/nomad/data/alloc/deb29634-351e-a008-754d-d62db4c2dbc0/akka/local/config/config-staging.conf" => "/opt/nomad/data/alloc/deb29634-351e-a008-754d-d62db4c2dbc0/akka/secrets/vars.env"
    2019-05-08T20:25:54.124Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/challenges-1.0-SNAPSHOT/bin/challenges Args:[-Dconfig.resource=challenges-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:25:54.156820 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:25:54.156961 [INFO] (runner) creating watcher
    2019/05/08 20:25:54.157124 [INFO] (runner) starting
    2019/05/08 20:25:54.157374 [INFO] (runner) rendered "/opt/nomad/data/alloc/93c78408-2bdf-c492-a0eb-1e9fb5feb8d6/akka/local/config/config-staging.conf" => "/opt/nomad/data/alloc/93c78408-2bdf-c492-a0eb-1e9fb5feb8d6/akka/secrets/vars.env"
    2019-05-08T20:25:54.186Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/challenges-1.0-SNAPSHOT/bin/challenges Args:[-Dconfig.resource=challenges-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms160M -J-Xmx160M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:25:54.765720 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:25:54.765858 [INFO] (runner) creating watcher
    2019/05/08 20:25:54.766830 [INFO] (runner) starting
    2019/05/08 20:25:54.767300 [INFO] (runner) rendered "/opt/nomad/data/alloc/a8b47cce-a1c0-1c60-a186-001c1e1ed144/play/local/config/config-iter.conf" => "/opt/nomad/data/alloc/a8b47cce-a1c0-1c60-a186-001c1e1ed144/play/secrets/vars.env"
    2019-05-08T20:25:54.770Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/api-1.0-SNAPSHOT/bin/api Args:[-Dconfig.resource=api-web.conf -Dlogger.resource=logback-env.xml -Dhttp.port=28676 -J-Xms950M -J-Xmx950M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:25:55.844431 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:25:55.844621 [INFO] (runner) creating watcher
    2019/05/08 20:25:55.844784 [INFO] (runner) starting
    2019/05/08 20:25:55.845049 [INFO] (runner) rendered "/opt/nomad/data/alloc/a0fe089e-06d1-bb50-2529-ce901c2483c5/play/local/config/config-iter.conf" => "/opt/nomad/data/alloc/a0fe089e-06d1-bb50-2529-ce901c2483c5/play/secrets/vars.env"
    2019-05-08T20:25:55.861Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/api-1.0-SNAPSHOT/bin/api Args:[-Dconfig.resource=api-web.conf -Dlogger.resource=logback-env.xml -Dhttp.port=20985 -J-Xms950M -J-Xmx950M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:25:56.672847 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:25:56.672993 [INFO] (runner) creating watcher
    2019/05/08 20:25:56.673163 [INFO] (runner) starting
    2019/05/08 20:25:56.673402 [INFO] (runner) rendered "/opt/nomad/data/alloc/be028bb7-f528-003f-7db5-4a54644551f7/play/local/config/config-staging.conf" => "/opt/nomad/data/alloc/be028bb7-f528-003f-7db5-4a54644551f7/play/secrets/vars.env"
    2019-05-08T20:25:56.679Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/api-1.0-SNAPSHOT/bin/api Args:[-Dconfig.resource=api-web.conf -Dlogger.resource=logback-env.xml -Dhttp.port=28426 -J-Xms950M -J-Xmx950M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:39:28.851Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=763844fb-20eb-ebc8-fa3f-a10b01b80ea2 task=akka path=/opt/nomad/data/alloc/763844fb-20eb-ebc8-fa3f-a10b01b80ea2/alloc/logs/.akka.stdout.fifo @module=logmon timestamp=2019-05-08T20:39:28.851Z
    2019-05-08T20:39:28.852Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=763844fb-20eb-ebc8-fa3f-a10b01b80ea2 task=akka path=/opt/nomad/data/alloc/763844fb-20eb-ebc8-fa3f-a10b01b80ea2/alloc/logs/.akka.stderr.fifo @module=logmon timestamp=2019-05-08T20:39:28.851Z
    2019-05-08T20:39:29.344Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=37755b7f-02a9-12d8-9b1a-94638958577b task=akka @module=logmon path=/opt/nomad/data/alloc/37755b7f-02a9-12d8-9b1a-94638958577b/alloc/logs/.akka.stdout.fifo timestamp=2019-05-08T20:39:29.344Z
    2019-05-08T20:39:29.348Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=37755b7f-02a9-12d8-9b1a-94638958577b task=akka @module=logmon path=/opt/nomad/data/alloc/37755b7f-02a9-12d8-9b1a-94638958577b/alloc/logs/.akka.stderr.fifo timestamp=2019-05-08T20:39:29.344Z
    2019/05/08 20:39:33.314857 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:39:33.315383 [INFO] (runner) creating watcher
    2019/05/08 20:39:33.315587 [INFO] (runner) starting
    2019/05/08 20:39:33.315832 [INFO] (runner) rendered "/opt/nomad/data/alloc/763844fb-20eb-ebc8-fa3f-a10b01b80ea2/akka/local/config/config-iter.conf" => "/opt/nomad/data/alloc/763844fb-20eb-ebc8-fa3f-a10b01b80ea2/akka/secrets/vars.env"
    2019-05-08T20:39:33.318Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/experiments-1.0-SNAPSHOT/bin/experiments Args:[-Dconfig.resource=experiments-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms256M -J-Xmx256M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019/05/08 20:39:33.494077 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:39:33.494281 [INFO] (runner) creating watcher
    2019/05/08 20:39:33.494449 [INFO] (runner) starting
    2019/05/08 20:39:33.494920 [INFO] (runner) rendered "/opt/nomad/data/alloc/37755b7f-02a9-12d8-9b1a-94638958577b/akka/local/config/config-hotfix.conf" => "/opt/nomad/data/alloc/37755b7f-02a9-12d8-9b1a-94638958577b/akka/secrets/vars.env"
    2019-05-08T20:39:33.507Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/experiments-1.0-SNAPSHOT/bin/experiments Args:[-Dconfig.resource=experiments-web.conf -Dlogback.configurationFile=logback-env.xml -J-Xms256M -J-Xmx256M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T20:50:18.620Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9 task=nginx @module=logmon path=/opt/nomad/data/alloc/2743907c-848c-1d70-2ab1-4e870eaa1ea9/alloc/logs/.nginx.stdout.fifo timestamp=2019-05-08T20:50:18.620Z
    2019-05-08T20:50:18.620Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9 task=nginx path=/opt/nomad/data/alloc/2743907c-848c-1d70-2ab1-4e870eaa1ea9/alloc/logs/.nginx.stderr.fifo @module=logmon timestamp=2019-05-08T20:50:18.620Z
    2019/05/08 20:50:18.627562 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 20:50:18.627782 [INFO] (runner) creating watcher
    2019/05/08 20:50:18.627889 [INFO] (runner) starting
    2019/05/08 20:50:18.637148 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/2743907c-848c-1d70-2ab1-4e870eaa1ea9/nginx/nginx/nginx.conf"
    2019-05-08T20:50:18.781Z [INFO ] client.driver_mgr.docker: created container: driver=docker container_id=aca08f3117ae392f0a3d188ab8c8dbc320cd01f144208e836a11cdb56228c79e
    2019-05-08T20:50:20.463Z [INFO ] client.driver_mgr.docker: started container: driver=docker container_id=aca08f3117ae392f0a3d188ab8c8dbc320cd01f144208e836a11cdb56228c79e
    2019-05-08T20:50:24.843Z [ERROR] client.driver_mgr.exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833 driver=exec task_name=akka error="rpc error: code = Unavailable desc = transport is closing"
    2019-05-08T20:57:54.119Z [INFO ] client.gc: marking allocation for GC: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9
    2019-05-08T20:57:54.125Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9 task=nginx error="read |0: file already closed"
    2019/05/08 20:57:54.125787 [INFO] (runner) stopping
    2019/05/08 20:57:54.125841 [INFO] (runner) received finish
    2019-05-08T20:59:58.779Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fddf0feb-28b5-cf58-8318-9224a5f35880 task=play @module=logmon path=/opt/nomad/data/alloc/fddf0feb-28b5-cf58-8318-9224a5f35880/alloc/logs/.play.stdout.fifo timestamp=2019-05-08T20:59:58.779Z
    2019-05-08T20:59:58.779Z [INFO ] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=fddf0feb-28b5-cf58-8318-9224a5f35880 task=play @module=logmon path=/opt/nomad/data/alloc/fddf0feb-28b5-cf58-8318-9224a5f35880/alloc/logs/.play.stderr.fifo timestamp=2019-05-08T20:59:58.779Z
    2019/05/08 21:00:03.122023 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:00:03.122222 [INFO] (runner) creating watcher
    2019/05/08 21:00:03.122481 [INFO] (runner) starting
    2019/05/08 21:00:03.122769 [INFO] (runner) rendered "/opt/nomad/data/alloc/fddf0feb-28b5-cf58-8318-9224a5f35880/play/local/config/config-iter.conf" => "/opt/nomad/data/alloc/fddf0feb-28b5-cf58-8318-9224a5f35880/play/secrets/vars.env"
    2019-05-08T21:00:03.138Z [INFO ] client.driver_mgr.exec: starting task: driver=exec driver_cfg="{Command:local/api-1.0-SNAPSHOT/bin/api Args:[-Dconfig.resource=api-web.conf -Dlogger.resource=logback-env.xml -Dhttp.port=21066 -J-Xms950M -J-Xmx950M -J-Xss512k -J-XX:+UseCompressedOops -J-XX:+HeapDumpOnOutOfMemoryError -J-XX:+UseG1GC -J-XX:MaxGCPauseMillis=400 -J-XX:InitiatingHeapOccupancyPercent=70 -J-XX:SurvivorRatio=6 -J-XX:G1ReservePercent=10 -J-XX:G1HeapRegionSize=1m -J-XX:ConcGCThreads=2]}"
    2019-05-08T21:22:28.198Z [INFO ] client.gc: garbage collecting allocation: alloc_id=ceb8becf-9023-2ed0-9994-043480092659 reason="forced collection"
==> Caught signal: interrupt
    2019-05-08T21:23:08.581Z [INFO ] agent: requesting shutdown
    2019-05-08T21:23:08.581Z [INFO ] client: shutting down
==> Loaded configuration from /opt/nomad/config/datadog.hcl, /opt/nomad/config/default.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

       Advertise Addrs: HTTP: 10.2.2.87:4646
            Bind Addrs: HTTP: 0.0.0.0:4646
                Client: true
             Log Level: INFO
                Region: us-east-1 (DC: us-east-1b)
                Server: false
               Version: 0.9.1

==> Nomad agent started! Log data will stream in below:

    2019-05-08T21:23:18.698Z [WARN ] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/data/plugins
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=rkt type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=java type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
    2019-05-08T21:23:18.703Z [INFO ] client: using state directory: state_dir=/opt/nomad/data/client
    2019-05-08T21:23:18.703Z [INFO ] client: using alloc directory: alloc_dir=/opt/nomad/data/alloc
    2019-05-08T21:23:18.725Z [INFO ] client.fingerprint_mgr.cgroup: cgroups are available
    2019-05-08T21:23:18.729Z [INFO ] client.fingerprint_mgr.consul: consul agent is available
    2019-05-08T21:23:18.769Z [WARN ] client.fingerprint_mgr.network: unable to parse speed: path=/sbin/ethtool device=ens5
    2019-05-08T21:23:18.805Z [INFO ] client.plugin: starting plugin manager: plugin-type=driver
    2019-05-08T21:23:18.805Z [INFO ] client.plugin: starting plugin manager: plugin-type=device
    2019-05-08T21:23:18.819Z [INFO ] client.consul: discovered following servers: servers=10.2.2.12:4647,10.2.1.188:4647,10.2.3.243:4647
    2019-05-08T21:23:19.023Z [ERROR] client.driver_mgr.exec: failed to reattach to executor: driver=exec error="error creating rpc client for executor plugin: Reattachment process not found" task_id=a94a2301-1393-da6d-9007-0e592ec9d833/akka/d42902ef
    2019-05-08T21:23:19.023Z [ERROR] client.alloc_runner.task_runner: error recovering task; cleaning up: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833 task=akka error="failed to reattach to executor: error creating rpc client for executor plugin: Reattachment process not found" task_id=a94a2301-1393-da6d-9007-0e592ec9d833/akka/d42902ef
    2019-05-08T21:23:19.055Z [INFO ] client: started client: node_id=be46fd28-4fe9-04b2-46d2-eaeb8ec69a81
    2019-05-08T21:23:19.075Z [INFO ] client.gc: marking allocation for GC: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9
    2019/05/08 21:23:19.102963 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.104684 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.108261 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.122165 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.122783 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.126296 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.126445 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.126643 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.110191 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.111037 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.129335 [INFO] (runner) creating watcher
    2019-05-08T21:23:19.132Z [INFO ] client.gc: marking allocation for GC: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833
    2019/05/08 21:23:19.111997 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.132637 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.132692 [INFO] (runner) starting
    2019/05/08 21:23:19.112274 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.132868 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.112306 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.133339 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.114421 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.134172 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.120905 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.134312 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.121119 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.134464 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.121296 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.121630 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.108732 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.123189 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.124755 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.128608 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.134899 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.135332 [INFO] (runner) starting
    2019/05/08 21:23:19.135357 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.135909 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.136082 [INFO] (runner) starting
    2019/05/08 21:23:19.136748 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.137856 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.138765 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.138906 [INFO] (runner) starting
    2019/05/08 21:23:19.139340 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.159482 [INFO] (runner) starting
    2019/05/08 21:23:19.174250 [INFO] (runner) starting
    2019/05/08 21:23:19.179265 [INFO] (runner) starting
    2019/05/08 21:23:19.192049 [INFO] (runner) starting
    2019/05/08 21:23:19.195643 [INFO] (runner) starting
    2019/05/08 21:23:19.196539 [INFO] (runner) starting
    2019/05/08 21:23:19.197529 [INFO] (runner) starting
    2019/05/08 21:23:19.197597 [INFO] (runner) starting
    2019/05/08 21:23:19.199136 [INFO] (runner) starting
    2019/05/08 21:23:19.207156 [INFO] (runner) starting
    2019/05/08 21:23:19.208862 [INFO] (runner) starting
    2019/05/08 21:23:19.219323 [INFO] (runner) starting
    2019/05/08 21:23:19.222134 [INFO] (runner) starting
    2019/05/08 21:23:19.227013 [INFO] (runner) creating new runner (dry: false, once: false)
    2019/05/08 21:23:19.227134 [INFO] (runner) creating watcher
    2019/05/08 21:23:19.230464 [INFO] (runner) starting
    2019/05/08 21:23:19.231020 [INFO] (runner) starting
    2019-05-08T21:23:19.261Z [INFO ] client: node registration complete
    2019-05-08T21:23:20.512Z [INFO ] client.gc: marking allocation for GC: alloc_id=37755b7f-02a9-12d8-9b1a-94638958577b
    2019-05-08T21:23:24.540Z [INFO ] client: node registration complete
    2019-05-08T21:23:39.097Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play backoff=5s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:23:39.102Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play backoff=5s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:23:48.251Z [INFO ] client.gc: marking allocation for GC: alloc_id=f06d5440-820c-efb5-bf31-3f196ca8d95f
    2019/05/08 21:23:49.099958 [INFO] (runner) stopping
    2019/05/08 21:23:49.100040 [INFO] (runner) received finish
    2019-05-08T21:24:04.119Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play backoff=20s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:24:04.120Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play backoff=20s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:24:30.138Z [INFO ] client.gc: marking allocation for GC: alloc_id=25c1469b-6c2d-a330-51c6-35ec0b0d02ff
    2019/05/08 21:24:31.115725 [INFO] (runner) stopping
    2019/05/08 21:24:31.115821 [INFO] (runner) received finish
    2019-05-08T21:24:44.148Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play backoff=1m20s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:24:44.148Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play backoff=1m20s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:26:24.167Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play backoff=2m0s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:26:24.168Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play backoff=2m0s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:28:44.183Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play backoff=2m0s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:28:44.183Z [ERROR] client.alloc_runner.task_runner: failed to kill task: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play backoff=2m0s error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:30:44.183Z [ERROR] client.alloc_runner.task_runner: failed to kill task. Resources may have been leaked: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5 task=play error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T21:30:44.184Z [ERROR] client.alloc_runner.task_runner: failed to kill task. Resources may have been leaked: alloc_id=a8b47cce-a1c0-1c60-a186-001c1e1ed144 task=play error="executor Shutdown failed: rpc error: code = Unknown desc = process failed to exit after 15 seconds"
    2019-05-08T22:37:28.216Z [INFO ] client.gc: garbage collecting allocation: alloc_id=25c1469b-6c2d-a330-51c6-35ec0b0d02ff reason="forced collection"
    2019-05-08T22:37:28.216Z [INFO ] client.gc: garbage collecting allocation: alloc_id=2743907c-848c-1d70-2ab1-4e870eaa1ea9 reason="forced collection"
    2019-05-08T22:37:28.216Z [INFO ] client.gc: garbage collecting allocation: alloc_id=37755b7f-02a9-12d8-9b1a-94638958577b reason="forced collection"
    2019-05-08T22:37:28.216Z [INFO ] client.gc: garbage collecting allocation: alloc_id=f06d5440-820c-efb5-bf31-3f196ca8d95f reason="forced collection"
    2019-05-08T22:37:28.216Z [INFO ] client.gc: garbage collecting allocation: alloc_id=a94a2301-1393-da6d-9007-0e592ec9d833 reason="forced collection"
    2019-05-08T23:18:32.998Z [INFO ] client.gc: marking allocation for GC: alloc_id=a0fe089e-06d1-bb50-2529-ce901c2483c5

@bizonek27
Copy link

Same issue here
Nomad version: Nomad v0.9.1 (4b2bdbd9ab68a27b10c2ee781cceaaf62e114399)
Is there any other solution besides restarting the agents?

@urjitbhatia
Copy link

@preetapan this issue bites us very regularly and it seems like a lot of other users are also impacted by this. Any updates on this?

@parberge
Copy link

parberge commented Jul 1, 2019

We have seen this kind of behaviour with the docker driver as well.
Running Nomad 0.9.3.

We have observed the following:
When Nomad sets the desired state of a running allocation to stop, we see the "Sent interrupt" message on the allocation. However, the allocation's Docker container keeps running, and its status never changes from running.

When this happens, Nomad seems to have lost communication with Docker on that client. If we stop the container manually (which works without any problems), Nomad still thinks it's running.
The only way we have found to recover from this state is to restart Nomad on the client.

Docker version: 18.09.6

@byronwolfman
Copy link
Contributor

byronwolfman commented Jul 9, 2019

We're observing the exact same behaviour as @parberge. Nomad sets the desired state to stop but the allocations stay running. I jumped onto a host thinking docker was stuck and docker stop'd a container. No joy: Nomad still thinks it's running, and the allocation even stays registered in the consul catalog.

CentOS 7 - Linux <host> 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Nomad 0.9.3
Docker 18.09.7

Wish I could provide more info but we are rapidly reverting one of our staging clusters to 0.8.x.

@byronwolfman
Copy link
Contributor

Fun fact: it's not possible to downgrade the nomad servers from 0.9.x to 0.8.x, so we've just downgraded our nodes for now. Running 0.9.3 servers + 0.8.4 nodes has resolved the unstoppable allocations, which is to say the problem can probably be safely attributed to the client. We've kept Docker 18.09.7, so we can rule that out.

@parberge
Copy link

Fun fact: it's not possible to downgrade the nomad servers from 0.9.x to 0.8.x, so we've just downgraded our nodes for now. Running 0.9.3 servers + 0.8.4 nodes has resolved the unstoppable allocations, which is to say the problem can probably be safely attributed to the client. We've kept Docker 18.09.7, so we can rule that out.

That's really good to know. Might try that as well

@byronwolfman
Copy link
Contributor

Nomad: 0.9.3
Consul: 1.5.1 (also tested with 1.4.5)
CentOS Linux release 7.6.1810 (Core)

Hey team,

We're having a heck of a time safely reproducing this issue. We have a staging cluster that has 0.9.3 servers and 0.8.4 nodes. If I add a couple of 0.9.3 nodes and use job constraints to keep our "real" apps off of them, I can deploy and drain all day without triggering the issue. As soon as I deploy any of our real apps to 0.9.3 though, we will eventually trigger the bug (and then the bug affects real apps and test apps). Specifically what happens:

  1. We will deploy a new version of an app with canary, then promote it, and then the old allocations get "stuck" in desired: stop but state: running. The old allocations also never get de-registered from consul.
  2. A far more reliable way to trigger the bug is to drain a node, since then we're trying to stop lots of allocations instead of just a few; in this case, allocations likewise become desired: stop with state: running forever.

If we terminate the node, the nomad servers will recognize the allocations as "lost" (which is expected).

To reiterate we're experiencing this with the docker driver and not exec.

The only thing I can think of that's different before and after triggering the bug is that "real" apps will sometimes restart as they come up, since some try to acquire locks and don't pass healthchecks until they succeed. Nomad dutifully restarts these apps, and after a few minutes everything is stable. I don't know if this leaves any lingering state, but we've only observed this bug on nodes where allocations have been restarted due to health checks. It may be coincidental, or it may not.

I managed to capture logs and telemetry from one of our nodes as this happened. They're over 300 lines long, so I'm linking to a gist rather than pasting them inline. I've performed a find-and-replace on the host name and most of our service names, since those details are not meant to be out on the internet, but I've left the alloc/node IDs alone since those are just giant UUIDs.

Node drain, which begins at 00:13:50 UTC, and corresponding consul (1.5.1) and nomad (0.9.3) debug logs: https://gist.github.com/byronwolfman/570ab3918c9d378a714598bc45d195ca

Some allocations do actually finish draining, but most do not. The drain log shows everything that we see: we never get our prompt back, and we never progress past the last message at 00:14:43.

The consul/nomad logs, on the other hand, have been cropped to the events just as the drain is triggered and a few minutes after. Nothing interesting appears in the nomad/consul logs after 00:18:57; it just keeps repeating /v1/agent/health and occasionally /v1/node/95b3c3c5... We know that resource exhaustion isn't the problem, since we don't have any blocked evaluations and there are plenty of unallocated resources on other nodes.

We're thinking of compiling 0.9.3 with a few extra log messages, but I'm not sure where the best place to put them would be. If you have some recommendations on placement, we can try reproducing again in the evening, when breaking a staging cluster doesn't affect our teams.

A final gift: we emit statsd metrics and have telemetry. This is only a sample of what looked like some interesting metrics, but we can produce others if you'd like to see them.

Metrics from the host that was being drained:

[Screenshot: node telemetry graph captured 2019-07-12, 12:13 PM]

The graph is in EDT, but you can see at 20:14 EDT (00:14 UTC) that goroutines essentially flatline. We're not sure if this is meaningful, but it's interesting.

@byronwolfman
Copy link
Contributor

byronwolfman commented Jul 12, 2019

Hey, sorry for being a nuisance in this issue. We managed to reproduce it in a Vagrant environment today. Steps to reproduce:

  1. Deploy at least two Nomad jobs; they should be flaky so that they get killed due to bad healthchecks at some point (this works: https://github.com/byronwolfman/nomad-playground/blob/master/bootstrap/webapp/v0.3/flaky_shinatra.sh; see also the toy stand-in after this list)
  2. Come back in 30-60 minutes after they've had time to crash a few times
  3. Attempt to canary deploy a new version (can change an env variable value to get a diff)
  4. Promote canary once healthy
  5. Watch everything grind to a halt
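
Any endpoint that fails health checks intermittently will do; here's a toy Go stand-in for the flaky script linked in step 1 (illustrative only, with a made-up port and path, not the exact script we ran):

// Flaky HTTP service: the health check fails roughly half the time,
// which is enough to make Nomad's check_restart logic kick in repeatedly.
package main

import (
    "math/rand"
    "net/http"
)

func main() {
    http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        if rand.Intn(2) == 0 {
            http.Error(w, "unhealthy", http.StatusInternalServerError)
            return
        }
        w.Write([]byte("ok"))
    })
    http.ListenAndServe(":8080", nil)
}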

The issue seems to be agnostic to exec/docker because the problem is in the task runner's pre-kill service hook. Specifically here:

https://github.com/hashicorp/nomad/blob/v0.9.3/client/allocrunner/taskrunner/service_hook.go#L148-L154

We did the grossest thing possible for a team of sysadmins who don't write Go, and just bolted on some more logging statements:

148 func (h *serviceHook) PreKilling(ctx context.Context, req *interfaces.TaskPreKillRequest, resp *interfaces.TaskPreKillResponse) error {
149   h.logger.Info("[ws] PreKilling invoked; top")
150   h.mu.Lock()
151   defer h.mu.Unlock()
152
153   // Deregister before killing task
154   h.logger.Info("[ws] PreKilling invoked; will deregister")
155   h.deregister()
156   h.logger.Info("[ws] PreKilling invoked; deregister returned")

When the node is working normally, we see all three messages emitted: 1. we're at the top of the PreKilling function, 2. now we're deregistering from consul, 3. now we've deregistered.

When things break, we can watch in real-time as we get locked out by the mutex. Have a look at this:

Jul 12 19:56:37 ip-10-0-2-15 nomad: 2019-07-12T19:56:37.379Z [INFO ] client.alloc_runner.task_runner: [ws] running pre kill hooks: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels
Jul 12 19:56:37 ip-10-0-2-15 nomad: 2019-07-12T19:56:37.379Z [INFO ] client.alloc_runner.task_runner: [ws] running prekill hook: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels name=consul_services
Jul 12 19:56:37 ip-10-0-2-15 nomad: 2019-07-12T19:56:37.379Z [INFO ] client.alloc_runner.task_runner.task_hook.consul_services: [ws] PreKilling invoked; top: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels
Jul 12 19:56:37 ip-10-0-2-15 nomad: 2019-07-12T19:56:37.379Z [INFO ] client.alloc_runner.task_runner.task_hook.consul_services: [ws] PreKilling invoked; will deregister: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels
Jul 12 19:56:37 ip-10-0-2-15 nomad: 2019-07-12T19:56:37.386Z [WARN ] consul.sync: failed to update services in Consul: error="Unexpected response code: 500 (Unknown check "_nomad-check-d6df5be19ff4ecf3aea6b073630c17acc39e9e63")"
Jul 12 19:56:38 ip-10-0-2-15 nomad: 2019-07-12T19:56:38.391Z [INFO ] consul.sync: successfully updated services in Consul

Uh oh. "will deregister" -- but we don't get "deregister returned". That seems bad. And now this happens:

Jul 12 19:59:46 ip-10-0-2-15 nomad: 2019-07-12T19:59:46.140Z [INFO ] client.alloc_runner.task_runner: [ws] running pre kill hooks: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels
Jul 12 19:59:46 ip-10-0-2-15 nomad: 2019-07-12T19:59:46.140Z [INFO ] client.alloc_runner.task_runner: [ws] running prekill hook: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels name=consul_services
Jul 12 19:59:46 ip-10-0-2-15 nomad: 2019-07-12T19:59:46.140Z [INFO ] client.alloc_runner.task_runner.task_hook.consul_services: [ws] PreKilling invoked; top: alloc_id=d2f4fa82-ee6e-55fd-9f8e-e0bd036f6f36 task=training-wheels

Well, now we don't even get "will deregister", because the mutex never unlocked. Now we're stuck.
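
If you want to see the shape of this lock-up without a full cluster, here's a minimal, self-contained sketch. The type, channel, and buffer size are stand-ins I've made up; this is not Nomad's actual code:

// The first PreKilling call takes the hook mutex and then blocks sending
// into a full channel that nothing is draining. The second call then parks
// forever on Lock(), exactly matching the missing log lines above.
package main

import (
    "fmt"
    "sync"
    "time"
)

type serviceHook struct {
    mu      sync.Mutex
    updates chan string // stand-in for the check watcher's buffered channel
}

func (h *serviceHook) PreKilling(call int) {
    fmt.Printf("PreKilling %d: top\n", call)
    h.mu.Lock()
    defer h.mu.Unlock()

    fmt.Printf("PreKilling %d: will deregister\n", call)
    h.updates <- "remove" // blocks forever once the buffer is full
    fmt.Printf("PreKilling %d: deregister returned\n", call)
}

func main() {
    h := &serviceHook{updates: make(chan string, 1)}
    h.updates <- "pending" // fill the buffer; nobody is draining it

    go h.PreKilling(1) // blocks on the full channel while holding the mutex
    time.Sleep(100 * time.Millisecond)
    go h.PreKilling(2) // parks on Lock(); "will deregister" never prints

    time.Sleep(500 * time.Millisecond)
    fmt.Println("both calls stuck: same shape as the missing log lines above")
}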

@byronwolfman
Copy link
Contributor

byronwolfman commented Jul 14, 2019

As a preface: I'm nearly 100% certain that this issue is driver agnostic. It definitely seems to be consul related (or rather, check-watcher related), and sure enough, even though the original submission is for the exec driver, the jobspec posted has a service stanza, and that's the big deal here.

Also this is probably beyond my ability to solve. I could be looking in the wrong places here, but it looks to me like the biggest change between 0.8.4 (what we're mostly running) and 0.9.x is that allocation restarts due to healthcheck failures are handled somewhat synchronously, and also very slowly if a shutdown delay is used.

I've added another (yes, sorry) logger output to https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L163 and I can see the check watcher's main watch loop grind to a halt over time. The loop is expected to stop when len(checks) is 0, since we block waiting for updates, but that isn't the issue here: I can watch the loop stall even with len(checks) > 0. We don't exit the loop; we just get bogged down restarting services to the point that we never make it back to the top.

When an allocation becomes unhealthy, we go here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L257

Then here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L63

And then here: https://github.com/hashicorp/nomad/blob/v0.9.3/command/agent/consul/check_watcher.go#L109

In Nomad 0.8.4, Restart is essentially a fire-and-forget where we insert an event into the restart channel. It's a small enough function that here's the whole thing:

func (r *TaskRunner) Restart(source, reason string, failure bool) {
	reasonStr := fmt.Sprintf("%s: %s", source, reason)
	event := newTaskRestartEvent(reasonStr, failure)

	select {
	case r.restartCh <- event:
	case <-r.waitCh:
	}
}

(via https://github.com/hashicorp/nomad/blob/v0.8.4/client/task_runner.go#L1757-L1765)

In 0.9.x (and at least up to the master branch as of today), Restart is far more synchronous: https://github.com/hashicorp/nomad/blob/v0.9.3/client/allocrunner/taskrunner/lifecycle.go#L11

In particular, we run the pre-kill hooks, which has some interesting consequences. For example: most of the services our shop runs have a long shutdown delay to allow for traffic draining. This works fine when we deploy (old allocs are de-registered from consul, drain for 45 seconds, and are then shut down). In 0.8.4 a restart was just a restart; in 0.9.3 a restart fires the pre-kill hooks, which include that 45-second delay. If a bunch of allocations all restart at once, the check watcher has to loop over every one of them and kill them one at a time, waiting 45 seconds for each pre-kill hook to return.

Each time we kill something, we throw a message into the check watcher's update channel. Unfortunately the check watcher cannot retrieve these messages yet, because it's still busy killing services. Eventually the channel fills to capacity and no more messages can be inserted. The send is blocking, though, so the allocations that were taking "only" 45 seconds to get out of the way are now deadlocked.

At least, that's the working theory.

I'm going to look around to see if there's a non-horrible way to skip the shutdown delay when an allocation is killed due to a bad healthcheck. Maybe it will solve the problem, maybe it won't. It may be that there's a more deeply-rooted problem, and that until now Nomad has simply chewed through its restart and update queue quickly enough to hide it.

@byronwolfman
Copy link
Contributor

byronwolfman commented Jul 14, 2019

Well, never mind: it turns out the solution is a lot simpler. I still need to spend more time validating that this actually solves the problem, but on my crash-loop apparatus I've now run the client for 10 minutes, which is about 9 minutes longer than it usually takes for everything to lock up. Here's the main change:

diff --git a/client/allocrunner/taskrunner/lifecycle.go b/client/allocrunner/taskrunner/lifecycle.go
index 404c9596f..b10df92df 100644
--- a/client/allocrunner/taskrunner/lifecycle.go
+++ b/client/allocrunner/taskrunner/lifecycle.go
@@ -22,8 +22,11 @@ func (tr *TaskRunner) Restart(ctx context.Context, event *structs.TaskEvent, fai
        // Emit the event since it may take a long time to kill
        tr.EmitEvent(event)

-       // Run the pre-kill hooks prior to restarting the task
-       tr.preKill()
+       // Run the pre-kill hooks prior to restarting the task, but skip in case of failure
+       // Service deregistration will be done by the check watcher so no need to deregister via pre-kill hooks
+       if !failure {
+               tr.preKill()
+       }

        // Tell the restart tracker that a restart triggered the exit
        tr.restartTracker.SetRestartTriggered(failure)

This is less hacky than it looks. The pre-kill hook really only does two things:

  1. Deregister the service from consul, if it is registered
  2. Wait if there's a shutdown delay set

An unhealthy allocation should probably just be restarted right away, so that takes care of the second pre-kill action. The first pre-kill action doesn't actually need to be done at all, since the check watcher already does this on its own:

          restarted := check.apply(ctx, now, result.Status)
          if restarted {
            // Checks are registered+watched on
            // startup, so it's safe to remove them
            // whenever they're restarted
            delete(checks, cid)

            restartedTasks[check.taskKey] = struct{}{}
          }

(Late night edit: this is not the case. The checks are removed, but not the service registration, so we still need to figure out a way to remove the service from consul if we skip the pre-kill hooks.)

I'd be curious to see if the other reporters on this issue are able to apply the patch, and if it resolves the issue for them. I'll have time to do more thorough testing on Monday.

A bonus diff, while we're at it:

diff --git a/client/allocrunner/taskrunner/service_hook.go b/client/allocrunner/taskrunner/service_hook.go
index a89110417..2dc1fa524 100644
--- a/client/allocrunner/taskrunner/service_hook.go
+++ b/client/allocrunner/taskrunner/service_hook.go
@@ -176,12 +176,6 @@ func (h *serviceHook) Exited(context.Context, *interfaces.TaskExitedRequest, *in
 func (h *serviceHook) deregister() {
        taskServices := h.getTaskServices()
        h.consul.RemoveTask(taskServices)
-
-       // Canary flag may be getting flipped when the alloc is being
-       // destroyed, so remove both variations of the service
-       taskServices.Canary = !taskServices.Canary
-       h.consul.RemoveTask(taskServices)
-
 }

 func (h *serviceHook) Stop(ctx context.Context, req *interfaces.TaskStopRequest, resp *interfaces.TaskStopResponse) error {

We don't need to flip the canary flag and remove the service a second time, since the canary status isn't used when generating task service IDs:

func makeTaskServiceID(allocID, taskName string, service *structs.Service, canary bool) string {
  return fmt.Sprintf("%s%s-%s-%s", nomadTaskPrefix, allocID, taskName, service.Name)
}

Which is funny, since not including the canary tag solved the last issue I filed! :D (EDIT: I should mention that the reason this diff is riding along is that halving the number of tasks to remove from consul keeps the check watcher's channel clean, which matters since the buffer is small.)

But for reals, give the diff a shot. I'll open a PR once I've made sure the fix isn't worse than the bug.

@byronwolfman
Copy link
Contributor

Having had some more time to think about this, I think we're looking at a pretty classic deadlock. The shutdown delay doesn't really factor into the root cause (other than making the problem slightly more noticeable), though I still think we should skip it in the event of a healthcheck failure.

A simplified view of the check watcher's Run() loop would be:

while {
  dequeueUpdates {
    ...
  }

  for all allocs {
    restartUnhealthy {
      enqueueUpdate()
    }
  }
}

The check watcher loops over all checks and, when it finds an unhealthy one, restarts it. The pre-kill hook in that restart wants to enqueue things back into the check watcher, though, so all you need to deadlock the check watcher is for five allocations to crash. The first four allocation restarts enqueue at least eight updates (two updates per check, because of canary/non-canary). The fifth allocation restart tries to enqueue its updates but can't, because the channel buffer is full. It has also acquired a lock along the way.

Now that the channel is full, and a lock has been acquired, we'll never be able to loop through the remainder of the unhealthy allocations, and because of that, we never make it back to the top of the run loop.

Skipping the shutdown delay would really just surface the deadlock faster. The only real solution (which, fortunately, the diffs above happen to implement) is to never enqueue work from inside a loop that can block when that work cannot be dequeued until the loop exits.
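
A self-contained toy version of that loop makes the wedge easy to watch. The buffer size and names here are invented for illustration; the real bookkeeping lives in check_watcher.go:

// The loop drains a small buffered channel, but "restarting" a check
// re-enqueues into that same channel. Once the buffer fills mid-loop,
// the send blocks and the loop never gets back to draining.
package main

import (
    "fmt"
    "time"
)

func main() {
    updates := make(chan string, 4) // small buffer, like the real check watcher
    checks := []string{"a", "b", "c", "d", "e"}

    go func() {
        for {
            // Top of the run loop: drain whatever updates are pending.
            for len(updates) > 0 {
                <-updates
            }
            // Restart every unhealthy check, enqueueing as we go.
            for _, c := range checks {
                updates <- c + "-removed" // third check: this send blocks,
                updates <- c + "-flipped" // buffer is full, we're the only reader
                fmt.Println("restarted", c)
            }
        }
    }()

    time.Sleep(time.Second)
    fmt.Println("pending updates:", len(updates), "- the loop is wedged")
}

Run it and only "restarted a" and "restarted b" print; the third restart fills the buffer and the loop deadlocks, just like five crashing allocations do against the real channel.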

@byronwolfman
Copy link
Contributor

I've put up a slightly more comprehensive fix (#5957), but it probably deserves the attention of someone who actually knows Go.

schmichael added a commit that referenced this issue Jul 17, 2019
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts.
This matches the pre-0.9 behavior where TaskRunner.Restart was an
asynchronous signal. The check-based restarting code was not designed
to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and
could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and
reproducer!

I created this alternative as changing the functionality of
TaskRunner.Restart has a much larger impact. This approach reverts to
old known-good behavior and minimizes the number of places changes are
made.
@schmichael
Copy link
Member

#5975 fixes the deadlock @byronwolfman discovered and will ship in 0.9.4 which will be in RC stage soon.

Unfortunately @byronwolfman's #5957 PR did not get merged, and it contained a couple of QoL fixes. We wanted to keep 0.9.4 focused on stability improvements, so if @byronwolfman or someone else wants to submit PRs for the dropped improvements, we'd be more than happy to consider them for 0.10.0! See #5980 for a description of how ShutdownDelay was improved in @byronwolfman's PR.
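
For readers following along, here's the earlier toy loop with the spirit of the #5975 change applied. This is an illustrative sketch only, not the actual patch: firing the blocking restart work from a goroutine, like the pre-0.9 asynchronous TaskRunner.Restart signal, lets the watch loop always get back to draining its channel.

package main

import (
    "fmt"
    "time"
)

// Same toy loop as before, but each restart runs in its own goroutine,
// so the run loop never blocks on its own update channel.
func main() {
    updates := make(chan string, 4)
    checks := []string{"a", "b", "c", "d", "e"}

    go func() {
        for {
            for len(updates) > 0 {
                <-updates
            }
            for _, c := range checks {
                c := c
                go func() { // asynchronous, like pre-0.9 TaskRunner.Restart
                    updates <- c + "-removed"
                    updates <- c + "-flipped"
                    fmt.Println("restarted", c)
                }()
            }
            checks = nil // toy only: restart each check once
            time.Sleep(10 * time.Millisecond)
        }
    }()

    time.Sleep(time.Second)
    fmt.Println("pending updates:", len(updates), "- loop kept draining")
}

All five restarts complete because the loop keeps cycling back to the drain step instead of wedging on a full buffer.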

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2022