Tasks using Docker drivers frequently failed to restart #16338

Closed
dani opened this issue Mar 5, 2023 · 7 comments · Fixed by #16352

dani commented Mar 5, 2023

Nomad version

Nomad v1.5.0
BuildDate 2023-03-01T10:11:42Z
Revision fc40c49

Operating system and Environment details

AlmaLinux 8.7 x86_64
Nomad installed from the pre-built standalone binary
Docker CE 23.0.1 (from the official Docker repo)

Issue

Since updating Nomad from 1.4.4 to 1.5.0, I get frequent errors when a task needs to be restarted:

It can be triggered by a template being re-rendered with the default restart behavior on change, or by manually restarting a task from the GUI. The task is stopped correctly, but when it is started again, it fails.
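
For reference, the restart-on-render behaviour comes from the template block's change_mode parameter, which defaults to "restart". A minimal sketch of such a template (the Consul key and destination path here are placeholders, not from my actual job):

template {
  data        = "{{ key \"service/config\" }}"
  destination = "local/app.conf"
  change_mode = "restart" # the default: the task is restarted whenever the template re-renders
}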

Reproduction steps

I'm not sure yet what the exact conditions to reproduce are; it doesn't seem to happen 100% of the time. But I can reproduce it easily by restarting a Connect sidecar from the GUI.

Expected Result

The task is restarted correctly.

Actual Result

The task fails to start and is marked as failed.

Job file (if appropriate)

Still trying to find a simple reproducer (I can trigger it easily with internal jobs, which I can't share).

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Sample logs when the error is triggered. In this example, I restarted the Connect proxy sidecar task (connect-proxy-pharma-ws-stub) from the GUI:

mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.096+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type="Restart Signaled" msg="User requested task to restart" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.314+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=2a748bc8df048a2dcee5a759369b9802fff0cc3510fab52cd711ab1a66f973ec driver=docker
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.319+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type=Terminated msg="Exit Code: 0" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.331+0100 [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/nomad/bin/nomad pid=3472522
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.338+0100 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub reason="" delay=0s
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.338+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type=Restarting msg="Task restarting in 0s" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.385+0100 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=ca2a21e67d8244ab95f054fa7356465bab9e00e00817b4c1553f667ae57faaf4
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.412+0100 [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=ca2a21e67d8244ab95f054fa7356465bab9e00e00817b4c1553f667ae57faaf4 error="API error (404): No such container: f380f97ae161425077bc0b5621883bef5b9516cab224d388eb77254762aec9c4"
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.441+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type="Driver Failure" msg="Failed to start container ca2a21e67d8244ab95f054fa7356465bab9e00e00817b4c1553f667ae57faaf4: API error (404): No such container: f380f97ae161425077bc0b5621883bef5b9516cab224d388eb77254762aec9c4" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.449+0100 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub error="Failed to start container ca2a21e67d8244ab95f054fa7356465bab9e00e00817b4c1553f667ae57faaf4: API error (404): No such container: f380f97ae161425077bc0b5621883bef5b9516cab224d388eb77254762aec9c4"
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.449+0100 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub reason="Error was unrecoverable"
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.449+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type="Not Restarting" msg="Error was unrecoverable" failed=true
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.460+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub type="Sibling Task Failed" msg="Task's sibling \"connect-proxy-pharma-ws-stub\" failed" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.463+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.749+0100 [INFO]  client.driver_mgr.docker: stopped container: container_id=8cd06fee4108609d9d042e3400608da83cd873c9a9b990afdfbfd28e0e7ad25c driver=docker
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.758+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub type=Terminated msg="Exit Code: 143, Exit Message: \"Docker container exited with non-zero exit code: 143\"" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.765+0100 [INFO]  client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/nomad/bin/nomad pid=3472820
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.783+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub type=Killed msg="Task successfully killed" failed=false
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.797+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub path=/opt/nomad/bin/nomad pid=3472531
mars 05 21:32:01 ct-wn-4 nomad[579417]:     2023-03-05T21:32:01.797+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.804+0100 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub @module=logmon timestamp="2023-03-05T21:32:05.804+0100"
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.805+0100 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub @module=logmon timestamp="2023-03-05T21:32:05.804+0100"
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.813+0100 [INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=connect-proxy-pharma-ws-stub path=/opt/nomad/bin/nomad pid=3472363
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.813+0100 [INFO]  client.alloc_runner: waiting for task to exit: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 task=pharma-ws-stub
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.813+0100 [INFO]  client.gc: marking allocation for GC: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4
mars 05 21:32:05 ct-wn-4 nomad[579417]:     2023-03-05T21:32:05.933+0100 [ERROR] client.alloc_runner: postrun failed: alloc_id=62698c1a-3e79-c080-5c82-902f7eb228d4 error="hook \"network\" failed: No such container: f380f97ae161425077bc0b5621883bef5b9516cab224d388eb77254762aec9c4"
@dani dani added the type/bug label Mar 5, 2023
@plasmastorm

I'm seeing the same error on Nomad 1.4.5 and 1.5.0 on Debian 11.

I've noticed that all init/pause containers are vanishing approximately 10 minutes after they're started, after which point the error occurs on attempted container restart.

I believe these are the containers that are missing in the error message. At that point I also see docker stats (and by extension the Nomad resource utilization page) become zero for any container using bridge network ports.

Rolling back to Nomad 1.4.4 fixes this.

Will see if I can grab something pertinent from the logs when I have more time.

@plasmastorm

Searching the logs for the missing container ID, I get:

Mar 05 20:21:03 poweredge dockerd[650]: time="2023-03-05T20:21:03.708811484Z" level=info msg="ignoring event" container=388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Mar 05 20:21:03 poweredge dockerd[650]: time="2023-03-05T20:21:03.716732258Z" level=warning msg="ShouldRestart failed, container will not be restarted" container=388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031 daemonShuttingDown=false error="restart canceled" execDuration=5m11.959514138s exitStatus="{137 false 2023-03-05 20:21:03.697639746 +0000 UTC}" hasBeenManuallyStopped=true restartCount=0
Mar 05 20:21:03 poweredge nomad[443]:     2023-03-05T20:21:03.777Z [INFO]  client.driver_mgr.docker: removed untracked container: driver=docker container_id=388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031
Mar 05 20:21:04 poweredge dockerd[650]: time="2023-03-05T20:21:04.399262266Z" level=error msg="collecting stats for 4a2250245ce48cfabf76c260ee039a0fd520e1b77c309645e70c082dd98d94fc: Could not get container for 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031: No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031"
...
Mar 05 20:21:52 poweredge dockerd[650]: time="2023-03-05T20:21:52.133430177Z" level=error msg="collecting stats for 4a2250245ce48cfabf76c260ee039a0fd520e1b77c309645e70c082dd98d94fc: Could not get container for 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031: No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031"
Mar 05 20:21:54 poweredge nomad[443]:     2023-03-05T20:21:54.580Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=94e07944a57320c255d43e7177cdb5da79f4bef3d636c3751643404c1205a995 error="API error (404): No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031"
Mar 05 20:21:54 poweredge nomad[443]:     2023-03-05T20:21:54.591Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9645442e-acd3-d90f-8175-d214d1e5d39f task=pihole type="Driver Failure" msg="Failed to start container 94e07944a57320c255d43e7177cdb5da79f4bef3d636c3751643404c1205a995: API error (404): No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031" failed=false
Mar 05 20:21:54 poweredge nomad[443]:     2023-03-05T20:21:54.595Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=9645442e-acd3-d90f-8175-d214d1e5d39f task=pihole error="Failed to start container 94e07944a57320c255d43e7177cdb5da79f4bef3d636c3751643404c1205a995: API error (404): No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031"
Mar 05 20:21:58 poweredge nomad[443]:     2023-03-05T20:21:58.646Z [ERROR] client.alloc_runner: postrun failed: alloc_id=9645442e-acd3-d90f-8175-d214d1e5d39f error="hook \"network\" failed: No such container: 388ff523b94e7f66a9c0ce7ef7e2eb239f20b05daf92e242005dde9149cf4031"


dani commented Mar 5, 2023

Indeed, it may well be linked to the pause containers: they are missing on my install too, and all my jobs use bridge networking.


dani commented Mar 5, 2023

Here's a simple reproducer job:

job "test" {
  datacenters = ["dc1"]

  group "test" {

    network {
      mode = "bridge"
    }

    service {
      name = "test"
      connect {
        gateway {
          terminating {
            service {
              name = "proxyout"
            }
          }
        }
      }
    }
  }
}

Just run it, wait ~10 minutes (for the pause container to vanish), then try to restart the connect task from the GUI.
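
The equivalent from the CLI, as a sketch (test.nomad.hcl is whatever file the job above is saved in, <alloc-id> is the allocation ID reported by nomad job status, and I'm assuming the injected gateway task is named connect-terminating-test for this job):

$ nomad job run test.nomad.hcl
$ docker ps                        # a nomad_init_* pause container is present
# wait ~10 minutes for the pause container to be reconciled away
$ docker ps                        # the pause container is gone; the envoy container remains
$ nomad job status test            # note the allocation ID
$ nomad alloc restart <alloc-id> connect-terminating-test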

@jrasell jrasell self-assigned this Mar 6, 2023
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 6, 2023
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 6, 2023

dani commented Mar 6, 2023

This could be a regression introduced by #15732.


jrasell commented Mar 6, 2023

Hi @dani, and thanks for raising this issue. I was able to reproduce this locally using the job spec you provided, running both Nomad and Consul in dev mode.

Here is the docker ps output just after the job has been submitted:

vagrant@linux:/tmp$ docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS          PORTS     NAMES
eba104d1a118   envoyproxy/envoy:v1.18.4                   "/docker-entrypoint.…"   19 seconds ago   Up 18 seconds             connect-terminating-test-90c66ff1-886d-739f-8b82-1bc843a95d51
3ba236522d33   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 25 seconds ago   Up 24 seconds             nomad_init_90c66ff1-886d-739f-8b82-1bc843a95d51

Here is the same command after some time has passed; the job is still running at this point:

vagrant@linux:/tmp$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED          STATUS          PORTS     NAMES
db5efac5d4e7   envoyproxy/envoy:v1.18.4   "/docker-entrypoint.…"   16 minutes ago   Up 16 minutes             connect-terminating-test-049b08a6-e76e-dcb0-ff5f-fe0b4c7bcc2a

I redirected the Nomad logs to a file; searching them for the pause container ID gives:

    2023-03-06T13:06:57.616Z [DEBUG] client.driver_mgr.docker: configuring network mode for task group: driver=docker task_name=connect-terminating-test network_mode=container:362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb
    2023-03-06T13:16:47.397Z [INFO]  client.driver_mgr.docker: removed untracked container: driver=docker container_id=362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb
    2023-03-06T13:24:24.373Z [DEBUG] client.driver_mgr.docker: configuring network mode for task group: driver=docker task_name=connect-terminating-test network_mode=container:362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb
    2023-03-06T13:24:24.446Z [DEBUG] client.driver_mgr.docker: failed to start container: driver=docker container_id=fec5b89cda88a17e436d58c2502e6c8cc446cb82515811215b3adab73fd18f3f attempt=1 error="API error (404): No such container: 362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb"
    2023-03-06T13:24:24.446Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=fec5b89cda88a17e436d58c2502e6c8cc446cb82515811215b3adab73fd18f3f error="API error (404): No such container: 362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb"
    2023-03-06T13:24:24.451Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=049b08a6-e76e-dcb0-ff5f-fe0b4c7bcc2a task=connect-terminating-test type="Driver Failure" msg="Failed to start container fec5b89cda88a17e436d58c2502e6c8cc446cb82515811215b3adab73fd18f3f: API error (404): No such container: 362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb" failed=false
    2023-03-06T13:24:24.451Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=049b08a6-e76e-dcb0-ff5f-fe0b4c7bcc2a task=connect-terminating-test error="Failed to start container fec5b89cda88a17e436d58c2502e6c8cc446cb82515811215b3adab73fd18f3f: API error (404): No such container: 362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb"
    2023-03-06T13:24:28.503Z [ERROR] client.alloc_runner: postrun failed: alloc_id=049b08a6-e76e-dcb0-ff5f-fe0b4c7bcc2a error="hook \"network\" failed: No such container: 362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb"

The most interesting line seems to be client.driver_mgr.docker: removed untracked container: driver=docker container_id=362414f862bf6077766694632830d5213cd025a502345b649aa0107abe843eeb, which indicates we are not correctly tracking the container as in use, or something related to that. I do not believe #15732 is related, as this problem doesn't seem to be directly about restarts. I currently wonder whether #15962 or #15898 are involved here.

I'll raise this internally, so we can get this looked into straight away.

@jrasell jrasell moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Mar 6, 2023
@jrasell jrasell added theme/driver/docker stage/accepted Confirmed, and intend to work on. No timeline committment though. labels Mar 6, 2023
@jrasell jrasell pinned this issue Mar 6, 2023

shoenig commented Mar 6, 2023

I believe we understand the bug now. As an immediate workaround, you should be able to disable dangling container reconciliation (https://developer.hashicorp.com/nomad/docs/drivers/docker#enabled) to stop this from happening.
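
As a sketch, disabling it in the client agent configuration looks like this (assuming you already have a plugin "docker" block, merge the gc settings into it rather than adding a second one):

plugin "docker" {
  config {
    gc {
      dangling_containers {
        # Workaround: stop the dangling-container reconciler from removing
        # the pause containers until the fix ships.
        enabled = false
      }
    }
  }
}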

Basically, when we added the com.hashicorp.nomad.alloc_id label to pause containers in #15898, we failed to realize the pause container was not being counted among the "tracked containers", so it would always be subject to reconciliation, which typically happens after 10 minutes, the default creation_grace period.

We'll get a proper fix out shortly.

@mikenomitch mikenomitch added this to the 1.5.1 milestone Mar 6, 2023
@jrasell jrasell moved this from Needs Roadmapping to In Progress in Nomad - Community Issues Triage Mar 7, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Mar 7, 2023