
failed pause containers result in failed task restarts and orphaned tasks #12216

Closed
RosieBaish opened this issue Mar 8, 2022 · 3 comments · Fixed by #15732

Comments

@RosieBaish

Nomad version

$ nomad --version
Nomad v1.2.3 (a79efc8)

Operating system and Environment details

$ uname -a
Linux 5.15.13 #3 SMP Wed Jan 26 09:09:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="Fedora Linux"
VERSION="35 (Thirty Five)"
ID=fedora
VERSION_ID=35
VERSION_CODENAME=""
PLATFORM_ID="platform:f35"
PRETTY_NAME="Fedora Linux 35 (Thirty Five)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:35"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f35/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=35
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=35
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
DEFAULT_HOSTNAME=localhost

Issue

When using the docker driver, if one of the nomad_init_<UUID> containers dies, the other containers in the task group do not get cleaned up.
We discovered this while doing some general testing around "What happens if you kill random containers in random combinations"
I'm aware that a nomad_init_ container failing in production is highly unlikely, but I thought it was worth the report anyway!

Reproduction steps

Create and run a task group as below
$ docker container ls
3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
$ docker stop $(id of nomad_init_)
$ docker container ls
2 containers, name-<UUID>, connect-proxy-name-<UUID>
$ systemctl restart nomad
$ docker container ls

Expected Result

3 containers, name-<UUID>, connect-proxy-name-<UUID> and nomad_init_<UUID>
Where UUID is either the same as before, or a new one

Actual Result

5 containers:

  • name-<new_UUID>
  • connect-proxy-name-<new_UUID>
  • nomad_init_<new_UUID>
  • name-<old_UUID>
  • connect-proxy-name-<old_UUID>

Nomad has created a new set of containers to replace the task group, but has not cleaned up the existing two. I would expect it either to re-use them (and thus re-create the nomad_init_ container) or to delete and replace them.

Job file (if appropriate)

  group "<redacted>" {
    task "<redacted>" {
      driver = "docker"
      kill_timeout = "20s"
      config {
        image = "<redacted>"
        auth {
          username = "<redacted>"
          password = "<redacted>"
        }
      }
      env {
          <redacted>
      }
      resources {
        cpu    = 200
        memory = 300
      }
    }
    network {
      mode = "bridge"
    }
    service {
      name = "<redacted>"
    }
  }

The docker image is a local one pulled from a private repo but I suspect the exact image is irrelevant.

data_dir  = "<redacted>"

datacenter = "<redacted>"

bind_addr = "0.0.0.0"

log_level = "DEBUG"

server {
  enabled = true
  bootstrap_expect = 1
  raft_protocol = 3
  encrypt = "<redacted>"
}

consul {
  address             = "127.0.0.1:8500"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
  token = "<redacted>"
}

client {
  enabled = true
  servers = ["127.0.0.1"]

  cni_path = "/opt/cni/bin"

  template {
    disable_file_sandbox = true
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }

  host_volume "<redacted>" {
    path      = "<redacted>"
    read_only = false
  }
}

plugin "docker" {
  config {
    allow_caps = ["net_admin", "net_raw", "audit_write", "chown", "dac_override", "fowner", "fsetid", "kill", "mknod", "net_bind_service", "setfcap", "setgid", "setpcap", "setuid", "sys_chroot"]
  }
}

Logs

I can provide logs if you're unable to reproduce, but it will take a while to redact everything, so if you can reproduce it I won't bother.

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Mar 8, 2022
@tgross tgross self-assigned this Mar 8, 2022
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Mar 8, 2022
@tgross
Member

tgross commented Mar 8, 2022

Hi @RosieBaish! I was able to more-or-less confirm the behavior you're seeing, and as it turns out the behavior on client restart is only a symptom of a different problem.

Let's use the following jobspec, so that we have a fairly simple application with a bridge network (which requires the pause container, nomad_init_$uuid) but without the complexity of Connect:

jobspec
job "example" {
  datacenters = ["dc1"]

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

If we run this job and look at the resulting containers we see the main container and the pause container:

$ docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED              STATUS              PORTS     NAMES
60da9b25e9b5   busybox:1                                  "httpd -v -f -p 8001…"   About a minute ago   Up About a minute             http-8dbf2069-8052-340c-377b-e62d83817dca
88911595dd35   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 About a minute ago   Up About a minute             nomad_init_8dbf2069-8052-340c-377b-e62d83817dca

Now let's kill the pause container:

$ docker kill 889
889

$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED              STATUS              PORTS     NAMES
60da9b25e9b5   busybox:1   "httpd -v -f -p 8001…"   About a minute ago   Up About a minute             http-8dbf2069-8052-340c-377b-e62d83817dca

Note that if we wait a few moments, we can see that Nomad isn't recreating the pause container, which at first glance isn't a problem:

$ nomad alloc status 8db
...
Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*www   yes      10.0.2.15:27877 -> 8001
...

$ curl  10.0.2.15:27877
<html>hello, world</html>

But it is a problem, because if the main container fails there's no longer a pause container that owns the network namespace (it's a quirk of Docker that a particular container must "own" the namespace that other containers join, and Nomad inherits that quirk):

$ docker kill 60d
60d

$ nomad alloc status 8db
...
Recent Events:
Time                       Type            Description
2022-03-08T09:57:49-05:00  Killing         Sent interrupt. Waiting 5s before force killing
2022-03-08T09:57:49-05:00  Not Restarting  Error was unrecoverable
2022-03-08T09:57:49-05:00  Driver Failure  Failed to start container 38e4296e7d221f7d4f1dae3ed888e87f967e84ee865efb9954cac036320f411b: API error (409): cannot join network of a non running container: 88911595dd35c2bc2e80f70325ea6485795339c56c70a59f585626016885b85f
2022-03-08T09:57:49-05:00  Restarting      Task restarting in 18.16840979s
2022-03-08T09:57:49-05:00  Terminated      Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2022-03-08T09:53:31-05:00  Started         Task started by client
2022-03-08T09:53:31-05:00  Task Setup      Building Task Directory
2022-03-08T09:53:30-05:00  Received        Task received by client

Nomad will detect this condition and reschedule the allocation:

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
2976d256  23562ac8  web         0        run      running  3s ago     2s ago
8dbf2069  003c3bb7  web         0        stop     failed   4m52s ago  3s ago

Now let's try to reproduce the client restart behavior you saw. I'll head over to the new alloc and kill the pause container there:

$ docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS         PORTS     NAMES
98b53b629172   busybox:1                                  "httpd -v -f -p 8001…"   9 minutes ago   Up 9 minutes             http-2976d256-4a5d-2062-e639-27a9502c585d
ddfa6cfa5593   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 9 minutes ago   Up 9 minutes             nomad_init_2976d256-4a5d-2062-e639-27a9502c585d

$ docker kill ddf
ddf

$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED         STATUS         PORTS     NAMES
98b53b629172   busybox:1   "httpd -v -f -p 8001…"   9 minutes ago   Up 9 minutes             http-2976d256-4a5d-2062-e639-27a9502c585d

Then I restarted the Nomad client and got the following:

$ docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS          PORTS     NAMES
b705ab4232c5   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 6 seconds ago    Up 5 seconds              nomad_init_2976d256-4a5d-2062-e639-27a9502c585d
98b53b629172   busybox:1                                  "httpd -v -f -p 8001…"   10 minutes ago   Up 10 minutes             http-2976d256-4a5d-2062-e639-27a9502c585d

But if I wait a moment, the pause container fails, leaving just the old container running:

$ docker ps
CONTAINER ID   IMAGE       COMMAND                  CREATED          STATUS          PORTS     NAMES
98b53b629172   busybox:1   "httpd -v -f -p 8001…"   10 minutes ago   Up 10 minutes             http-2976d256-4a5d-2062-e639-27a9502c585d

What happened? Let's ask the allocation:

$ nomad alloc status 297
...

Recent Events:
Time                       Type           Description
2022-03-08T10:09:09-05:00  Killing        Sent interrupt. Waiting 5s before force killing
2022-03-08T10:09:09-05:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed
2022-03-08T09:58:20-05:00  Started        Task started by client
2022-03-08T09:58:20-05:00  Task Setup     Building Task Directory
2022-03-08T09:58:19-05:00  Received       Task received by client

The full error state from the client logs is as follows:

2022-03-08T15:09:06.261Z [DEBUG] client.driver_mgr.docker: failed to create container: driver=docker container_name=nomad_init_2976d256-4a5d-2062-e639-27a9502c585d image_name=gcr.io/google_containers/pause-amd64:3.1 image_id=gcr.io/google_containers/pause-amd64:3.1 attempt=1 error="container already exists"
2022-03-08T15:09:06.286Z [INFO] client.driver_mgr.docker: purged container: driver=docker container_id=ddfa6cfa55932e6e2483883cf721917c66e74316e5f615d883b57df0dae012d7
2022-03-08T15:09:06.748Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed" attempt=1
2022-03-08T15:09:07.981Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=2
2022-03-08T15:09:09.536Z [WARN] client.alloc_runner.runner_hook: failed to configure network: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d err="plugin type="bridge" failed (add): container veth name provided (eth0) already exists" attempt=3
2022-03-08T15:09:09.536Z [ERROR] client.alloc_runner: prerun failed: alloc_id=2976d256-4a5d-2062-e639-27a9502c585d error="pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.69 has been allocated to 2976d256-4a5d-2062-e639-27a9502c585d, duplicate allocation is not allowed"
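The logs above show the client retrying the network hook a few times (attempt=1, 2, 3) before prerun gives up and surfaces the error. As an illustrative model only (this is not Nomad's actual implementation, and the function and error names here are made up), the retry-then-fail pattern looks roughly like this:

```python
import time

def with_retries(fn, attempts=3, backoff=0.0):
    """Call fn until it succeeds or attempts run out, sleeping a
    growing backoff between tries. Illustrative sketch of the
    client's network-hook retry loop, not Nomad's actual code."""
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RuntimeError as err:
            last_err = err
            if attempt < attempts:
                time.sleep(backoff * attempt)
    # After the final attempt, surface the last error the way the
    # "prerun failed" log line does above.
    raise RuntimeError(f'pre-run hook "network" failed: {last_err}')

attempts_made = []

def configure_network():
    """Stand-in for the bridge plugin call that keeps failing."""
    attempts_made.append(1)
    raise RuntimeError("duplicate allocation is not allowed")

try:
    with_retries(configure_network, attempts=3)
except RuntimeError as err:
    print(len(attempts_made), err)  # all attempts used, then the wrapped error
```

The key behavior, matching the logs, is that every attempt fails the same way: retrying cannot help, because the stale IPAM allocation and veth from the dead pause container are still in place.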

So there are two issues here:

  • Nomad isn't recreating the pause container when it should be. I'm not sure there's a good option here for us because of the way that Docker expects us to create the new container in the namespace of another container, rather than assigning the new container to a pre-existing namespace.
  • When Nomad restores the allocation after a restart, it doesn't correctly handle the case where the pause container is gone, and should either recreate the pause container (which is tricky because of the previous point), or it should replace the alloc entirely.
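The Docker quirk in the first bullet can be demonstrated with the plain Docker CLI, outside of Nomad entirely. This is an illustrative sketch (the container names are made up, and it assumes a local Docker daemon and the images used earlier in this thread):

```shell
# Create a namespace "owner" (stand-in for the pause container).
docker run -d --name ns-owner gcr.io/google_containers/pause-amd64:3.1

# Join a second container to the owner's network namespace.
docker run -d --name joiner --network container:ns-owner busybox:1 sleep 300

# Kill the owner. The joiner keeps its namespace for now, but any
# attempt to start another container against it fails with the same
# 409 seen above: "cannot join network of a non running container".
docker kill ns-owner
docker run --rm --network container:ns-owner busybox:1 true
```

There is no Docker API to re-point `--network container:<id>` at a fresh owner after the fact, which is why "just recreate the pause container" is not straightforward.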

@RosieBaish there's definitely a bug here, but I'm going to need to take it back to the team to figure out a path forward and set aside some time for a fix. I'm going to mark this issue for roadmapping. Thanks so much for the thorough bug report!

@tgross tgross changed the title Failure in nomad_init_<UUID> container leads to orphaned containers failed pause containers result in failed task restarts and orphaned tasks Mar 8, 2022
@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Mar 8, 2022
@tgross tgross removed their assignment Mar 8, 2022
@shoenig
Member

shoenig commented Jan 9, 2023

I believe we can effectively minimize this failure scenario by configuring the pause container with an auto-restart policy. Technically there will still be a small window during the restart where, if the main task fails, it will not be able to join the network namespace. But if your pause container and main task fail at the same time, most likely something has gone very sideways on the node that Nomad isn't going to be able to handle anyway. The reality is that Docker's handling of network namespaces is very poor - if surviving this scenario is important, then Docker is not a suitable tool.
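The effect of the fix can be sketched with the plain Docker CLI (this is an illustration of the restart policy's semantics, not how Nomad configures the container internally; the container name is made up and a local Docker daemon is assumed):

```shell
# "unless-stopped" tells dockerd to restart the container whenever it
# dies for any reason other than an explicit `docker stop`.
docker run -d --name pause-demo --restart unless-stopped \
  gcr.io/google_containers/pause-amd64:3.1

# Simulate an accidental failure; dockerd should bring it back.
docker kill pause-demo
sleep 2
docker inspect -f '{{.State.Running}}' pause-demo  # should print true once restarted

# An explicit stop is honored and the container stays down.
docker stop pause-demo
```

This is why the window mentioned above still exists: between the kill and dockerd's restart, a main task that happens to fail cannot rejoin the namespace.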

shoenig added a commit that referenced this issue Jan 9, 2023
This PR modifies the configuration of the networking pause container to include
the "unless-stopped" restart policy. The pause container should always be
restored into a running state until Nomad itself issues a stop command for the
container.

This is not a _perfect_ fix for #12216 but it should cover the 99% use case -
where a pause container gets accidentally stopped / killed for some reason. There
is still a possibility where the pause container and main task container are
stopped and started in an order where the bad behavior persists, but this is
fundamentally unavoidable due to how docker itself abstracts and manages the
underlying network namespace referenced by the containers.

Closes #12216
shoenig added a commit that referenced this issue Jan 9, 2023
shoenig added a commit that referenced this issue Jan 9, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Jan 10, 2023
shoenig added a commit that referenced this issue Jan 10, 2023
philrenaud pushed a commit that referenced this issue Jan 23, 2023
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2023