Pause container doesn't restart when Docker restarts, causing dependent containers to fail #10556

Closed
wesgur opened this issue May 10, 2021 · 7 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/client, theme/driver/docker, theme/networking, type/bug

Comments

wesgur commented May 10, 2021

Nomad version

Nomad v1.0.1

Operating system and Environment details

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Issue

I have a Nomad job with a task group running in network mode bridge. When we restart Docker, the job fails to come up with the following error:

Failed to start container ***(container id)***: API error (409): cannot join network of a non running container: ***(pause container id)***

The task is successfully allocated when I first run the job. When Docker is restarted, both the pause container and the task container fail.

Reproduction steps

  1. Run a Nomad job with network mode bridge
  2. Restart Docker (systemctl restart docker); see the sketch below
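
A shell-level sketch of these steps (the container IDs are placeholders, and the exact name and image of the pause container are Nomad implementation details):

# 1. Run a job whose group uses bridge networking
nomad job run <job file>

# 2. Confirm both the task container and the pause (infra) container are up
docker ps

# 3. Restart the Docker daemon; this stops all of its containers
sudo systemctl restart docker

# 4. The pause container stays stopped, so the restarted task container fails with
#    "API error (409): cannot join network of a non running container: <pause container id>"
docker ps -a
nomad alloc status <alloc id>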

Expected Result

The Nomad job keeps running: both the task container and the pause container are restarted properly.

Actual Result

The Nomad job fails: both the task container and the pause container are left in a failed state.

Job file (if appropriate)

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "some-id",
  "ParentID": "",
  "Name": "some-id",
  "Type": "system",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "datacenter-id"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "some-name",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "some-name",
          "Driver": "docker",
          "User": "",
          "Config": {
            "image": "haproxy:local",
            "auth_soft_fail": true
          },
          "Services": null,
          "Vault": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 300,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": null,
      "Affinities": null,
      "Spreads": null,
      "Networks": [
        {
          "Mode": "bridge",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": [
            {
              "Label": "some-name",
              "Value": 7030,
              "To": 0,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],
      "Services": [
        {
          "Name": "some-id",
          "TaskName": "",
          "PortLabel": "some-name",
          "AddressMode": "auto",
          "EnableTagOverride": false,
          "Tags": null,
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null
        }
      ],
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1620665826438678300,
  "CreateIndex": 27,
  "ModifyIndex": 27,
  "JobModifyIndex": 27
}
@drewbailey
Contributor

Hi @wesgur, thanks for reporting. I was able to reproduce this and am including some steps below, as well as an example of it working normally without bridge networking. I'll talk with the team to explore possible solutions and provide an update shortly.

Job file:
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Failure (bridge networking):
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status 7e1
ID                     = 7e123a27-0762-8fd5-c7cf-1b0f33a81129
Eval ID                = 6248a15a
Name                   = example.cache[0]
Node ID                = bd41f3b0
Node Name              = linux
Job ID                 = example
Job Version            = 0
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1m20s ago
Modified               = 22s ago
Deployment ID          = d310f240
Deployment Health      = healthy
Reschedule Eligibility = 4s from now

Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*db    yes      127.0.0.1:31671 -> 6379

Task "redis" is "dead"
Task Resources
CPU        Memory           Disk     Addresses
3/500 MHz  984 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2021-05-12T17:36:21Z
Finished At    = 2021-05-12T17:37:12Z
Total Restarts = 1
Last Restart   = 2021-05-12T17:36:55Z

Recent Events:
Time                  Type            Description
2021-05-12T17:37:12Z  Killing         Sent interrupt. Waiting 5s before force killing
2021-05-12T17:37:12Z  Not Restarting  Error was unrecoverable
2021-05-12T17:37:12Z  Driver Failure  Failed to start container 4dc2e7fdad7c9f79b70c0c56b56006c5d24217817b0d409c4545ddf6d58b2df3: API error (409): cannot join network of a non running container: eff2639515d73f8fe16f952e8e022afad4a370d56ab0448dd6947e1187fc6053
2021-05-12T17:36:55Z  Restarting      Task restarting in 17.663696471s
2021-05-12T17:36:55Z  Terminated      Exit Code: 0
2021-05-12T17:36:21Z  Started         Task started by client
2021-05-12T17:36:20Z  Task Setup      Building Task Directory
2021-05-12T17:36:19Z  Received        Task received by client

Non-bridge pass (host networking):
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status a48
ID                  = a486ec41-babc-bcec-6804-269b9edfaa84
Eval ID             = 6e8f58f4
Name                = example.cache[0]
Node ID             = db590030
Node Name           = linux
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 1m8s ago
Modified            = 0s ago
Deployment ID       = 39ef9c25
Deployment Health   = healthy

Allocation Addresses
Label  Dynamic  Address
*db    yes      127.0.0.1:24558 -> 6379

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  988 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2021-05-12T17:34:04Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2021-05-12T17:33:45Z

Recent Events:
Time                  Type        Description
2021-05-12T17:34:04Z  Started     Task started by client
2021-05-12T17:33:45Z  Restarting  Task restarting in 17.29609191s
2021-05-12T17:33:45Z  Terminated  Exit Code: 0
2021-05-12T17:33:05Z  Started     Task started by client
2021-05-12T17:32:57Z  Driver      Downloading image
2021-05-12T17:32:57Z  Task Setup  Building Task Directory
2021-05-12T17:32:57Z  Received    Task received by client

wesgur commented May 12, 2021

Thanks @drewbailey for the quick follow-up. Would you be able to provide the job file that you used for the non-bridge pass? I'd like to take a look into it 👍

@drewbailey
Contributor

Sure thing, it was just the equivalent of nomad job init --short example.nomad. From the previous jobspec, just remove mode entirely, which will fall back to host networking (a sketch of the resulting network stanza follows the diff): https://www.nomadproject.io/docs/job-specification/network#mode

-      mode = "bridge"
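
For completeness, the resulting network stanza once mode is removed; this is just the group from the jobspec above minus that one line, falling back to host networking:

group "cache" {
  network {
    # no "mode" set, so the group falls back to the default host networking
    port "db" {
      to = 6379
    }
  }

  # the "redis" task is unchanged from the jobspec above
}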

wesgur commented May 13, 2021

Thanks for that. Really appreciate you looking into this. I have tried using the fallback network mode (host), but I wasn't able to get it working. This is a pretty critical bug in our system currently; would you be able to let us know what the priority on this will be once you know? Also, would you happen to know if it's possible to put a Nomad job on a CNI network but have it exposed on interface eth0? We may be able to work around this problem if that is possible.

@drewbailey drewbailey moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021
@drewbailey drewbailey removed this from Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021
@drewbailey
Contributor

@wesgur I'm unable to provide an exact time frame. The team is hoping to address it in the next minor release, depending on prioritization and capacity. For a CNI workaround, you may want to check out the portmap plugin here: https://www.cni.dev/plugins/current/meta/portmap/

I'd also look into host network mode, since that is the default.
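
To sketch the CNI route (treat this as a rough illustration only: the network name mynet, the bridge name, and the subnet are made up, and whether your Nomad version forwards the group's port mappings to the plugin should be checked against the docs): drop a conflist that chains the bridge and portmap plugins into the client's CNI config directory (cni_config_dir, which defaults to /opt/cni/config), then reference it from the group with network { mode = "cni/mynet" }.

{
  "cniVersion": "0.4.0",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "mynet0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "172.26.64.0/20",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true },
      "snat": true
    }
  ]
}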

@drewbailey drewbailey removed their assignment May 13, 2021
@drewbailey drewbailey added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) and theme/networking labels and removed the stage/needs-investigation label May 13, 2021

tgross commented Jun 28, 2021

Just by way of follow-up, the root technical cause here is that for tasks with network.mode = "bridge" on Linux, Nomad creates the network namespace. Because Docker containers can only reference network namespaces owned by another Docker container, we create the pause container as a way of referencing that namespace (i.e., the network mode for Docker tasks will default to container:<pause container ID> when network.mode = "bridge"). (For internal folks, see RFC NMD-035.)
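
You can see that relationship on a healthy allocation, for example (the container IDs are placeholders):

docker inspect -f '{{ .HostConfig.NetworkMode }}' <task container id>
# prints: container:<pause container id>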

When dockerd is restarted, all its containers are stopped. This is unfortunate and something we worked hard to avoid in Nomad itself -- when you restart the Nomad client, tasks continue to run and we restore "handles" to those tasks when Nomad comes back up. So when dockerd is restarted, the containers exit and Nomad restarts them. In other scenarios where Nomad has a container it's using to support tasks, the container is run as a sidecar task and once it's launched it's visible as a Nomad task. You can see this with Consul Connect sidecars.

But the pause container is not run as a Nomad task. I suspect there are two reasons for this:

  • At the time we did the networking changes (0.10), we did not yet have a notion of "prestart" tasks (which shipped in 0.11).
  • Even if it were a prestart task, the other prestart tasks would need to wait for the pause container in order to have the correct networking setup, which introduces a pre-prestart lifecycle phase and a whole lot more complexity to the alloc runner.

Some approaches that might work here, most of which introduce some risks of backwards compatibility problems:

  • Add a HostConfig.RestartPolicy value to the pause container so that dockerd will restart the container on its own (roughly illustrated in the sketch after this list).
  • Introduce a new type of handle stored in the client state store which we can use to restore the pause container from Nomad.
  • Detect the situation we're seeing here and recreate the network namespace and pause container from scratch. This will be challenging to get right for mixed-task-driver scenarios.
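
As a rough illustration of the first option only (not necessarily how the fix will be implemented), the effect can be approximated by hand on an existing allocation:

# Ask dockerd to keep restarting the pause container on its own;
# <pause container id> is the ID shown in the 409 error above.
docker update --restart=unless-stopped <pause container id>

# After "systemctl restart docker", dockerd brings the pause container back up,
# so the task container can rejoin its network namespace when Nomad restarts it.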

lgfa29 commented Feb 3, 2023

Closed by #15732.

@lgfa29 lgfa29 closed this as completed Feb 3, 2023