Pause container doesn't restart when Docker restarts, causing dependent containers to fail #10556

Closed
wesgur opened this issue May 10, 2021 · 7 comments
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/client, theme/driver/docker, theme/networking, type/bug

Comments

wesgur commented May 10, 2021

Nomad version

Nomad v1.0.1

Operating system and Environment details

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Issue

I have a Nomad job with a task group running in network mode bridge. When we restart Docker, the job fails to come up with the following error:

Failed to start container ***(container id)***: API error (409): cannot join network of a non running container: ***(pause container id)***

The task is successfully allocated when I first run the job. When Docker is restarted, both the pause container and the task container fail.

Reproduction steps

  1. Run a Nomad job with network mode bridge
  2. Restart Docker (systemctl restart docker); see the sketch below
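
A shell-level sketch of these steps (the container IDs are placeholders, and the exact name and image of the pause container are Nomad implementation details):

# 1. Run a job whose group uses bridge networking
nomad job run <job file>

# 2. Confirm both the task container and the pause (infra) container are up
docker ps

# 3. Restart the Docker daemon; this stops all of its containers
sudo systemctl restart docker

# 4. The pause container stays stopped, so the restarted task container fails with
#    "API error (409): cannot join network of a non running container: <pause container id>"
docker ps -a
nomad alloc status <alloc id>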

Expected Result

The Nomad job keeps running: both the task container and the pause container are restarted properly.

Actual Result

The Nomad job fails: both the task container and the pause container are left in a failed state.

Job file (if appropriate)

{
  "Stop": false,
  "Region": "global",
  "Namespace": "default",
  "ID": "some-id",
  "ParentID": "",
  "Name": "some-id",
  "Type": "system",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "datacenter-id"
  ],
  "Constraints": null,
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "some-name",
      "Count": 1,
      "Update": null,
      "Migrate": null,
      "Constraints": null,
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 2,
        "Interval": 1800000000000,
        "Delay": 15000000000,
        "Mode": "fail"
      },
      "Tasks": [
        {
          "Name": "some-name",
          "Driver": "docker",
          "User": "",
          "Config": {
            "image": "haproxy:local",
            "auth_soft_fail": true
          },
          "Services": null,
          "Vault": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 300,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 2,
            "Interval": 1800000000000,
            "Delay": 15000000000,
            "Mode": "fail"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": null,
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 10,
            "MaxFileSizeMB": 10
          },
          "Artifacts": null,
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": null,
      "ReschedulePolicy": null,
      "Affinities": null,
      "Spreads": null,
      "Networks": [
        {
          "Mode": "bridge",
          "Device": "",
          "CIDR": "",
          "IP": "",
          "MBits": 0,
          "DNS": null,
          "ReservedPorts": [
            {
              "Label": "some-name",
              "Value": 7030,
              "To": 0,
              "HostNetwork": "default"
            }
          ],
          "DynamicPorts": null
        }
      ],
      "Services": [
        {
          "Name": "some-id",
          "TaskName": "",
          "PortLabel": "some-name",
          "AddressMode": "auto",
          "EnableTagOverride": false,
          "Tags": null,
          "CanaryTags": null,
          "Checks": null,
          "Connect": null,
          "Meta": null,
          "CanaryMeta": null
        }
      ],
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 0,
    "MaxParallel": 0,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1620665826438678300,
  "CreateIndex": 27,
  "ModifyIndex": 27,
  "JobModifyIndex": 27
}
@drewbailey
Contributor

Hi @wesgur, thanks for reporting. I was able to reproduce this and am including some steps below, as well as an example of it working normally without bridge networking. I'll talk with the team to explore possible solutions and provide an update shortly.

Job file:
job "example" {
  datacenters = ["dc1"]

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        ports = ["db"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Failure (bridge networking):
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status 7e1
ID                     = 7e123a27-0762-8fd5-c7cf-1b0f33a81129
Eval ID                = 6248a15a
Name                   = example.cache[0]
Node ID                = bd41f3b0
Node Name              = linux
Job ID                 = example
Job Version            = 0
Client Status          = failed
Client Description     = Failed tasks
Desired Status         = run
Desired Description    = <none>
Created                = 1m20s ago
Modified               = 22s ago
Deployment ID          = d310f240
Deployment Health      = healthy
Reschedule Eligibility = 4s from now

Allocation Addresses (mode = "bridge")
Label  Dynamic  Address
*db    yes      127.0.0.1:31671 -> 6379

Task "redis" is "dead"
Task Resources
CPU        Memory           Disk     Addresses
3/500 MHz  984 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2021-05-12T17:36:21Z
Finished At    = 2021-05-12T17:37:12Z
Total Restarts = 1
Last Restart   = 2021-05-12T17:36:55Z

Recent Events:
Time                  Type            Description
2021-05-12T17:37:12Z  Killing         Sent interrupt. Waiting 5s before force killing
2021-05-12T17:37:12Z  Not Restarting  Error was unrecoverable
2021-05-12T17:37:12Z  Driver Failure  Failed to start container 4dc2e7fdad7c9f79b70c0c56b56006c5d24217817b0d409c4545ddf6d58b2df3: API error (409): cannot join network of a non running container: eff2639515d73f8fe16f952e8e022afad4a370d56ab0448dd6947e1187fc6053
2021-05-12T17:36:55Z  Restarting      Task restarting in 17.663696471s
2021-05-12T17:36:55Z  Terminated      Exit Code: 0
2021-05-12T17:36:21Z  Started         Task started by client
2021-05-12T17:36:20Z  Task Setup      Building Task Directory
2021-05-12T17:36:19Z  Received        Task received by client

Non-bridge pass (host networking):
vagrant@linux:/opt/gopath/src/github.com/hashicorp/nomad$ nomad status a48
ID                  = a486ec41-babc-bcec-6804-269b9edfaa84
Eval ID             = 6e8f58f4
Name                = example.cache[0]
Node ID             = db590030
Node Name           = linux
Job ID              = example
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 1m8s ago
Modified            = 0s ago
Deployment ID       = 39ef9c25
Deployment Health   = healthy

Allocation Addresses
Label  Dynamic  Address
*db    yes      127.0.0.1:24558 -> 6379

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/500 MHz  988 KiB/256 MiB  300 MiB

Task Events:
Started At     = 2021-05-12T17:34:04Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2021-05-12T17:33:45Z

Recent Events:
Time                  Type        Description
2021-05-12T17:34:04Z  Started     Task started by client
2021-05-12T17:33:45Z  Restarting  Task restarting in 17.29609191s
2021-05-12T17:33:45Z  Terminated  Exit Code: 0
2021-05-12T17:33:05Z  Started     Task started by client
2021-05-12T17:32:57Z  Driver      Downloading image
2021-05-12T17:32:57Z  Task Setup  Building Task Directory
2021-05-12T17:32:57Z  Received    Task received by client

wesgur commented May 12, 2021

Thanks @drewbailey for the quick follow-up. Would you be able to provide the job file that you used for the non-bridge pass? I'd like to take a look into it 👍

@drewbailey
Contributor

Sure thing, it was just the equivalent of nomad job init --short example.nomad. From the previous jobspec, just remove mode entirely, which will fall back to host networking (a sketch of the resulting network stanza follows the diff): https://www.nomadproject.io/docs/job-specification/network#mode

-      mode = "bridge"
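
For completeness, the resulting network stanza once mode is removed; this is just the group from the jobspec above minus that one line, falling back to host networking:

group "cache" {
  network {
    # no "mode" set, so the group falls back to the default host networking
    port "db" {
      to = 6379
    }
  }

  # the "redis" task is unchanged from the jobspec above
}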

wesgur commented May 13, 2021

Thanks for that. Really appreciate you looking into this. I have tried using the fallback network mode (host), but I wasn't able to get it working. This is a pretty critical bug in our system currently; would you be able to let us know what the priority on this will be once you know? Also, would you happen to know if it's possible to put a Nomad job on a CNI network but have it exposed on interface eth0? We may be able to work around this problem if that is possible.

@drewbailey drewbailey moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021
@drewbailey drewbailey removed this from Needs Roadmapping in Nomad - Community Issues Triage May 13, 2021
@drewbailey
Contributor

@wesgur I'm unable to provide an exact time frame. The team is hoping to address it in the next minor release, depending on prioritization and capacity. For a CNI workaround, you may want to check out the portmap plugin here: https://www.cni.dev/plugins/current/meta/portmap/

I'd also look into host network mode, since that is the default.
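
To sketch the CNI route (treat this as a rough illustration only: the network name mynet, the bridge name, and the subnet are made up, and whether your Nomad version forwards the group's port mappings to the plugin should be checked against the docs): drop a conflist that chains the bridge and portmap plugins into the client's CNI config directory (cni_config_dir, which defaults to /opt/cni/config), then reference it from the group with network { mode = "cni/mynet" }.

{
  "cniVersion": "0.4.0",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "mynet0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "172.26.64.0/20",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true },
      "snat": true
    }
  ]
}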

@drewbailey drewbailey removed their assignment May 13, 2021
@drewbailey drewbailey added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) and theme/networking labels and removed the stage/needs-investigation label May 13, 2021

tgross commented Jun 28, 2021

Just by way of follow-up, the root technical cause here is that for tasks with network.mode = "bridge" on Linux, Nomad creates the network namespace. Because Docker containers can only reference network namespaces owned by another Docker container, we create the pause container as a way of referencing that namespace (i.e., the network mode for Docker tasks will default to container:<pause container ID> when network.mode = "bridge"). (For internal folks, see RFC NMD-035.)
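
You can see that relationship on a healthy allocation, for example (the container IDs are placeholders):

docker inspect -f '{{ .HostConfig.NetworkMode }}' <task container id>
# prints: container:<pause container id>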

When dockerd is restarted, all its containers are stopped. This is unfortunate and something we worked hard to avoid in Nomad itself -- when you restart the Nomad client, tasks continue to run and we restore "handles" to those tasks when Nomad comes back up. So when dockerd is restarted, the containers exit and Nomad restarts them. In other scenarios where Nomad has a container it's using to support tasks, the container is run as a sidecar task and once it's launched it's visible as a Nomad task. You can see this with Consul Connect sidecars.

But the pause container is not run as a Nomad task. I suspect there are two reasons for this:

  • At the time we did the networking changes (0.10), we did not yet have a notion of "prestart" tasks (which shipped in 0.11).
  • Even if it were a prestart task, the other prestart tasks would need to wait for the pause container in order to have the correct networking setup, which introduces a pre-prestart lifecycle phase and a whole lot more complexity to the alloc runner.

Some approaches that might work here, most of which introduce some risks of backwards compatibility problems:

  • Add a HostConfig.RestartPolicy value to the pause container so that dockerd will restart the container on its own (roughly illustrated in the sketch after this list).
  • Introduce a new type of handle stored in the client state store which we can use to restore the pause container from Nomad.
  • Detect the situation we're seeing here and recreate the network namespace and pause container from scratch. This will be challenging to get right for mixed-task-driver scenarios.
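
As a rough illustration of the first option only (not necessarily how the fix will be implemented), the effect can be approximated by hand on an existing allocation:

# Ask dockerd to keep restarting the pause container on its own;
# <pause container id> is the ID shown in the 409 error above.
docker update --restart=unless-stopped <pause container id>

# After "systemctl restart docker", dockerd brings the pause container back up,
# so the task container can rejoin its network namespace when Nomad restarts it.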

lgfa29 commented Feb 3, 2023

Closed by #15732.

@lgfa29 lgfa29 closed this as completed Feb 3, 2023