
docker: stop network pause container of lost alloc after node restart #17455

Merged · 1 commit · Jun 9, 2023

Conversation

shoenig (Member) commented Jun 7, 2023

This PR fixes a bug where the docker network pause container would not be
stopped and removed in the case where a node is restarted, the alloc is
moved to another node, and the node comes back up. See the issue below for
full repro conditions.

Basically, in the DestroyNetwork PostRun hook we would depend on the
NetworkIsolationSpec field not being nil - which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state and previously
would not be able to find the pause container to remove. Now we manually
find the pause container by scanning all containers and looking for the
associated allocID.

Fixes #17299
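
For readers who want to see the shape of that scan, here is a minimal sketch, not this PR's code: it finds and removes the pause container for an alloc by name, assuming a v23-era github.com/docker/docker client (Nomad's driver uses its own client plumbing). removePauseContainer and its signature are hypothetical; the one grounded detail is the nomad_init_<allocID> naming, visible in the docker ps output in the spot check below.

// Sketch: locate the pause container for an alloc by its well-known
// name and force-remove it. Assumes the official Docker Engine Go SDK.
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func removePauseContainer(ctx context.Context, cli *client.Client, allocID string) error {
	// Include stopped containers; after a reboot the pause container
	// may be in any state.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true})
	if err != nil {
		return err
	}
	want := "nomad_init_" + allocID
	for _, c := range containers {
		for _, name := range c.Names {
			// The SDK reports names with a leading slash.
			if strings.TrimPrefix(name, "/") != want {
				continue
			}
			// Force stops the container first if it is still running.
			return cli.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{Force: true})
		}
	}
	return fmt.Errorf("no pause container found for alloc %s", allocID)
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	if err := removePauseContainer(context.Background(), cli, "99a82eee-3965-28cf-b21d-43ea8db1b03f"); err != nil {
		fmt.Println(err)
	}
}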

shoenig (Member Author) commented Jun 7, 2023

Spot check (all we have since this requires reboots)

Setup: 3 nodes (1 server + 2 clients; nomad node status lists only the client nodes)

ubuntu@ip-172-31-25-137:~$ nomad node status
ID        Node Pool  DC   Name              Class   Drain  Eligibility  Status
ded51f46  default    dc1  ip-172-31-19-192  <none>  false  eligible     ready
faf9c60b  default    dc1  ip-172-31-24-55   <none>  false  eligible     ready

Basic docker redis job with bridge network mode

job "redis" {

  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }

    task "redis" {
      driver = "docker"

      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Run job. Show docker ps -a on both nodes.

ubuntu@ip-172-31-19-192:~$ sudo docker ps -a
CONTAINER ID   IMAGE                                      COMMAND                  CREATED          STATUS          PORTS     NAMES
c785e8e57553   redis:7                                    "docker-entrypoint.s…"   50 seconds ago   Up 49 seconds             redis-99a82eee-3965-28cf-b21d-43ea8db1b03f
16ce1901172c   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 50 seconds ago   Up 49 seconds             nomad_init_99a82eee-3965-28cf-b21d-43ea8db1b03f
ubuntu@ip-172-31-24-55:~$ sudo docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
<empty>

Reboot the node where the alloc is running.

ubuntu@ip-172-31-19-192:~$ sudo reboot
Connection to ec2-54-91-84-184.compute-1.amazonaws.com closed by remote host.

Wait for alloc to be replaced on the other node

ubuntu@ip-172-31-25-137:~$ nomad job status redis | tail -n4
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
b41b9b72  faf9c60b  cache       0        run      running  38s ago    9s ago
99a82eee  ded51f46  cache       0        stop     lost     2m27s ago  10s ago

See that docker ps -a on the old node is unclean (nomad agent not yet started)

ubuntu@ip-172-31-19-192:~$ sudo docker ps -a
CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS                          PORTS     NAMES
c785e8e57553   redis:7                                    "docker-entrypoint.s…"   2 minutes ago   Exited (0) About a minute ago             redis-99a82eee-3965-28cf-b21d-43ea8db1b03f
16ce1901172c   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 2 minutes ago   Up 39 seconds                             nomad_init_99a82eee-3965-28cf-b21d-43ea8db1b03f

Start nomad agent

ubuntu@ip-172-31-19-192:~$ sudo service nomad start

Now in docker ps -a we see the containers are cleaned up

ubuntu@ip-172-31-19-192:~$ sudo docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
<empty>

@shoenig shoenig added backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels Jun 7, 2023
@shoenig shoenig added this to the 1.5.x milestone Jun 7, 2023
Comment on lines +493 to +495
if spec == nil {
return nil
}
tgross (Member) commented:

I think I'm having trouble following this. If we return here, how does the Docker driver ever get the DestroyNetwork call so that it can run the findPauseContainer?

shoenig (Member Author) replied Jun 9, 2023:

This path only applies to the other kind of docker-managed [driver] network, where the create/destroy happen over RPC. In the group network case (where we use a pause container), the create/destroy are invoked directly* by the client on the docker driver implementation.

  • where "directly" means through the interface handle, which was itself created via the RPC dispatch.

This code is really difficult to follow.
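
To make the two paths easier to follow, here is a self-contained sketch with stand-in types and function names (not Nomad's actual signatures): the nil guard quoted above lives on the RPC-dispatched driver-network path, while the group-network path reaches the driver's teardown directly, where this PR adds the fallback scan.

// Sketch of the two teardown paths discussed above.
package main

import "fmt"

// networkIsolationSpec stands in for the per-alloc state the client
// loses when the node reboots.
type networkIsolationSpec struct {
	sandboxContainerID string
}

// rpcDestroyNetwork models the RPC-dispatched driver-network path,
// where the nil guard (lines +493 to +495) lives: with no spec there
// is nothing to tear down over RPC, so it returns early.
func rpcDestroyNetwork(spec *networkIsolationSpec) error {
	if spec == nil {
		return nil
	}
	// ... dispatch DestroyNetwork over the plugin RPC ...
	return nil
}

// driverDestroyNetwork models the group-network path, invoked directly
// on the docker driver, where this PR adds the fallback scan.
func driverDestroyNetwork(allocID string, spec *networkIsolationSpec) error {
	id := ""
	if spec != nil {
		id = spec.sandboxContainerID
	}
	if id == "" {
		// Spec lost to a reboot: find the pause container by its
		// well-known name, "nomad_init_"+allocID.
		id = findPauseContainer(allocID)
	}
	if id == "" {
		return fmt.Errorf("no pause container for alloc %s", allocID)
	}
	fmt.Println("stop + remove pause container", id)
	return nil
}

// findPauseContainer stands in for the scan added by this PR.
func findPauseContainer(allocID string) string {
	return "16ce1901172c" // pretend the name scan matched
}

func main() {
	// Post-reboot case: spec is nil on both paths, yet the group
	// network teardown still finds the pause container by name.
	_ = rpcDestroyNetwork(nil)
	_ = driverDestroyNetwork("99a82eee-3965-28cf-b21d-43ea8db1b03f", nil)
}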

tgross (Member) left a comment:

LGTM!

Successfully merging this pull request may close these issues:

nomad left pause-amd64 containers alive if drain_on_shutdown is used (#17299)