docker: stop network pause container of lost alloc after node restart #17455
Spot check (all we have since this requires reboots)
Setup: 3 nodes (1 server + 2 clients)
Basic docker job:
job "redis" {
  group "cache" {
    network {
      mode = "bridge"
      port "db" {
        to = 6379
      }
    }
    task "redis" {
      driver = "docker"
      config {
        image          = "redis:7"
        ports          = ["db"]
        auth_soft_fail = true
      }
      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
Run job. Show
Reboot the node where the alloc is running.
Wait for alloc to be replaced on the other node
See
Start nomad agent
Now
This PR fixes a bug where the docker network pause container would not be stopped and removed when a node is restarted, the alloc is moved to another node, and the original node comes back up. See the issue below for full repro conditions.
In the DestroyNetwork PostRun hook we depended on the NetworkIsolationSpec field not being nil, which is only the case if the Client stays alive all the way from network creation to network teardown. If the node is rebooted we lose that state, and previously we could not find the pause container to remove. Now we find the pause container manually by scanning the containers and looking for the one associated with the allocID.
Fixes #17299
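For readers who want to picture the scan described above, here is a minimal, self-contained sketch. It is not the code from this PR: the `container` type, the label key `com.hashicorp.nomad.alloc_id`, and the `/nomad_init_` name prefix are assumptions made for illustration, and the real driver would obtain the container list from the Docker API rather than from an in-memory slice.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values for illustration only; check the driver source for the real ones.
const (
	allocIDLabel    = "com.hashicorp.nomad.alloc_id" // assumed label key
	pauseNamePrefix = "/nomad_init_"                 // assumed pause container name prefix
)

// container is a hypothetical stand-in holding the subset of
// container-list fields that the scan needs.
type container struct {
	ID     string
	Names  []string
	Labels map[string]string
}

// findPauseContainer returns the ID of the pause container that belongs to
// allocID, or false if none of the listed containers match.
func findPauseContainer(containers []container, allocID string) (string, bool) {
	for _, c := range containers {
		if !hasPausePrefix(c.Names) {
			continue // not a pause container (for example, a task container)
		}
		if c.Labels[allocIDLabel] == allocID {
			return c.ID, true
		}
	}
	return "", false
}

// hasPausePrefix reports whether any of the names looks like a pause container name.
func hasPausePrefix(names []string) bool {
	for _, n := range names {
		if strings.HasPrefix(n, pauseNamePrefix) {
			return true
		}
	}
	return false
}

func main() {
	containers := []container{
		{ID: "c1", Names: []string{"/redis-abc"}, Labels: map[string]string{allocIDLabel: "alloc-1"}},
		{ID: "c2", Names: []string{"/nomad_init_alloc-1"}, Labels: map[string]string{allocIDLabel: "alloc-1"}},
		{ID: "c3", Names: []string{"/nomad_init_alloc-2"}, Labels: map[string]string{allocIDLabel: "alloc-2"}},
	}
	id, ok := findPauseContainer(containers, "alloc-1")
	fmt.Println(id, ok)
}
```

The point of the sketch is that the lookup depends only on what Docker itself still knows after a reboot (container names and labels), not on any state the client held in memory.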
if spec == nil {
	return nil
}
I think I'm having trouble following this. If we return here, how does the Docker driver ever get the DestroyNetwork call so that it can run the findPauseContainer?
This path only applies to the other kind of docker-managed [driver] network, where the create/destroy happen over RPC. In the group network case (where we use a pause container), the create/destroy are invoked directly* by the client on the docker driver implementation.
* where "directly" means through the interface handle backed by the creation of the handle through the RPC dispatch.
This code is really difficult to follow.
LGTM!
This PR fixes a bug where the docker network pause container would not be
stopped and removed when a node is restarted, the alloc is moved to
another node, and the original node comes back up. See the issue below
for full repro conditions.
In the DestroyNetwork PostRun hook we depended on the
NetworkIsolationResource field not being nil, which is only the case
if the Client stays alive all the way from network creation to network
teardown. If the node is rebooted we lose that state, and previously we
could not find the pause container to remove. Now we find the pause
container manually by scanning the containers and looking for the one
associated with the allocID.
Fixes #17299