
Network plugin not informed of missing pods #14940

Closed
caseydavenport opened this issue Oct 1, 2015 · 18 comments
Labels: priority/backlog, sig/node

@caseydavenport
Member

I've noticed a case where the network plugin is not informed of pods which have gone missing. Specifically, if Docker is restarted (or the node rebooted), any containers which had been running will be stopped. The kubelet will notice this and re-create the pod infra container.

However, the network plugin will not have any knowledge that the previous infra container / pod are no longer running.

I've specifically run into this issue in the context of IPAM:

  • If the network plugin is performing IPAM, this can lead to leaked IP addresses, since the plugin does not know to un-assign the IP address from the previous pod.
  • If the plugin is using Docker's IPAM, Docker will assign the same IP address to the new infra container. From the network plugin's perspective, two pods now have the same IP address, which is a problem.

I'm not sure what the ideal fix is here. I think one option would be to declare pod tearDown as idempotent, and whenever the kubelet detects that an infra container should be re-created (was running, is now missing), it calls the network plugin tearDown on the previous container before setUp on the new infra container. A sketch of that flow follows.
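A minimal sketch of that ordering, assuming a plugin interface keyed by pod namespace, name, and container ID (the type and method names here are illustrative, not the actual kubelet API):

```go
package sketch

import "log"

// Illustrative stand-ins for the kubelet's pod and network plugin types.
type Pod struct{ Namespace, Name string }

type NetworkPlugin interface {
	SetUpPod(namespace, name, containerID string) error
	TearDownPod(namespace, name, containerID string) error
}

// recreateInfraContainer shows the proposed ordering: when the kubelet
// notices the old infra container is gone and must be re-created, it calls
// TearDownPod for the old container (which must be idempotent and tolerate
// a missing netns) before SetUpPod for the new one, so the plugin can
// release IPAM state for the dead pod.
func recreateInfraContainer(plugin NetworkPlugin, pod Pod, oldInfraID, newInfraID string) error {
	if oldInfraID != "" {
		if err := plugin.TearDownPod(pod.Namespace, pod.Name, oldInfraID); err != nil {
			// Best effort: a failed teardown of a dead container should not
			// block bringing the pod back up.
			log.Printf("teardown of old infra container %s failed: %v", oldInfraID, err)
		}
	}
	return plugin.SetUpPod(pod.Namespace, pod.Name, newInfraID)
}
```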

@thockin - Any thoughts?

@davidopp added the team/cluster, sig/node, and priority/backlog labels on Oct 2, 2015
@pmorie
Member

pmorie commented Oct 3, 2015

@kubernetes/rh-networking

@danwinship
Contributor

I think one option would be to declare pod tearDown as idempotent, and whenever the kubelet detects that an infra container should be re-created (was running, is now missing) it calls the network plugin tearDown on the previous container before setUp on the new infra container.

That makes sense to me, although I guess the plugin could also just assume that if it gets a setUp on a pod it thinks is already running, it should tearDown the old network state first.
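A rough plugin-side sketch of that alternative, assuming the plugin keeps its own record of which container it last set up for each pod (all names here are hypothetical):

```go
package sketch

type podKey struct{ Namespace, Name string }

// statefulPlugin remembers the container ID it last configured per pod, so
// a SetUp for a pod it already believes is running implies the old state
// is stale and should be torn down first.
type statefulPlugin struct {
	active map[podKey]string
}

func (p *statefulPlugin) SetUp(namespace, name, containerID string) error {
	key := podKey{namespace, name}
	if old, ok := p.active[key]; ok && old != containerID {
		// The kubelet never told us the old container died; clean up now.
		p.tearDownState(key, old)
	}
	p.active[key] = containerID
	return p.configureNetworking(namespace, name, containerID)
}

func (p *statefulPlugin) tearDownState(key podKey, containerID string) {
	// Release the IP, delete the veth, etc. for the stale container.
	delete(p.active, key)
}

func (p *statefulPlugin) configureNetworking(namespace, name, containerID string) error {
	// Allocate an IP, create the veth, etc. for the new container.
	return nil
}
```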

@thockin
Member

thockin commented Oct 7, 2015

We have a few edge cases we should consider holistically. We'll need to get test coverage for all of these.

Case 1:
Pod created
Pod destroyed

Case 2:
Pod created
Kubelet restarts
Kubelet rediscovers pod from source
Pod destroyed

Case 3:
Pod created
Pod infra container dies
???
Pod recreated

Case 4:
Pod created
Kubelet goes down
Pod infra container dies
Kubelet comes up
Kubelet rediscovers pod from source
???
Pod recreated

Case 5:
Pod created
Kubelet goes down
Pod is destroyed at source
Kubelet comes up
???
Pod destroyed

Case 6:
Pod created
Kubelet goes down
Pod is destroyed at source
Pod infra container dies
Kubelet comes up
???
Pod destroyed

In case 6 we have an analogous problem in volumes. We don't have a pod spec any more, because we don't have a local checkpoint, so we leave just enough breadcrumbs around to do cleanup. In the network plugin case we have information that a pod used to exist and what its UID is, but not its name or namespace or docker ID - that seems to be the pathologically bad case. If we can solve for that, case 5 is also handled.

This bug in particular seems to be about cases 3 and 4, which have enough information if we just define the semantics we need.

I am out today, but I wanted to weigh in just a bit.


@caseydavenport
Member Author

That makes sense to me, although I guess the plugin could also just assume that if it gets a setUp on a pod it thinks is already running, it should tearDown the old network state first.

The network plugin isn't necessarily going to have the state necessary to make that decision, since it is only informed of the ID for the new pod, and has no way to connect that to the old pod that is no longer running.

@luke-mino-altherr

I have been thinking about a solution to this issue. At first, I was hoping there would be a simple fix. I found that if the pod infra container is removed but other containers belonging to that pod are still up, we can extract the pod infra container ID (the Docker ID) from ResolvConfPath, an attribute of the Container object returned by go-dockerclient, on the remaining containers (see the sketch below). The problem with this approach is that it only solves cases 3 and 4, not cases 5 and 6, where all of the pod's containers are completely destroyed.
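A hedged sketch of that extraction, assuming Docker's default on-disk layout of /var/lib/docker/containers/&lt;container-id&gt;/resolv.conf (which may differ across Docker versions and configurations):

```go
package sketch

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// Docker full container IDs are 64 hex characters.
var dockerIDRe = regexp.MustCompile(`^[0-9a-f]{64}$`)

// infraIDFromResolvConfPath recovers the infra container's Docker ID from
// the ResolvConfPath reported for one of the pod's remaining containers,
// since every container in the pod shares the infra container's resolv.conf.
func infraIDFromResolvConfPath(resolvConfPath string) (string, error) {
	id := filepath.Base(filepath.Dir(resolvConfPath)) // .../containers/<id>/resolv.conf
	if !dockerIDRe.MatchString(id) {
		return "", fmt.Errorf("unexpected resolv.conf path %q", resolvConfPath)
	}
	return id, nil
}
```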

A possible solution that may cover all cases is adding a new field to the PodStatus API object called PodInfraContainerID. We could set it to the Docker ID of the pod infra container after the infra container is created, and read it later in the KillPod function of the kubelet/dockertools manager when calling the network plugin's teardown function.

@thockin does this seem like a reasonable approach? I wanted your thoughts before I went ahead and started to implement the fix.

@ssergiienko

It looks like something similar happens in my #18967, but for the exec network plugin. Just leaving it here.

@thockin
Member

thockin commented Jul 6, 2016

What's the status on this? It's 6 months old - is it still valid?

@caseydavenport
Member Author

I believe this issue is still valid, but I'm not aware of anyone working on it.

For Calico we've been able to work around it for the cases we were seeing (3 and 4 above). I haven't tested 5 and 6 personally.

@freehan
Contributor

freehan commented Jul 6, 2016

We are basically looking at reconciliation and disaster recovery. Currently, network plugins only have a set of hooks that the kubelet can trigger; we need a sync loop of some sort for reconciliation. Also, we are dealing with multiple runtimes. I believe the new runtime interface can simplify things, but this needs major rework to get right.

@freehan
Contributor

freehan commented Jul 6, 2016

If we want to do reconciliation, we have to know the current state. The best way I can think of is through a CNI Status interface.

For bridge + host-local IPAM, we can cheat by cleaning up the IP checkpoint files (a rough sketch below). I think this is the only thing that may be leaking, right? @dcbw
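A rough sketch of that cleanup, assuming host-local's default checkpoint layout of one file per lease under /var/lib/cni/networks/&lt;network&gt;/, named by IP and containing the owning container ID (worth verifying against the CNI version in use):

```go
package sketch

import (
	"os"
	"path/filepath"
	"strings"
)

// gcHostLocalLeases removes lease files whose owning container is no longer
// alive, as reported by the supplied predicate. This is the "cheat" of
// cleaning up leaked IP checkpoints out from under host-local IPAM.
func gcHostLocalLeases(networkDir string, containerAlive func(id string) bool) error {
	entries, err := os.ReadDir(networkDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// Skip host-local's own bookkeeping (e.g. the last-reserved-IP file).
		if e.IsDir() || strings.HasPrefix(e.Name(), "last_reserved_ip") {
			continue
		}
		path := filepath.Join(networkDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		if id := strings.TrimSpace(string(data)); !containerAlive(id) {
			_ = os.Remove(path) // best effort: this lease leaked
		}
	}
	return nil
}
```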

@danwinship
Contributor

If we want to do reconciliation, we have to know the current state. The best way I can think of is through a CNI Status interface.

That assumes that Kubernetes knows all of the state that the plugin cares about. In OpenShift, our plugin has an "Update" method and we just call it on all pods when a reconciliation is needed (i.e., when the kubelet is [re]started). (We don't currently deal with the missing-pod problem.)

@bprashanth
Contributor

This is pretty icky.

  1. Load a node up with pods
  2. Reboot/upgrade/do anything to the node that preserves the IPAM database
  3. No new pods are allowed

The node doesn't seem to recover, and we haven't documented resuscitation.

@bprashanth
Contributor

Hmm, maybe we can't docker inspect the container, so we return without invoking tearDown? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/docker_manager.go#L1428

@bprashanth
Contributor

bprashanth commented Oct 7, 2016

Also I think there are different cases we're conflating into this bug:

  1. Kubelet knows about the old and new container: we should always get tearDown, and we should always release the IP even if the veth, netns, or pid is missing
  2. Kubelet doesn't know (i.e. it was down while pods were deleted, etc.): this is a little harder to address; maybe we need snapshotting or periodic GC

@dcbw
Member

dcbw commented Oct 11, 2016

  1. Kubelet knows about the old and new container: we should always get tearDown, and we should always release the IP even if the veth, netns, or pid is missing

@bprashanth you can't always do that, because it's IPAM method dependent. You may need to send a DHCP release or some other operation that requires the netns around. I suppose we could amend the CNI spec to say that the IPAM plugin should release the IP without CNI_NETNS if it can, but there are certainly edge cases where that's not possible.

  2. Kubelet doesn't know (i.e. it was down while pods were deleted, etc.): this is a little harder to address; maybe we need snapshotting or periodic GC

Whether we can clean up correctly by storing details (like the netns) somewhere, so that we can tear down networking when the container is dead and the kubelet restarts, depends on whether the namespace is mounted somewhere.

IIRC, with CNI/kubenet + docker, Docker creates the netns and cleans it up when the infra container goes away. So if the container goes away while the kubelet is dead, we cannot get the netns on kubelet restart and we may not be able to cleanly release IPAM.

With CNI/kubenet+rkt the namespace is bind-mounted and will not be removed until explicitly removed by kubenet or GC-ed by the rkt runtime somehow. Since netns creation is under control of the rkt runtime I think we can guarantee that IPAM release happens cleanly.

@bprashanth
Contributor

There were 2 problems when this issue was initially filed:

  1. If the infra container was restarted multiple times during a pod's lifetime, the network plugin didn't get a tearDown request for each exited infra container. It would just get one tearDown when the pod was deleted.
  2. Even if the kubelet did (1), we have no way of finding an exited container's netns in a cross-runtime way (we don't track the old pid, and docker inspect won't show it for an exited container).

The state of things today is a little different:

  1. Kubelet will deliver tearDown events for each infra container, so the first problem is fixed
  2. The second problem is still not fixed
  3. Kubenet does an IP GC hack to work around the issue (kubenet/kubelet leaks ips on docker restart #34278)

I agree with the previous comment: we now need to either remember the netns somewhere (sketched below), reverse-engineer it from information available on exited containers (i.e. not the pid), or release resources without it.
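"Remember the netns somewhere" could look roughly like the checkpoint below: record the netns path (plus pod coordinates) at SetUp time so a restarted kubelet can still drive teardown after the container has exited. Purely illustrative; the file location and format are assumptions, and the fix that eventually merged (referenced further down) took a different route.

```go
package sketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Assumed checkpoint location; not a real kubelet path.
const checkpointDir = "/var/lib/kubelet/network-checkpoints"

type netCheckpoint struct {
	Namespace   string `json:"namespace"`
	Name        string `json:"name"`
	ContainerID string `json:"containerID"`
	NetnsPath   string `json:"netnsPath"`
}

// saveCheckpoint is called after a successful SetUp so the netns path
// survives kubelet restarts and container exits.
func saveCheckpoint(c netCheckpoint) error {
	if err := os.MkdirAll(checkpointDir, 0o700); err != nil {
		return err
	}
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(checkpointDir, c.ContainerID+".json"), data, 0o600)
}

// loadCheckpoint lets a later teardown recover the netns path (and pod
// name/namespace) for a container that no longer exists in the runtime.
func loadCheckpoint(containerID string) (netCheckpoint, error) {
	var c netCheckpoint
	data, err := os.ReadFile(filepath.Join(checkpointDir, containerID+".json"))
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(data, &c)
}
```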

@dcbw
Member

dcbw commented Nov 17, 2016

@bprashanth how is (1), "kubelet will deliver a tearDown event for each infra container", fixed? Can you point me to a commit that did that?

@dcbw
Member

dcbw commented Nov 17, 2016

Also, in CNI upstream we're discussing adding language that DEL should be best-effort and that even if the netns isn't present, the plugin should still clean up whatever it can including IPAM. That would get rid of a couple of blocks here, and I think is appropriate.

The major problem people have is likely IPAM leases not being GC-ed when the infra pod is gone.
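In plugin terms, best-effort DEL would look something like the sketch below: release the IPAM lease and host-side state keyed by container ID, and only touch the netns if it still exists (the helper functions are hypothetical placeholders for whatever a real plugin does):

```go
package sketch

import "os"

// cmdDel sketches best-effort DEL semantics: a missing netns is not an
// error, and IPAM is released regardless so leases cannot leak.
func cmdDel(containerID, netnsPath string) error {
	// Releasing the lease must not require the netns.
	if err := releaseIPAM(containerID); err != nil {
		return err
	}
	// Host-side cleanup (veth, routes, rules) keyed by container ID.
	cleanupHostSide(containerID)

	// Only enter the netns if it is still around.
	if netnsPath != "" {
		if _, err := os.Stat(netnsPath); err == nil {
			cleanupInsideNetns(netnsPath)
		}
	}
	return nil
}

// Hypothetical helpers standing in for the plugin's actual cleanup steps.
func releaseIPAM(containerID string) error { return nil }
func cleanupHostSide(containerID string)   {}
func cleanupInsideNetns(netnsPath string)  {}
```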

dcbw added a commit to dcbw/kubernetes that referenced this issue Nov 18, 2016
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: kubernetes#14940
Related: kubernetes#35572
k8s-github-robot pushed a commit that referenced this issue Feb 17, 2017
Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

dockertools: call TearDownPod when GC-ing infra pods

The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: #14940
Related: #35572