
Network plugin not informed of missing pods #14940

Closed
caseydavenport opened this issue Oct 1, 2015 · 18 comments
Labels: priority/backlog, sig/node

@caseydavenport
Member

I've noticed a case where the network plugin is not informed of pods which have gone missing. Specifically, if Docker is restarted (or the node rebooted), any containers which had been running will be stopped. The kubelet will notice this and re-create the pod infra container.

However, the network plugin will not have any knowledge that the previous infra container / pod are no longer running.

I've specifically run into this issue in the context of IPAM:

  • If the network plugin is performing IPAM, this can lead to leaked IP addresses, since the plugin does not know to un-assign the IP address from the previous pod.
  • If the plugin is using Docker's IPAM, Docker will assign the same IP address to the new infra container. From the network plugin's perspective, two pods now have the same IP address, which is a problem.

I'm not sure what the ideal fix is here. I think one option would be to declare pod tearDown as idempotent, and whenever the kubelet detects that an infra container should be re-created (was running, is now missing), it calls the network plugin tearDown on the previous container before setUp on the new infra container. A sketch of that flow follows.
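A minimal sketch of that ordering, assuming a plugin interface keyed by pod namespace, name, and container ID (the type and method names here are illustrative, not the actual kubelet API):

```go
package sketch

import "log"

// Illustrative stand-ins for the kubelet's pod and network plugin types.
type Pod struct{ Namespace, Name string }

type NetworkPlugin interface {
	SetUpPod(namespace, name, containerID string) error
	TearDownPod(namespace, name, containerID string) error
}

// recreateInfraContainer shows the proposed ordering: when the kubelet
// notices the old infra container is gone and must be re-created, it calls
// TearDownPod for the old container (which must be idempotent and tolerate
// a missing netns) before SetUpPod for the new one, so the plugin can
// release IPAM state for the dead pod.
func recreateInfraContainer(plugin NetworkPlugin, pod Pod, oldInfraID, newInfraID string) error {
	if oldInfraID != "" {
		if err := plugin.TearDownPod(pod.Namespace, pod.Name, oldInfraID); err != nil {
			// Best effort: a failed teardown of a dead container should not
			// block bringing the pod back up.
			log.Printf("teardown of old infra container %s failed: %v", oldInfraID, err)
		}
	}
	return plugin.SetUpPod(pod.Namespace, pod.Name, newInfraID)
}
```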

@thockin - Any thoughts?

@davidopp added the team/cluster, sig/node, and priority/backlog labels on Oct 2, 2015
@pmorie
Member

pmorie commented Oct 3, 2015

@kubernetes/rh-networking

@danwinship
Contributor

I think one option would be to declare pod tearDown as idempotent, and whenever the kubelet detects that an infra container should be re-created (was running, is now missing) it calls the network plugin tearDown on the previous container before setUp on the new infra container.

That makes sense to me, although I guess the plugin could also just assume that if it gets a setUp on a pod it thinks is already running, it should tearDown the old network state first.
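A rough plugin-side sketch of that alternative, assuming the plugin keeps its own record of which container it last set up for each pod (all names here are hypothetical):

```go
package sketch

type podKey struct{ Namespace, Name string }

// statefulPlugin remembers the container ID it last configured per pod, so
// a SetUp for a pod it already believes is running implies the old state
// is stale and should be torn down first.
type statefulPlugin struct {
	active map[podKey]string
}

func (p *statefulPlugin) SetUp(namespace, name, containerID string) error {
	key := podKey{namespace, name}
	if old, ok := p.active[key]; ok && old != containerID {
		// The kubelet never told us the old container died; clean up now.
		p.tearDownState(key, old)
	}
	p.active[key] = containerID
	return p.configureNetworking(namespace, name, containerID)
}

func (p *statefulPlugin) tearDownState(key podKey, containerID string) {
	// Release the IP, delete the veth, etc. for the stale container.
	delete(p.active, key)
}

func (p *statefulPlugin) configureNetworking(namespace, name, containerID string) error {
	// Allocate an IP, create the veth, etc. for the new container.
	return nil
}
```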

@thockin
Member

thockin commented Oct 7, 2015

We have a few edge cases we should consider holistically. We'll need to get test coverage for all of these.

Case 1:
Pod created
Pod destroyed

Case 2:
Pod created
Kubelet restarts
Kubelet rediscovers pod from source
Pod destroyed

Case 3:
Pod created
Pod infra container dies
???
Pod recreated

Case 4:
Pod created
Kubelet goes down
Pod infra container dies
Kubelet comes up
Kubelet rediscovers pod from source
???
Pod recreated

Case 5:
Pod created
Kubelet goes down
Pod is destroyed at source
Kubelet comes up
???
Pod destroyed

Case 6:
Pod created
Kubelet goes down
Pod is destroyed at source
Pod infra container dies
Kubelet comes up
???
Pod destroyed

In case 6 we have an analogous problem in volumes. We don't have a pod spec any more, because we don't have a local checkpoint, so we leave just enough breadcrumbs around to do cleanup. In the network plugin case we have information that a pod used to exist and what its UID is, but not its name or namespace or docker ID - that seems to be the pathologically bad case. If we can solve for that, case 5 is also handled.

This bug in particular seems to be about cases 3 and 4, which have enough information if we just define the semantics we need.

I am out today, but I wanted to weigh in just a bit.


@caseydavenport
Member Author

That makes sense to me, although I guess the plugin could also just assume that if it gets a setUp on a pod it thinks is already running, it should tearDown the old network state first.

The network plugin isn't necessarily going to have the state necessary to make that decision, since it is only informed of the ID for the new pod, and has no way to connect that to the old pod that is no longer running.

@luke-mino-altherr

I have been thinking about a solution to this issue. At first, I was hoping there would be a simple fix. I found that if the pod infra container is removed but other containers belonging to that pod are still up, we can extract the pod infra container ID (the Docker ID) from ResolvConfPath, an attribute of the Container object returned by go-dockerclient, on the remaining containers (see the sketch below). The problem with this approach is that it only solves cases 3 and 4, not cases 5 and 6, where all of the pod's containers are completely destroyed.
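A hedged sketch of that extraction, assuming Docker's default on-disk layout of /var/lib/docker/containers/&lt;container-id&gt;/resolv.conf (which may differ across Docker versions and configurations):

```go
package sketch

import (
	"fmt"
	"path/filepath"
	"regexp"
)

// Docker full container IDs are 64 hex characters.
var dockerIDRe = regexp.MustCompile(`^[0-9a-f]{64}$`)

// infraIDFromResolvConfPath recovers the infra container's Docker ID from
// the ResolvConfPath reported for one of the pod's remaining containers,
// since every container in the pod shares the infra container's resolv.conf.
func infraIDFromResolvConfPath(resolvConfPath string) (string, error) {
	id := filepath.Base(filepath.Dir(resolvConfPath)) // .../containers/<id>/resolv.conf
	if !dockerIDRe.MatchString(id) {
		return "", fmt.Errorf("unexpected resolv.conf path %q", resolvConfPath)
	}
	return id, nil
}
```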

A possible solution that may cover all cases is adding a new field to the PodStatus API object called PodInfraContainerID. We could set it to the Docker ID of the pod infra container after the infra container is created, and read it later in the KillPod function of the kubelet/dockertools manager when calling the network plugin's teardown function.

@thockin does this seem like a reasonable approach? I wanted your thoughts before I went ahead and started to implement the fix.

@ssergiienko

It looks like something similar happens in my #18967, but for the exec network plugin. Just leaving it here.

@thockin
Member

thockin commented Jul 6, 2016

What's the status on this? It's 6 months old - is it still valid?

@caseydavenport
Member Author

I believe this issue is still valid, but I'm not aware of anyone working on it.

For Calico we've been able to work around it for the cases we were seeing (3 and 4 above). I haven't tested 5 and 6 personally.

@freehan
Contributor

freehan commented Jul 6, 2016

We are basically looking at reconciliation and disaster recovery. Currently, network plugins only have a set of hooks that the kubelet can trigger; we need a sync loop of some sort for reconciliation. Also, we are dealing with multiple runtimes. I believe the new runtime interface can simplify things, but this needs major rework to get right.

@freehan
Contributor

freehan commented Jul 6, 2016

If we want to do reconciliation, we have to know the current state. The best way I can think of is through a CNI Status interface.

For bridge + host-local IPAM, we can cheat by cleaning up the IP checkpoint files (a rough sketch below). I think this is the only thing that may be leaking, right? @dcbw
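A rough sketch of that cleanup, assuming host-local's default checkpoint layout of one file per lease under /var/lib/cni/networks/&lt;network&gt;/, named by IP and containing the owning container ID (worth verifying against the CNI version in use):

```go
package sketch

import (
	"os"
	"path/filepath"
	"strings"
)

// gcHostLocalLeases removes lease files whose owning container is no longer
// alive, as reported by the supplied predicate. This is the "cheat" of
// cleaning up leaked IP checkpoints out from under host-local IPAM.
func gcHostLocalLeases(networkDir string, containerAlive func(id string) bool) error {
	entries, err := os.ReadDir(networkDir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// Skip host-local's own bookkeeping (e.g. the last-reserved-IP file).
		if e.IsDir() || strings.HasPrefix(e.Name(), "last_reserved_ip") {
			continue
		}
		path := filepath.Join(networkDir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		if id := strings.TrimSpace(string(data)); !containerAlive(id) {
			_ = os.Remove(path) // best effort: this lease leaked
		}
	}
	return nil
}
```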

@danwinship
Contributor

If we want to do reconciliation, we have to know the current state. The best way I can think of is through a CNI Status interface.

That assumes that Kubernetes knows all of the state that the plugin cares about. In OpenShift, our plugin has an "Update" method and we just call it on all pods when a reconciliation is needed (i.e., when the kubelet is [re]started). (We don't currently deal with the missing-pod problem.)

@bprashanth
Contributor

This is pretty icky.

  1. Load a node up with pods
  2. Reboot/upgrade/do anything to the node that preserves the IPAM database
  3. No new pods are allowed

The node doesn't seem to recover, and we haven't documented resuscitation.

@bprashanth
Contributor

Hmm, maybe we can't docker inspect the container, so we return without invoking tearDown? https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/dockertools/docker_manager.go#L1428

@bprashanth
Contributor

bprashanth commented Oct 7, 2016

Also I think there are different cases we're conflating into this bug:

  1. Kubelet knows about the old and new container: we should always get tearDown, and we should always release the IP even if the veth, netns, or pid is missing
  2. Kubelet doesn't know (i.e. it was down while pods were deleted, etc.): this is a little harder to address; maybe we need snapshotting or periodic GC

@dcbw
Member

dcbw commented Oct 11, 2016

  1. Kubelet knows about the old and new container: we should always get tearDown, and we should always release the IP even if the veth, netns, or pid is missing

@bprashanth you can't always do that, because it's IPAM method dependent. You may need to send a DHCP release or some other operation that requires the netns around. I suppose we could amend the CNI spec to say that the IPAM plugin should release the IP without CNI_NETNS if it can, but there are certainly edge cases where that's not possible.

  2. Kubelet doesn't know (i.e. it was down while pods were deleted, etc.): this is a little harder to address; maybe we need snapshotting or periodic GC

Whether we can clean up correctly by storing details (like the netns) somewhere, so that we can tear down networking when the container is dead and the kubelet restarts, depends on whether the namespace is mounted somewhere.

IIRC, with CNI/kubenet + docker, Docker creates the netns and cleans it up when the infra container goes away. So if the container goes away while the kubelet is dead, we cannot get the netns on kubelet restart and we may not be able to cleanly release IPAM.

With CNI/kubenet+rkt the namespace is bind-mounted and will not be removed until explicitly removed by kubenet or GC-ed by the rkt runtime somehow. Since netns creation is under control of the rkt runtime I think we can guarantee that IPAM release happens cleanly.

@bprashanth
Contributor

There were 2 problems when this issue was initially filed:

  1. If the infra container was restarted multiple times during a pod's lifetime, the network plugin didn't get a tearDown request for each exited infra container. It would just get one tearDown when the pod was deleted.
  2. Even if the kubelet did (1), we have no way of finding an exited container's netns in a cross-runtime way (we don't track the old pid, and docker inspect won't show it for an exited container).

The state of things today is a little different:

  1. Kubelet will deliver tearDown events for each infra container, so the first problem is fixed
  2. The second problem is still not fixed
  3. Kubenet does an IP GC hack to work around the issue (kubenet/kubelet leaks ips on docker restart #34278)

I agree with the previous comment: we now need to either remember the netns somewhere (sketched below), reverse-engineer it from information available on exited containers (i.e. not the pid), or release resources without it.
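"Remember the netns somewhere" could look roughly like the checkpoint below: record the netns path (plus pod coordinates) at SetUp time so a restarted kubelet can still drive teardown after the container has exited. Purely illustrative; the file location and format are assumptions, and the fix that eventually merged (referenced further down) took a different route.

```go
package sketch

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Assumed checkpoint location; not a real kubelet path.
const checkpointDir = "/var/lib/kubelet/network-checkpoints"

type netCheckpoint struct {
	Namespace   string `json:"namespace"`
	Name        string `json:"name"`
	ContainerID string `json:"containerID"`
	NetnsPath   string `json:"netnsPath"`
}

// saveCheckpoint is called after a successful SetUp so the netns path
// survives kubelet restarts and container exits.
func saveCheckpoint(c netCheckpoint) error {
	if err := os.MkdirAll(checkpointDir, 0o700); err != nil {
		return err
	}
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(checkpointDir, c.ContainerID+".json"), data, 0o600)
}

// loadCheckpoint lets a later teardown recover the netns path (and pod
// name/namespace) for a container that no longer exists in the runtime.
func loadCheckpoint(containerID string) (netCheckpoint, error) {
	var c netCheckpoint
	data, err := os.ReadFile(filepath.Join(checkpointDir, containerID+".json"))
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(data, &c)
}
```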

@dcbw
Member

dcbw commented Nov 17, 2016

@bprashanth how is (1), "kubelet will deliver a tearDown event for each infra container", fixed? Can you point me to a commit that did that?

@dcbw
Member

dcbw commented Nov 17, 2016

Also, in CNI upstream we're discussing adding language that DEL should be best-effort and that even if the netns isn't present, the plugin should still clean up whatever it can including IPAM. That would get rid of a couple of blocks here, and I think is appropriate.

The major problem people have is likely IPAM leases not being GC-ed when the infra pod is gone.
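In plugin terms, best-effort DEL would look something like the sketch below: release the IPAM lease and host-side state keyed by container ID, and only touch the netns if it still exists (the helper functions are hypothetical placeholders for whatever a real plugin does):

```go
package sketch

import "os"

// cmdDel sketches best-effort DEL semantics: a missing netns is not an
// error, and IPAM is released regardless so leases cannot leak.
func cmdDel(containerID, netnsPath string) error {
	// Releasing the lease must not require the netns.
	if err := releaseIPAM(containerID); err != nil {
		return err
	}
	// Host-side cleanup (veth, routes, rules) keyed by container ID.
	cleanupHostSide(containerID)

	// Only enter the netns if it is still around.
	if netnsPath != "" {
		if _, err := os.Stat(netnsPath); err == nil {
			cleanupInsideNetns(netnsPath)
		}
	}
	return nil
}

// Hypothetical helpers standing in for the plugin's actual cleanup steps.
func releaseIPAM(containerID string) error { return nil }
func cleanupHostSide(containerID string)   {}
func cleanupInsideNetns(netnsPath string)  {}
```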

dcbw added a commit to dcbw/kubernetes that referenced this issue Nov 18, 2016
The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: kubernetes#14940
Related: kubernetes#35572
k8s-github-robot pushed a commit that referenced this issue Feb 17, 2017
Automatic merge from submit-queue (batch tested with PRs 40505, 34664, 37036, 40726, 41595)

dockertools: call TearDownPod when GC-ing infra pods

The docker runtime doesn't tear down networking when GC-ing pods.
rkt already does so make docker do it too. To ensure this happens,
infra pods are now always GC-ed rather than gating them by
containersToKeep.

This prevents IPAM from leaking when the pod gets killed for
some reason outside kubelet (like docker restart) or when pods
are killed while kubelet isn't running.

Fixes: #14940
Related: #35572