weave on kubernetes - communication to other nodes no longer possible #3825
Comments
The problem recurred, this time after 3 days. I've now enabled debug logging to get more information when it happens again.
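For reference, and not necessarily how it was done in this cluster: one common way to raise the weave router's log level is the EXTRA_ARGS environment variable, which the weave-kube launch script passes through to the router. A minimal sketch of the relevant fragment of the weave-net DaemonSet, assuming the stock container layout:

```yaml
# Sketch only: raise the router's verbosity by passing --log-level=debug
# through EXTRA_ARGS on the "weave" container of the weave-net DaemonSet.
# Names follow the upstream manifest; verify them against your deployment.
spec:
  template:
    spec:
      containers:
        - name: weave
          env:
            - name: EXTRA_ARGS
              value: "--log-level=debug"
```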
And it happened again this morning. The debug log for the broken pod (and some other details) is here: https://gist.github.com/vernimmen-textkernel/7b99aa7c076b4458684669dea4092c3f
And another one: https://gist.github.com/vernimmen-textkernel/a8e3959f2c856ca9519c05640eba7ab0
Where you get a message like this: … we need the logs of the other side, to see why it dropped the connection. Could you please do …
There are no errors or connection drops in the other two gists.
I thought my network problems were related to this issue. In my case, Weave never used sleeve mode; connections between nodes were cancelled and broken because the iptables rules were incorrect. It took me a lot of time to understand and solve this.
Just checked. In my case (a KOPS-managed cluster) the kube-proxy manifest has an xtables.lock file mount; the same goes for weave.
In my case, if I were able to check connectivity (from the weave pod) to the cluster services, I would be able to set up a liveness check.
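As an illustration of that idea, not something already in this thread: the weave router exposes a local status API on 127.0.0.1:6784 (the address the stock manifest uses for its readiness probe), and /status/connections lists each peer connection with its state. A rough liveness-probe sketch along those lines, assuming curl is available in the weave image and that degraded peers show up as "sleeve" or "failed":

```yaml
# Hypothetical fragment for the "weave" container of the weave-net DaemonSet.
# It queries the router's local status API and marks the pod unhealthy when
# the API is unreachable or any peer connection is reported as sleeve/failed.
# Relax the pattern (e.g. match only "failed") if sleeve mode is acceptable.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - >
        out=$(curl -sf http://127.0.0.1:6784/status/connections)
        && ! echo "$out" | grep -Eq 'sleeve|failed'
  initialDelaySeconds: 60
  periodSeconds: 30
  failureThreshold: 5
```

With something like this in place, the manual workaround of deleting the weave pod would be triggered automatically by the kubelet.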
Hi, the weave DaemonSet contains the following mounts. I have removed a lot of other lines (metadata, hostNetwork and so on).
These mounts must also be available to the kube-proxy containers (for me located at …).
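The xtables.lock mount being discussed looks roughly like this in the upstream weave-net DaemonSet (a trimmed sketch, not the exact snippet from this cluster; field values may differ between versions, so check your own manifest). kube-proxy needs an equivalent hostPath mount of the same file so that both cooperate on the iptables lock:

```yaml
# Trimmed sketch of the xtables.lock hostPath mount as shipped in the
# upstream weave-net DaemonSet; kube-proxy should mount the same host file.
containers:
  - name: weave
    volumeMounts:
      - name: xtables-lock
        mountPath: /run/xtables.lock
        readOnly: false
volumes:
  - name: xtables-lock
    hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
```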
The mounts for … We had a lot of configuration issues, which resulted in an unstable cluster, so I think our problems differ quite a lot. Here's what I changed in the configuration:
Next, there seems to be a bug in cgroup handling when a lot of containers are started or recreated. In that case we had to change some kernel parameters. There have been no crashes on our clusters since these changes. I would say the fix works for me, but I can't say it is a universal solution for all problems; I'm unable to figure out all the side effects of these changes. I am sorry that I cannot help you further.
What you expected to happen?
We expect communication between pods on different Kubernetes nodes not to break.
What happened?
Symptoms: pods on one Kubernetes worker node stop being able to communicate with pods on other worker nodes; all other worker nodes remain fine. To work around the problem, we delete the weave pod on the affected worker node. Once the new pod is up, everything returns to normal.
After a while (anywhere between 24 and 96 hours) the problem happens again. Sometimes with the same worker node, sometimes with a different worker node.
When looking at the connections, some or all connections are using sleeve instead of fastdp.
How to reproduce it?
It is currently happening about once every 48 hours for us. We do not yet have a way to trigger the problem.
To try to trigger it, we disconnected the network on one of the worker nodes for a few seconds, but that did not do anything.
Anything else we need to know?
Created by Kubespray 2.11.
This runs in VMs on three hypervisors on-prem.
In my eyes the symptoms of this issue resemble #3641 and #3773.
Versions:
Logs:
From this moment the communication problem started:
full logs of that worker node's weave pod are in https://gist.github.com/vernimmen-textkernel/110a8219a7ea33eeeea3997adf18bf6c