Overlay network stops working #3773
@oscar-martin thanks for reporting the issue, and for the great troubleshooting that was done to figure out the root cause.
Agreed. The goroutine handling the misses should be running, or be recreated if it exits.
We initially thought that. Then we realized that, although it did fall back to …. And yes, the …. As a separate note, we also saw that ….
We experience the same issues in all our clusters, including production, but I wasn't able to debug the issue to this depth.
We have lately been suffering from this more often (two or three times per day), so we applied a workaround to the weave DaemonSet by adding a livenessProbe.
We know that when all of a peer's connections change to sleeve, that node is hit by this issue. We consider that reaching 15 sleeve connections means weave is misbehaving locally, so we need to recover networking asap (that's the reason we do not wait until all connections are using sleeve).
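For reference, a minimal sketch of the kind of check described above, as a script an exec livenessProbe could run inside the weave container. The threshold of 15 and the idea of counting sleeve connections come from this comment; the exact script, the local status endpoint on 127.0.0.1:6784 and the probe wiring are assumptions:

```
#!/bin/sh
# Hypothetical liveness check: fail once too many connections have fallen
# back to sleeve, which this thread associates with the stuck netlink reader.
# Assumes the router's local HTTP status API serves the connection list at
# http://127.0.0.1:6784/status/connections (use wget -qO- if curl is absent).
THRESHOLD=15
SLEEVE=$(curl -s http://127.0.0.1:6784/status/connections | grep -c sleeve)
if [ "$SLEEVE" -ge "$THRESHOLD" ]; then
  echo "weave: $SLEEVE connections in sleeve mode, assuming fastdp is stuck" >&2
  exit 1
fi
exit 0
```

Wired into the DaemonSet as an exec livenessProbe, a generous periodSeconds/failureThreshold keeps a short sleeve fallback during peer restarts from killing the pod unnecessarily.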
Is there any progress?
Since last week we have also been noticing this issue in our production environment. We haven't implemented the livenessProbe yet, but we did a manual restart of all the weave pods this morning and are now validating whether it is back up and running. Could somebody share the latest on this?
@timboven are you sure you have the exact same symptoms?
If anyone here has a message in the logs beginning …
@oscar-martin that should lead to https://github.com/weaveworks/weave/blob/v2.6.0/router/fastdp.go#L1104 which will exit the program. So unless I'm missing something, this cannot be the problem. Please can you attach your logs.
@bboreham You're right; I overlooked that. I will provide logs as soon as I cleanse them. Is it enough to provide the logs from one of the nodes?
@oscar-martin sure, one node that showed the "netlink not being read" symptom should be enough. If it's easier you can email logs to support at weave dot works; reference this issue.
Hi, I think we are facing the same issue with our setup. On the problematic node we ran:
$ cat /proc/net/netlink | grep "326287820\|Drop"
$ top -p 2892
$ ls -l /proc/2892/fd | grep 326287820
$ ss -f netlink | grep genl
The Drops count keeps increasing on the problematic node:
$ cat /proc/net/netlink | grep "326287820\|Drop"
After upgrading from 2.6.0 to 2.6.5 this issue was not observed for around 2 months, but it has started appearing again. We have already checked weave status, which is ready:
PeerDiscovery: enabled
DefaultSubnet: 10.32.0.0/12
Note: the 2 failed connections are because two worker nodes are not reachable.
If I do "ping 10.32.0.1 -I weave" from a non-working node, it doesn't work (10.32.0.1 is the IP address of the weave interface on the master node; from a working node, this works fine):
$ ping 10.45.0.1 -I weave
We have deployed weave-scope in our cluster, by which we spot the problematic nodes (highlighted as Unmanaged).
sysctl --system
After restarting the weave pod on that particular node, the problem goes away. We also have a production setup with a smaller number of nodes (3 master, 6 worker) where we haven't faced this issue, so I think this problem shows up more often when your cluster has a high number of nodes. @oscar-martin: Very nice finding sir.
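A quick way to confirm the "keeps increasing" part is to sample the counter twice; the inode 326287820 is specific to this node and has to be looked up first (e.g. via ls -l /proc/<weave-pid>/fd):

```
# Sample the netlink socket line twice, 10 seconds apart; a growing value in
# the Drops column means the kernel is still discarding upcalls to user space.
grep 326287820 /proc/net/netlink; sleep 10; grep 326287820 /proc/net/netlink
```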
We are observing this in rather small clusters with a dynamic number of nodes, so for our case, I adjusted @oscar-martin's liveness probe. Key differences of the modified version:
The connectivity issue happens rarely for us, so I have yet to see a restart due to failure of this probe.
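Purely as an illustration (not necessarily the adjustment made here), a probe like the one above can be made independent of cluster size by comparing against the total number of connections instead of a fixed count:

```
# Illustrative only: fail when more than half of all connections are in
# sleeve mode, so the threshold scales with the number of peers.
# Assumes the same local status endpoint as the sketch earlier in the thread.
CONNS=$(curl -s http://127.0.0.1:6784/status/connections)
TOTAL=$(echo "$CONNS" | grep -c .)
SLEEVE=$(echo "$CONNS" | grep -c sleeve)
[ "$TOTAL" -gt 0 ] && [ $((SLEEVE * 2)) -gt "$TOTAL" ] && exit 1
exit 0
```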
We have been using weave v2.6.0 as the CNI in our Kubernetes environment for some months, in a cluster with more than 30 VMs. From time to time (approx. once per week), containers in a VM stop being able to connect to containers in different VMs, but they can still connect to containers running within the same VM.
The problems seen are Destination Host Unreachable, Unknown Host Exception and related networking issues. Looking at weave status connections for this VM, it says it is using sleeve mode with all the rest of the peers.
The only way to make it work again was to restart the weave pod in that VM.
Troubleshooting
We spent a couple of days looking into it to find out the reason and this is what we found.
A netlink "socket" connected to the weave process was dropping all packets:
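A sketch of that check; 70732 is the socket inode seen in this report, and the Drops column of /proc/net/netlink is the per-socket drop counter:

```
# Show the header line plus the netlink socket that belongs to the weave
# process; Rmem is the receive queue size in bytes and Drops counts packets
# the kernel could not hand to user space.
grep -e Drops -e 70732 /proc/net/netlink
```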
Then, we looked for inode 70732 in the weave process:
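The mapping from socket inode to file descriptor can be done from the router's fd table (the PID is a placeholder and has to be looked up on the node, e.g. with top or pidof):

```
# Open sockets show up as "socket:[<inode>]" symlinks under /proc/<pid>/fd;
# look for the inode of the dropping netlink socket.
ls -l /proc/<weave-pid>/fd | grep 70732
```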
So file descriptor 8 was being used in user space to read/write to it.
Additionally, we listed the open netlink sockets:
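The same queue can be seen from user space with ss (part of iproute2); the Recv-Q column should match the Rmem value from /proc/net/netlink when the queue is full:

```
# List netlink sockets and keep the generic-netlink (genl) ones used by the
# ODP datapath; Recv-Q shows the bytes waiting to be read by weave.
ss -f netlink | grep genl
```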
And we found that the Rmem and Recv-Q values match (214440).
So everything pointed to the read queue being full, with new packets being dropped. The question then was: what is being sent over that socket?
We forced a dump of the goroutine stack traces for weave and looked for goroutines that were reading from fd 8 (0x8). We found none. We tried the same on another VM where weave was working fine and got this nice stack trace:
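One generic way to force such a dump from a Go process, assumed here rather than taken from this report: send it SIGQUIT, which makes the Go runtime print every goroutine stack before exiting (acceptable on an affected node, since the pod has to be restarted anyway):

```
# "weaver" is assumed to be the router process name inside the weave
# container of the weave-net pod; adjust the pod/container names as needed.
kubectl -n kube-system exec <weave-pod> -c weave -- sh -c 'kill -QUIT $(pidof weaver)'
# The stack dump ends up in the terminated container's log.
kubectl -n kube-system logs <weave-pod> -c weave --previous
```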
So odp.DatapathHandle.ConsumeMisses was not executing in our problematic VM. We dived into the code and understood that when the datapath is not able to find a Flow for a packet, it sends that (missed) packet to user space to handle it (so a new Flow can be created for such packets), and that the goroutine in charge of receiving those packets is created here: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/packet.go#L74. So it seems that without that goroutine reading missed packets, weave is not able to create new Flows in the datapath, and packets are "lost" in the openvswitch module.
EDIT
This is not true (as @bboreham pointed out here)
Finally, we saw a reason why this goroutine could end: https://github.com/weaveworks/weave/blob/v2.6.0/vendor/github.com/weaveworks/go-odp/odp/netlink.go#L693-L696
So if an error comes out of NetlinkSocket.Receive, the goroutine ends and nothing reads from that socket any more, leaving weave unable to process "upcalls" from openvswitch.
/EDIT
What you expected to happen?
The aforementioned goroutine should somehow not terminate, or a new one should be spawned to read from this netlink socket again.
What happened?
This is already described above.
How to reproduce it?
We have not found ways to reproduce it. It just happens. We do not know how to "force" netlink issues that can cause the goroutine to end.
Additional information
We also tried restarting another peer to see how the new connection is handled. Watching weave status connections, we saw it starts with fastdp but, after the heartbeat failed (exactly 1 minute), it fell back to sleeve.
It is using bridged_fastdp as the bridge type, and the communication between peers is unencrypted.
Versions: