Investigate martian packets #3327
I am seeing this issue as well.
@taemon1337 Do you observe any packet loss?
We see lots of martian packet warnings from the kernel and retransmits with iperf. We are running bare metal on CentOS 7. Our average throughput using iperf on each node pair is < 1 Gbps, whereas localhost is 10 Gbps.
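For reference, a minimal sketch of the kind of node-pair measurement described above. This assumes iperf3 and placeholder node addresses (the thread just says "iperf"; iperf2 flags differ slightly):

```sh
# On one node, start the server:
iperf3 -s

# On a second node, run a 30-second test against the first node's address
# (10.0.0.11 is a placeholder):
iperf3 -c 10.0.0.11 -t 30

# Loopback baseline on a single node (with iperf3 -s running locally):
iperf3 -c 127.0.0.1 -t 30
```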
Could you paste martian packet logs from
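In case it helps others gather the same data, a sketch for surfacing these warnings in the kernel log. These are the standard Linux sysctls, not anything Weave-specific:

```sh
# Make sure martian packets are logged at all (often off by default):
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.default.log_martians=1

# Then pull the warnings out of the kernel log:
dmesg -T | grep -i martian
journalctl -k | grep -i martian
```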
I can't actually post them directly, but here is the text line: One of our engineers decoded them and said they were ARP packets.
We are seeing the same on our new setup with a mix of bare metal and VM nodes.
Can it be due to two NICs on bare metal?
@paphillon I don't see how two NICs could cause it. What is your Weave Net IP address range? 172.16.0.0/16?
@brb Yes, that is the Weave Net IP address range. What we have noticed is that the martian source warnings flood in specifically when one or more nodes are unhealthy due to network, CPU, or infra-related issues. The flood then seems to slow down other nodes as well, possibly due to the extra network traffic.
We hit this issue again this morning and it caused an outage.
kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath
The source is CoreDNS, and to resolve the problem we have to restart that particular CoreDNS pod. Not sure how it is linked to Weave Net, but interestingly we ONLY see this issue in our non-prod and prod environments, while in dev we have never seen it. The only difference is that in dev all VMs have one NIC and Weave Net uses the 172.200.0.0/24 IP range, while the non-prod and prod environments have a mix of VMs and bare metal with two NICs and use the 172.16.0.0/16 IP range for Weave.
Not a lead on a resolution, but a couple of thoughts. It is interesting to note that it is the same pattern in all three reported cases, i.e. the packet is received on
Since the packets are not non-routable IPs, it is a case of receiving an unexpected IP on the interface. For example, in the case of
the networking stack is not expecting packets with IP
Or it is a case of reverse traffic going out on a different interface than the one on which the packet arrived. If you observe martian packets again, can you please make a note of the nodes on which the source and destination pods are running, the routes on the nodes, and
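Not from the thread itself, but a sketch of the kind of information being asked for here. The kubectl filter uses a pod IP already quoted above; the sysctls are the standard Linux reverse-path knobs:

```sh
# On the node that logs the martian warnings: routes and reverse-path filtering
ip route show
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter

# Map the source/destination addresses from the warning back to pods and nodes
# (172.16.16.22 is the CoreDNS pod IP mentioned earlier):
kubectl get pods --all-namespaces -o wide | grep 172.16.16.22
kubectl get nodes -o wide
```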
@murali-reddy - I do have some logs captured from that event, except the weave report, if that may help. Quick question: we are planning to have CoreDNS listen on the host network instead of an overlay IP, as the offending source address has always been associated with a CoreDNS pod IP. Do you think that may help? Our dev cluster has cluster CIDR 172.200.0.0/24, while stage and prod are 172.16.0.0/16. We have never encountered this issue in dev, while everything else remains the same. Do you think this can be a contributing factor? We don't have a test case yet to reproduce this issue, and by itself it's a rare event. Unfortunately, that means waiting until it happens again and causes an outage, so I am trying to stay ahead of it if possible. 172.16.16.22 & 172.16.184.4 are the CoreDNS pod IPs.
Weave Net logs from the host where the martian errors were seen:
Weave report - Note this was taken today. One difference compared to the dev env is the IP range: in dev's allocated IP ranges there is only one instance where the range ends with .0, while stage and prod, as shown below, have several such instances. Not sure if this is expected due to the CIDR range.
Thanks for sharing the logs.
I don't see any reason why this problem would be specific to CoreDNS. It might happen to any pod-to-pod communication over the overlay network.
Contrary to your observation, 172.200.0.0/24 is not an RFC1918 private IP address range (172.16.0.0/16 falls within the RFC1918 block 172.16.0.0/12, but 172.200.0.0/24 does not), so that is the range I would expect might cause a problem. As far as I have seen, martian packets are typically the result of routing misconfigurations. Weave Net does very little routing configuration as it deals with L2 switching, so it's hard to guess what the contributing factor could be. Even harder to reproduce, unfortunately.
Good point. Yes, you are right, this was our first dev cluster :)
Agreed. However, whenever we saw this issue the offending pods were CoreDNS, and on the 19th, the last time we got hit by the issue, stage as well as prod exhibited the same problem and the pods were CoreDNS, so I think there might be some correlation.
Any other detail that might help to troubleshoot this?
@paphillon in the middle of your logs is a list of IPs which starts with the gateway address on the bridge, but then includes some peer IPs. I am puzzled how this comes about. To check, could you run ip addr show on that host?
@bboreham Thanks! Yes, that jumped out at me too, but I didn't have enough information to say whether that is expected or not. Here is the output of the ip addr show command, as you requested, for the host:
For another worker host, the output looks similar.
Do you have that symptom?
@bboreham - I am not sure if we have the same issue; is there a command I can run to find out if the same IPs are assigned to a pod/node? During our setup we did remove and re-add a couple of nodes. To reproduce the martian error we tried blocking all incoming and outgoing traffic on the node running CoreDNS, and by doing so we saw the martian errors being logged. However, when the traffic was allowed again the errors stopped, so it was a partial success in reproducing the error. For now we have moved CoreDNS to the host network, repeated the above test, and did not notice any martian errors in the logs.
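For anyone trying the same reproduction, a rough sketch of the "block all traffic on the CoreDNS node" test described above. This is an assumption about how the blocking was done (the comment doesn't say); it requires console or out-of-band access, since it also cuts SSH:

```sh
# On the node running CoreDNS: drop everything in and out (disruptive!)
iptables -I INPUT 1 -j DROP
iptables -I OUTPUT 1 -j DROP

# On the other nodes, watch for new martian warnings while the node is isolated:
dmesg -w | grep -i martian

# Remove the rules again to restore traffic:
iptables -D INPUT -j DROP
iptables -D OUTPUT -j DROP
```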
You can get pod IPs via
I can’t think of a way to get the node gateway addresses short of visiting each node and running
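A sketch of one way to gather both; the kubectl invocation, interface name, and node names are assumptions for illustration, not taken from the comment above:

```sh
# Pod IPs across the cluster:
kubectl get pods --all-namespaces -o wide

# Gateway address on the weave bridge of each node (run per node, e.g. over ssh):
for node in node-01 node-02 node-03; do
  echo "== $node =="
  ssh "$node" ip -4 addr show weave
done
```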
Pod IPs. Did not find a node/pod IP conflict though.
Node IPs (dev weave)
I do see them contributing to the martian errors.
@bboreham - We had a network outage today for roughly 10 minutes where some nodes were not able to communicate with each other, or were really slow, and the result was the same: martian errors from pods running on two nodes with weave addresses. The only way to solve it was to drain those nodes and reboot.
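For completeness, a sketch of the drain-and-reboot recovery mentioned above. The node name is a placeholder, and --delete-emptydir-data is the newer spelling of the flag (older kubectl uses --delete-local-data):

```sh
# Evict workloads from the affected node, reboot it, then let it take pods again
kubectl drain node-07 --ignore-daemonsets --delete-emptydir-data
ssh node-07 sudo reboot
# ...once the node is back up:
kubectl uncordon node-07
```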
Same issue here. We have a node that cannot communicate with pods on other nodes, and TCP connections also fail with
This issue happens 30-60+ minutes after host boot, and it's unpredictable. The only way to solve it is a reboot. Weave Net version:
Lots of martian logs:
Full
I'm dealing with this issue as well. We have Kubernetes clusters on AWS, Azure, and GCP and we're only seeing the martian source errors on GCP. It's also only these clusters that occasionally have network troubles on a single node that require restarting weave to fix. The logs are similar to previous posts.
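Not from the comment above, but a sketch of the kind of per-node restart being described, assuming the standard weave-net DaemonSet in kube-system (the pod name is a placeholder; the DaemonSet recreates the pod automatically):

```sh
# Find the weave-net pod on the affected node, then delete it so it restarts:
kubectl -n kube-system get pods -o wide | grep weave-net
kubectl -n kube-system delete pod weave-net-xxxxx
```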
From time to time, in kernel logs we see:
Some execution paths in the kernel (e.g. https://github.com/torvalds/linux/blob/v4.17/net/ipv4/route.c#L1699) suggest that a martian packet can be dropped after the kernel has logged it.
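A small sketch of the sysctls that govern this behaviour; these are standard Linux knobs, not Weave-specific:

```sh
# rp_filter controls reverse-path validation (0 = off, 1 = strict, 2 = loose);
# the effective value per interface is the max of "all" and the interface value.
sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.default.rp_filter

# log_martians controls whether the kernel logs the packets it rejects this way.
sysctl net.ipv4.conf.all.log_martians
```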
Investigate: