This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Investigate martian packets #3327

Open
brb opened this issue Jun 13, 2018 · 25 comments

@brb
Contributor

brb commented Jun 13, 2018

From time to time, we see the following in the kernel logs:

[ 2717.970445] IPv4: martian source 10.32.0.5 from 10.32.0.2, on dev datapath
[ 2717.970446] ll header: 00000000: ff ff ff ff ff ff 06 69 66 10 db 3f 08 06

Some execution paths in the kernel (e.g. https://github.com/torvalds/linux/blob/v4.17/net/ipv4/route.c#L1699) suggest that a martian packet can be dropped after the kernel has logged it.

Investigate:

  1. Why the kernel logs martian packets only sometimes (see the sysctl sketch after this list).
  2. What the impact of martian packets is on Weave Net stability.
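A possible handle on (1): the kernel only prints "martian source" lines when martian logging is enabled on the receiving interface, and even then the message goes through the kernel's printk rate limiter, so bursts get suppressed. A minimal sysctl sketch for a debugging session - the interface names are assumptions, and these are investigation settings, not recommended defaults:

# Log every packet that fails the source-validity / reverse-path check
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.default.log_martians=1

# Reverse-path filter mode on the interfaces of interest (0 = off, 1 = strict, 2 = loose);
# strict mode is what drops packets whose source fails the reverse-path check
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.datapath.rp_filter net.ipv4.conf.weave.rp_filter
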
@taemon1337

I am seeing this issue as well

@brb
Contributor Author

brb commented Jul 17, 2018

@taemon1337 Do you observe any packet loss?

@taemon1337

We see lots of martian packets from the kernel and retransmits with iperf. We are running bare metal on CentOS 7. Our average throughput using iperf on each node pair is < 1 Gbps, whereas localhost gives 10 Gbps.
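(For context, a node-pair measurement like that is typically taken as sketched below; iperf on both hosts and the placeholder address are assumptions:)

# On node A
iperf -s

# On node B: 30-second TCP test against node A's host IP
iperf -c <node-A-host-ip> -t 30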

@brb
Contributor Author

brb commented Jul 17, 2018

Could you paste martian packet logs from dmesg?

@taemon1337

I can't actually post them directly, but here is the text of the line:
IPv4: martian source 10.44.0.2 from 10.40.0.0, on dev datapath

One of our engineers decoded them and said they were ARP packets
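(That matches the ll header in the original report: the first six bytes ff ff ff ff ff ff are the broadcast destination MAC, the next six are the sender's MAC, and the trailing 08 06 is the ARP EtherType. A minimal way to watch these frames live, assuming tcpdump is available on the host:)

# Capture ARP frames on the weave datapath device, printing link-level headers
tcpdump -i datapath -e -nn arp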

@paphillon

We are seeing the same on our new setup with a mix of bare metal and VM nodes

Aug  7 18:10:15 node655 kernel: IPv4: martian source 172.16.64.2 from 172.16.248.0, on dev datapath
Aug  7 18:10:15 node655 kernel: ll header: 00000000: ff ff ff ff ff ff be 9f b5 37 a7 72 08 06        .........7.r..
Aug  7 18:10:16 node655 kernel: IPv4: martian source 172.16.64.2 from 172.16.248.0, on dev datapath

@paphillon

Could it be due to two NICs on bare metal?

@brb
Contributor Author

brb commented Sep 21, 2018

@paphillon I don't see how two NICs could cause it. What is your Weave Net IP address range? 172.16.0.0/16?

@paphillon

@brb Yes, that is the Weave Net IP address range. We have noticed that the martian source warnings flood specifically when one or more nodes are unhealthy due to network, CPU or infrastructure issues. This flood then seems to slow down other nodes as well, possibly due to the extra network traffic.

@paphillon

We hit this issue again this morning and it caused an outage.
Around 2:45 am PT there was a network outage for 4 minutes. Services recovered quickly as the delay in DNS lookups was intermittent; on further debugging we saw the errors below being logged almost continuously in the OS logs.

kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath
kernel: ll header: 00000000: ff ff ff ff ff ff 22 83 2d 0c 5a dc 08 06 ......".-.Z...

The source is coreDNS, and to resolve the problem we have to restart that particular coreDNS pod. I'm not sure how it is linked to Weave Net, but interestingly we ONLY see this issue in our non-prod and prod environments, while in dev we have never seen it.

The only difference in dev is that all VMs have one NIC and use the 172.200.0.0/24 IP range for Weave Net, while the non-prod and prod environments have a mix of VM and bare-metal nodes with 2 NICs and use the 172.16.0.0/16 IP range for Weave.

@paphillon

@brb / @bboreham - Any leads on resolving this issue? These are ARP packets, and searching coreDNS-related issues did not turn up any leads.

@murali-reddy
Contributor

Not a lead on a resolution, but a couple of thoughts.

It is interesting that it's the same pattern in all three reported cases, i.e. the packet is received on the datapath device and it's ARP traffic.

Since these are not non-routable IPs, it's a case of receiving a source IP that is not expected on the interface. For example, in the case of

kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath

the networking stack is not expecting packets with source IP 172.16.72.0 on the datapath device.

Or it's a case of return traffic going out through a different interface than the one on which the packet arrived.

If you observe martian packets again, can you please make a note of the nodes on which the source and destination pods are running, the routes on those nodes, and the weave report from both nodes?
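(A sketch of what to capture on each of the two nodes next time it happens; 127.0.0.1:6784 is the weave router's local HTTP control address, and the /report path is an assumption based on the standard weave HTTP API:)

# Routing table and the weave bridge address on this node
ip route
ip -4 addr show dev weave

# Weave report from the router's local HTTP control endpoint
curl -s http://127.0.0.1:6784/report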

@paphillon

paphillon commented Mar 27, 2019

@murali-reddy - I do have some logs captured from that event (except the weave report), if that may help.

Quick question: we are planning to have coreDNS listen on the host network instead of an overlay IP, as the offending source address has always been associated with a coreDNS pod IP. Do you think that may help?

Our dev cluster has cluster CIDR 172.200.0.0/24, while stage and prod use 172.16.0.0/16. We have never encountered this issue in dev, while everything else remains the same. Do you think this could be a contributing factor?

We don't have a test case yet to reproduce this issue, and by itself it's a rare event. Unfortunately, that means waiting until it happens again and causes an outage, so I am trying to stay ahead of it if possible.

172.16.16.22 and 172.16.184.4 are coreDNS pod IPs:

Mar 19 20:45:22 or1dra658 kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath
Mar 19 20:45:22 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff 22 83 2d 0c 5a dc 08 06        ......".-.Z...
Mar 19 20:45:22 or1dra658 kernel: IPv4: martian source 172.16.184.4 from 172.16.216.0, on dev datapath
Mar 19 20:45:22 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff ea a4 90 36 8c e4 08 06        .........6....
Mar 19 20:45:23 or1dra658 kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath
Mar 19 20:45:23 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff 22 83 2d 0c 5a dc 08 06        ......".-.Z...
Mar 19 20:45:23 or1dra658 kernel: IPv4: martian source 172.16.184.4 from 172.16.216.0, on dev datapath
Mar 19 20:45:23 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff ea a4 90 36 8c e4 08 06        .........6....
Mar 19 20:45:26 or1dra658 kernel: IPv4: martian source 172.16.16.22 from 172.16.72.0, on dev datapath
Mar 19 20:45:26 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff 22 83 2d 0c 5a dc 08 06        ......".-.Z...
Mar 19 20:45:26 or1dra658 kernel: IPv4: martian source 172.16.184.4 from 172.16.216.0, on dev datapath
Mar 19 20:45:26 or1dra658 kernel: ll header: 00000000: ff ff ff ff ff ff ea a4 90 36 8c e4 08 06        .........6....
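(Bytes 7-12 of each ll header above are the sender's MAC; matching them against the peer list in the weave logs below shows which peer's bridge emitted the broadcast. A quick lookup, assuming the weave container logs were saved to a file called weave.log - the file name is only an example:)

# Map the source MAC from a martian's ll header to a weave peer name
grep -i '22:83:2d:0c:5a:dc' weave.log
grep -i 'ea:a4:90:36:8c:e4' weave.log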

Weave Net logs from the host where the martian errors were seen:

DEBU: 2019/03/19 18:51:15.150194 [kube-peers] Checking peer "ea:a4:90:36:8c:e4" against list &{[{7e:5f:8a:45:61:51 xy1010050035011.corp.xy.com} {96:3b:8e:25:64:6a xy1010050035012.corp.xy.com} {46:c1:59:91:d0:ea xy1010050035007.corp.xy.com} {e6:18:fc:c6:c0:c0 xy1010050035010.corp.xy.com} {12:f2:33:54:1b:a6 xy1010050035009.corp.xy.com} {66:1a:53:fc:8e:ea xy1010050035014.corp.xy.com} {be:9f:b5:37:a7:72 xy1dra655.corp.xy.com} {92:8b:90:4e:79:09 xy1dra656.corp.xy.com} {b2:b8:10:72:3f:74 xy1dra657.corp.xy.com} {ea:a4:90:36:8c:e4 xy1dra658.corp.xy.com} {2e:3e:0c:1a:b3:61 xy1010050034200.corp.xy.com} {22:83:2d:0c:5a:dc xy1010050034204.corp.xy.com} {aa:de:18:e8:93:8f xy1010050034205.corp.xy.com} {a6:9a:26:c8:46:c6 xy1010050035008.corp.xy.com} {12:30:75:29:cd:92 xy1010050035019.corp.xy.com}]}
INFO: 2019/03/19 18:51:18.731217 Command line options: map[no-dns:true port:6783 host-root:/host http-addr:127.0.0.1:6784 ipalloc-init:consensus=15 metrics-addr:0.0.0.0:6782 db-prefix:/weavedb/weave-net name:ea:a4:90:36:8c:e4 conn-limit:40 expect-npc:true mtu:1337 datapath:datapath docker-api: ipalloc-range:172.16.0.0/16 nickname:xy1dra658.corp.xy.com]
INFO: 2019/03/19 18:51:18.731344 weave  2.5.1
INFO: 2019/03/19 18:51:20.125085 Re-exposing 172.16.216.0/16 on bridge "weave"
INFO: 2019/03/19 18:51:20.325415 Bridge type is bridged_fastdp
INFO: 2019/03/19 18:51:20.325451 Communication between peers is unencrypted.
INFO: 2019/03/19 18:51:20.537983 Our name is ea:a4:90:36:8c:e4(xy1dra658.corp.xy.com)
INFO: 2019/03/19 18:51:20.538062 Launch detected - using supplied peer list: [10.50.34.200 10.50.34.204 10.50.34.205 10.50.35.7 10.50.35.8 10.50.35.9 10.50.35.10 10.50.35.11 10.50.35.12 10.50.35.14 10.50.35.19 10.50.34.196 10.50.34.197 10.50.34.198 10.50.34.199]
INFO: 2019/03/19 18:51:20.628649 Checking for pre-existing addresses on weave bridge
INFO: 2019/03/19 18:51:20.629132 weave bridge has address 172.16.216.0/16
INFO: 2019/03/19 18:51:22.024348 Found address 172.16.88.8/16 for ID _
INFO: 2019/03/19 18:51:22.025330 Found address 172.16.88.8/16 for ID _
INFO: 2019/03/19 18:51:22.026233 Found address 172.16.88.12/16 for ID _
INFO: 2019/03/19 18:51:22.026686 Found address 172.16.88.12/16 for ID _
INFO: 2019/03/19 18:51:22.028593 Found address 172.16.88.12/16 for ID _
INFO: 2019/03/19 18:51:22.127137 Found address 172.16.88.10/16 for ID _
INFO: 2019/03/19 18:51:22.127606 Found address 172.16.88.10/16 for ID _
INFO: 2019/03/19 18:51:22.128069 Found address 172.16.88.10/16 for ID _
INFO: 2019/03/19 18:51:22.129259 [allocator ea:a4:90:36:8c:e4] Initialising with persisted data
INFO: 2019/03/19 18:51:22.129470 Sniffing traffic on datapath (via ODP)
INFO: 2019/03/19 18:51:22.130327 ->[10.50.34.199:6783] attempting connection
INFO: 2019/03/19 18:51:22.130389 ->[10.50.34.197:6783] attempting connection
INFO: 2019/03/19 18:51:22.130513 ->[10.50.34.205:6783] attempting connection
INFO: 2019/03/19 18:51:22.130693 ->[10.50.35.14:6783] attempting connection
INFO: 2019/03/19 18:51:22.130876 ->[10.50.35.19:6783] attempting connection
INFO: 2019/03/19 18:51:22.130978 ->[10.50.34.200:6783] attempting connection
INFO: 2019/03/19 18:51:22.131138 ->[10.50.35.9:6783] attempting connection
INFO: 2019/03/19 18:51:22.131277 ->[10.50.35.10:6783] attempting connection
INFO: 2019/03/19 18:51:22.131421 ->[10.50.35.8:6783] attempting connection
INFO: 2019/03/19 18:51:22.131548 ->[10.50.35.7:6783] attempting connection
INFO: 2019/03/19 18:51:22.131649 ->[10.50.34.197:6783|92:8b:90:4e:79:09(xy1dra656.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.131754 ->[10.50.34.204:6783] attempting connection
INFO: 2019/03/19 18:51:22.131833 overlay_switch ->[92:8b:90:4e:79:09(xy1dra656.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.131899 ->[10.50.35.12:6783] attempting connection
INFO: 2019/03/19 18:51:22.131938 ->[10.50.34.205:6783|aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.132057 overlay_switch ->[aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.132075 ->[10.50.34.198:6783] attempting connection
INFO: 2019/03/19 18:51:22.132233 ->[10.50.35.11:6783] attempting connection
INFO: 2019/03/19 18:51:22.132354 ->[10.50.34.196:6783] attempting connection
INFO: 2019/03/19 18:51:22.132521 ->[10.50.34.199:54990] connection accepted
INFO: 2019/03/19 18:51:22.132712 ->[10.50.34.197:6783|92:8b:90:4e:79:09(xy1dra656.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:22.133359 ->[10.50.34.205:6783|aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:22.223552 ->[10.50.34.200:6783|2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.224675 overlay_switch ->[2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.224729 ->[10.50.34.200:6783|2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:22.623817 ->[10.50.35.10:6783|e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.724729 overlay_switch ->[e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.724900 ->[10.50.35.10:6783|e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:22.825018 ->[10.50.35.19:6783|12:30:75:29:cd:92(xy1010050035019.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.843799 ->[10.50.34.198:6783|b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.923550 ->[10.50.35.8:6783|a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.923660 overlay_switch ->[b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.923721 overlay_switch ->[12:30:75:29:cd:92(xy1010050035019.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.923875 ->[10.50.34.198:6783|b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:22.923990 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2019/03/19 18:51:22.924229 ->[10.50.34.199:6783|ea:a4:90:36:8c:e4(xy1dra658.corp.xy.com)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/03/19 18:51:22.924655 ->[10.50.35.12:6783|96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:22.924849 overlay_switch ->[96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:22.924960 ->[10.50.35.19:6783|12:30:75:29:cd:92(xy1010050035019.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.023412 ->[10.50.35.12:6783|96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.023497 overlay_switch ->[a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.023781 ->[10.50.35.14:6783|66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.024486 ->[10.50.34.196:6783|be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.024668 ->[10.50.34.199:54990|ea:a4:90:36:8c:e4(xy1dra658.corp.xy.com)]: connection shutting down due to error: cannot connect to ourself
INFO: 2019/03/19 18:51:23.024857 ->[10.50.35.8:6783|a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.123244 Listening for metrics requests on 0.0.0.0:6782
INFO: 2019/03/19 18:51:23.123387 ->[10.50.35.7:6783|46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.124046 overlay_switch ->[66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.223388 overlay_switch ->[be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.223541 overlay_switch ->[46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.224009 ->[10.50.35.11:6783|7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.224189 overlay_switch ->[7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.224361 ->[10.50.35.14:6783|66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.224645 ->[10.50.34.204:6783|22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.224786 overlay_switch ->[22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.323225 ->[10.50.35.9:6783|12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)]: connection ready; using protocol version 2
INFO: 2019/03/19 18:51:23.323312 overlay_switch ->[12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.323388 ->[10.50.34.196:6783|be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.323601 ->[10.50.35.7:6783|46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.323803 ->[10.50.35.11:6783|7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.324028 ->[10.50.34.204:6783|22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.324263 ->[10.50.35.9:6783|12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)]: connection added (new peer)
INFO: 2019/03/19 18:51:23.626959 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:23.627063 overlay_switch ->[92:8b:90:4e:79:09(xy1dra656.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:23.627142 ->[10.50.34.197:6783|92:8b:90:4e:79:09(xy1dra656.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.637323 overlay_switch ->[92:8b:90:4e:79:09(xy1dra656.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.723344 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=3.10.0-957.1.3.el7.x86_64&flag_kubernetes-cluster-size=15&flag_kubernetes-cluster-uid=1ab5e380-7590-11e8-a373-0050568482ec&flag_kubernetes-version=v1.12.3&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&flag_network=fastdp&os=linux&signature=oshZeAJcD3hsHmfEuqAQ3CbnZbUniZGyv9wSo5g8U%2BE%3D&version=2.5.1: read tcp 10.50.34.199:56210->216.58.195.83:443: read: connection reset by peer
INFO: 2019/03/19 18:51:23.725645 ->[10.50.34.205:6783|aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.725964 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:23.726282 sleeve ->[10.50.34.205:6783|aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:23.727902 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:23.728003 overlay_switch ->[2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:23.728035 ->[10.50.34.200:6783|2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.823447 overlay_switch ->[2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:23.824210 overlay_switch ->[e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:23.824418 ->[10.50.35.10:6783|e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.824508 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:23.825641 ->[10.50.35.19:6783|12:30:75:29:cd:92(xy1010050035019.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.826208 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:23.923599 ->[10.50.35.12:6783|96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:23.923974 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.024527 sleeve ->[10.50.34.197:6783|92:8b:90:4e:79:09(xy1dra656.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.027239 ->[10.50.35.8:6783|a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.027604 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.028360 overlay_switch ->[b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:24.029067 sleeve ->[10.50.34.200:6783|2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.123604 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.123613 ->[10.50.34.198:6783|b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.123996 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.124208 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.124916 sleeve ->[10.50.35.12:6783|96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.125764 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.125868 overlay_switch ->[22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:24.134335 overlay_switch ->[22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:24.134577 ->[10.50.35.9:6783|12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.134907 sleeve ->[10.50.35.8:6783|a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.223266 sleeve ->[10.50.35.10:6783|e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.223287 overlay_switch ->[e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:24.223648 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.223940 ->[10.50.34.196:6783|be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.224010 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.224134 overlay_switch ->[7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)] using sleeve
INFO: 2019/03/19 18:51:24.224192 overlay_switch ->[7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:24.225167 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2019/03/19 18:51:24.225791 overlay_switch ->[b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)] using fastdp
INFO: 2019/03/19 18:51:24.226528 sleeve ->[10.50.34.196:6783|be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.226905 sleeve ->[10.50.35.7:6783|46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.227270 sleeve ->[10.50.34.204:6783|22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.227564 sleeve ->[10.50.35.14:6783|66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.227851 sleeve ->[10.50.35.19:6783|12:30:75:29:cd:92(xy1010050035019.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.228260 sleeve ->[10.50.35.11:6783|7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.228554 sleeve ->[10.50.35.9:6783|12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.323339 sleeve ->[10.50.34.198:6783|b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)]: Effective MTU verified at 1438
INFO: 2019/03/19 18:51:24.323346 ->[10.50.35.7:6783|46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.323782 ->[10.50.34.204:6783|22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.423323 ->[10.50.35.14:6783|66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.423750 ->[10.50.35.11:6783|7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)]: connection fully established
INFO: 2019/03/19 18:51:24.611362 Discovered remote MAC aa:de:18:e8:93:8f at aa:de:18:e8:93:8f(xy1010050034205.corp.xy.com)
INFO: 2019/03/19 18:51:24.823393 [kube-peers] Added myself to peer list &{[{7e:5f:8a:45:61:51 xy1010050035011.corp.xy.com} {96:3b:8e:25:64:6a xy1010050035012.corp.xy.com} {46:c1:59:91:d0:ea xy1010050035007.corp.xy.com} {e6:18:fc:c6:c0:c0 xy1010050035010.corp.xy.com} {12:f2:33:54:1b:a6 xy1010050035009.corp.xy.com} {66:1a:53:fc:8e:ea xy1010050035014.corp.xy.com} {be:9f:b5:37:a7:72 xy1dra655.corp.xy.com} {92:8b:90:4e:79:09 xy1dra656.corp.xy.com} {b2:b8:10:72:3f:74 xy1dra657.corp.xy.com} {ea:a4:90:36:8c:e4 xy1dra658.corp.xy.com} {2e:3e:0c:1a:b3:61 xy1010050034200.corp.xy.com} {22:83:2d:0c:5a:dc xy1010050034204.corp.xy.com} {aa:de:18:e8:93:8f xy1010050034205.corp.xy.com} {a6:9a:26:c8:46:c6 xy1010050035008.corp.xy.com} {12:30:75:29:cd:92 xy1010050035019.corp.xy.com}]}
DEBU: 2019/03/19 18:51:24.929614 [kube-peers] Nodes that have disappeared: map[]
172.16.216.0
10.50.34.200
10.50.34.204
10.50.34.205
10.50.35.7
10.50.35.8
10.50.35.9
10.50.35.10
10.50.35.11
10.50.35.12
10.50.35.14
10.50.35.19
10.50.34.196
10.50.34.197
10.50.34.198
10.50.34.199
DEBU: 2019/03/19 18:51:26.131401 registering for updates for node delete events
INFO: 2019/03/19 18:51:31.844403 Discovered remote MAC 22:83:2d:0c:5a:dc at 22:83:2d:0c:5a:dc(xy1010050034204.corp.xy.com)
INFO: 2019/03/19 18:51:31.908327 Discovered remote MAC ba:99:46:b1:ed:f1 at be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)
INFO: 2019/03/19 18:51:40.832128 Discovered remote MAC d2:4e:52:b8:eb:fc at e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
INFO: 2019/03/19 18:56:50.980959 Discovered remote MAC 0a:16:28:80:b5:3d at 46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)
INFO: 2019/03/19 19:03:28.694054 Discovered remote MAC 12:30:75:29:cd:92 at 12:30:75:29:cd:92(xy1010050035019.corp.xy.com)
INFO: 2019/03/19 19:05:13.250396 Discovered remote MAC 86:d8:3f:2f:31:71 at a6:9a:26:c8:46:c6(xy1010050035008.corp.xy.com)
INFO: 2019/03/19 19:05:15.092638 Discovered remote MAC be:9f:b5:37:a7:72 at be:9f:b5:37:a7:72(xy1dra655.corp.xy.com)
INFO: 2019/03/19 19:05:15.605427 Discovered remote MAC 92:8b:90:4e:79:09 at 92:8b:90:4e:79:09(xy1dra656.corp.xy.com)
INFO: 2019/03/19 19:05:17.258019 Discovered remote MAC 2e:3e:0c:1a:b3:61 at 2e:3e:0c:1a:b3:61(xy1010050034200.corp.xy.com)
INFO: 2019/03/19 19:05:17.438399 Discovered remote MAC b2:b8:10:72:3f:74 at b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)
INFO: 2019/03/19 19:06:59.112015 Discovered remote MAC 7e:5f:8a:45:61:51 at 7e:5f:8a:45:61:51(xy1010050035011.corp.xy.com)
INFO: 2019/03/19 19:08:15.290517 Discovered remote MAC 66:1a:53:fc:8e:ea at 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
INFO: 2019/03/19 19:08:18.693464 Discovered remote MAC 96:3b:8e:25:64:6a at 96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)
INFO: 2019/03/19 19:08:33.505661 Discovered remote MAC 12:f2:33:54:1b:a6 at 12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)
INFO: 2019/03/19 19:09:52.660987 Discovered remote MAC 46:c1:59:91:d0:ea at 46:c1:59:91:d0:ea(xy1010050035007.corp.xy.com)
INFO: 2019/03/19 19:10:13.305601 Discovered remote MAC e6:18:fc:c6:c0:c0 at e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
ERRO: 2019/03/19 19:14:17.538342 Captured frame from MAC (96:3b:8e:25:64:6a) to (d2:4e:52:b8:eb:fc) associated with another peer 96:3b:8e:25:64:6a(xy1010050035012.corp.xy.com)
INFO: 2019/03/19 19:20:13.172074 Discovered remote MAC 66:1a:53:fc:8e:ea at 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
ERRO: 2019/03/19 19:20:13.172186 Captured frame from MAC (66:1a:53:fc:8e:ea) to (86:d8:3f:2f:31:71) associated with another peer 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
ERRO: 2019/03/19 19:21:20.539318 Captured frame from MAC (66:1a:53:fc:8e:ea) to (86:d8:3f:2f:31:71) associated with another peer 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
ERRO: 2019/03/19 19:26:20.539412 Captured frame from MAC (66:1a:53:fc:8e:ea) to (86:d8:3f:2f:31:71) associated with another peer 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
ERRO: 2019/03/19 19:32:10.908964 Captured frame from MAC (66:1a:53:fc:8e:ea) to (86:d8:3f:2f:31:71) associated with another peer 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
ERRO: 2019/03/19 19:36:20.537993 Captured frame from MAC (66:1a:53:fc:8e:ea) to (86:d8:3f:2f:31:71) associated with another peer 66:1a:53:fc:8e:ea(xy1010050035014.corp.xy.com)
INFO: 2019/03/19 20:03:15.928208 Discovered remote MAC 12:f2:33:54:1b:a6 at 12:f2:33:54:1b:a6(xy1010050035009.corp.xy.com)
INFO: 2019/03/19 20:10:02.127659 Discovered remote MAC e6:18:fc:c6:c0:c0 at e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
ERRO: 2019/03/19 20:10:02.127746 Captured frame from MAC (e6:18:fc:c6:c0:c0) to (86:d8:3f:2f:31:71) associated with another peer e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
ERRO: 2019/03/19 20:11:20.539166 Captured frame from MAC (e6:18:fc:c6:c0:c0) to (86:d8:3f:2f:31:71) associated with another peer e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
ERRO: 2019/03/19 20:16:20.539213 Captured frame from MAC (e6:18:fc:c6:c0:c0) to (86:d8:3f:2f:31:71) associated with another peer e6:18:fc:c6:c0:c0(xy1010050035010.corp.xy.com)
INFO: 2019/03/19 20:20:42.215391 Discovered remote MAC 12:30:75:29:cd:92 at 12:30:75:29:cd:92(xy1010050035019.corp.xy.com)
INFO: 2019/03/19 20:21:36.460277 Discovered remote MAC b2:b8:10:72:3f:74 at b2:b8:10:72:3f:74(xy1dra657.corp.xy.com)

Weave report - note this was taken today. One difference compared to the dev env is the IP range: in dev there is only one instance where the allocated range address ends with .0, while stage and prod, as shown below, have several such instances. Not sure if this is expected given the CIDR range.

IPRange|Size|Host
172.16.0.0|2|xy1010050035014.corp.xy.com
172.16.0.2|1|xy1010050035019.corp.xy.com
172.16.0.3|4093|xy1010050035014.corp.xy.com
172.16.16.0|4096|xy1010050035010.corp.xy.com
172.16.32.0|1|xy1010050035009.corp.xy.com
172.16.32.1|2047|xy1010050035014.corp.xy.com
172.16.40.0|2048|xy1010050034200.corp.xy.com
172.16.48.0|4096|xy1010050035009.corp.xy.com
172.16.64.0|1|xy1010050035014.corp.xy.com
172.16.64.1|2047|xy1010050035007.corp.xy.com
172.16.72.0|2048|xy1010050034204.corp.xy.com
172.16.80.0|1|xy1010050034200.corp.xy.com
172.16.80.1|2047|xy1010050035014.corp.xy.com
172.16.88.0|2048|xy1dra658.corp.xy.com
172.16.96.0|2048|xy1010050035014.corp.xy.com
172.16.104.0|2048|xy1dra656.corp.xy.com
172.16.112.0|1|xy1010050035014.corp.xy.com
172.16.112.1|2047|xy1010050035007.corp.xy.com
172.16.120.0|1|xy1dra656.corp.xy.com
172.16.120.1|2047|xy1010050035014.corp.xy.com
172.16.128.0|1|xy1010050035011.corp.xy.com
172.16.128.1|2047|xy1010050035014.corp.xy.com
172.16.136.0|2048|xy1010050035008.corp.xy.com
172.16.144.0|4096|xy1dra655.corp.xy.com
172.16.160.0|4096|xy1dra657.corp.xy.com
172.16.176.0|1|xy1010050035012.corp.xy.com
172.16.176.1|2047|xy1010050035014.corp.xy.com
172.16.184.0|2048|xy1010050035019.corp.xy.com
172.16.192.0|1|xy1010050035008.corp.xy.com
172.16.192.1|6143|xy1010050035011.corp.xy.com
172.16.216.0|1|xy1dra658.corp.xy.com
172.16.216.1|2047|xy1010050035014.corp.xy.com
172.16.224.0|1|xy1010050035012.corp.xy.com
172.16.224.1|1|xy1010050035010.corp.xy.com
172.16.224.2|2046|xy1010050035012.corp.xy.com
172.16.232.0|2048|xy1010050034205.corp.xy.com
172.16.240.0|2048|xy1010050035012.corp.xy.com
172.16.248.0|1|xy1dra655.corp.xy.com
172.16.248.1|2047|xy1010050035014.corp.xy.com

@murali-reddy
Contributor

Thanks for sharing the logs.

Quick question: we are planning to have coreDNS listen on the host network instead of an overlay IP, as the offending source address has always been associated with a coreDNS pod IP. Do you think that may help?

I don't see any reason why this problem would be particular to coreDNS. It might happen to any pod-to-pod communication over the overlay network.

Our dev cluster has cluster CIDR 172.200.0.0/24, while stage and prod use 172.16.0.0/16. We have never encountered this issue in dev, while everything else remains the same. Do you think this could be a contributing factor?

Contrary to your observation, 172.200.0.0/24 is not an RFC 1918 private IP range, so if anything I would expect that one to cause problems.

As far as I have seen, martian packets are typically the result of routing misconfigurations. Weave Net does very little routing configuration, as it deals with L2 switching, so it's hard to guess what the contributing factor could be. It's even harder to reproduce, unfortunately.
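(Two checks that usually narrow down the routing angle, sketched with the martian source IP from the dmesg line as a placeholder:)

# Which route (and device) the kernel would use to reach the martian source;
# if it is not the device the packet arrived on, strict reverse-path filtering
# treats the packet as a martian
ip route get <martian-source-ip>

# Policy-routing rules, in case traffic is being steered into another table
ip rule show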

@paphillon

Contrary to your observation, 172.200.0.0/24 is not an RFC 1918 private IP range, so if anything I would expect that one to cause problems.

Good point. Yes, you are right, this was our first dev cluster :)

I don't see any reason why this problem would be particular to coreDNS. It might happen to any pod-to-pod communication over the overlay network.

Agreed. However, whenever we saw this issue the offending pods were coreDNS, and on the 19th, the last time we were hit by it, stage as well as prod exhibited the same problem and the pods were coreDNS, so I think there might be some correlation.

As far as I have seen, martian packets are typically the result of routing misconfigurations.

Is there any other detail that might help to troubleshoot this?

@bboreham
Contributor

@paphillon in the middle of your logs is a list of IPs which starts with the gateway address on the bridge, but then includes some peer IPs. I am puzzled how this comes about.

To check, could you run ip addr show dev weave on that host and share what comes back?

@paphillon

@bboreham Thanks! Yes, that jumped out at me too, but I didn't have enough information to say whether it is expected or not.

Here is the output of the ip addr show command, as requested, for the host

ea:a4:90:36:8c:e4(xy1dra658.corp.xy.com)

10: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1337 qdisc noqueue state UP group default qlen 1000
    link/ether ea:a4:90:36:8c:e4 brd ff:ff:ff:ff:ff:ff
    inet 172.16.216.0/16 brd 172.16.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::e8a4:90ff:fe36:8ce4/64 scope link
       valid_lft forever preferred_lft forever

Another worker host shows similar output:

10: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1337 qdisc noqueue state UP group default qlen 1000
    link/ether c2:e7:e5:c4:3c:2d brd ff:ff:ff:ff:ff:ff
    inet 172.16.148.0/16 brd 172.16.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::c0e7:e5ff:fec4:3c2d/64 scope link
       valid_lft forever preferred_lft forever

@paphillon

@bboreham Do you think this issue may be related to #3620?

@bboreham
Contributor

bboreham commented Apr 1, 2019

Do you have that symptom?

@paphillon

@bboreham - I am not sure if we have the same issue. Is there a command I can run to find out whether the same IPs are assigned to a pod/node? During our setup, we did remove and re-add a couple of nodes.

To reproduce the martian error we tried to block all incoming and outgoing traffic on the node that runs coreDNS, and by doing so we saw the martian errors being logged. However, when the traffic was allowed back again the errors stopped, so it was only a partial success in reproducing the error. For now we have moved coreDNS to the host network, repeated the above test, and did not notice any martian errors in the logs.
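(For anyone repeating that experiment, one way to do that kind of blocking on a test node is sketched below; the iptables rules are an assumption about how the traffic was blocked, not necessarily what was actually run, and they must be removed afterwards:)

# Drop traffic to/from the other cluster nodes (test node only!)
iptables -I INPUT 1 -s <peer-node-cidr> -j DROP
iptables -I OUTPUT 1 -d <peer-node-cidr> -j DROP

# watch dmesg for martian messages, then restore connectivity
iptables -D INPUT -s <peer-node-cidr> -j DROP
iptables -D OUTPUT -d <peer-node-cidr> -j DROP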

@bboreham
Contributor

bboreham commented Apr 8, 2019

You can get pod IPs via kubectl get pods -o wide.

I can't think of a way to get the node gateway addresses short of visiting each node and running ip addr show dev weave.
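A rough sketch of scripting that visit, assuming the nodes are reachable over SSH under the names kubectl reports:

# Print each node's weave bridge address (requires SSH access to every node)
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$node "
  ssh "$node" ip -4 addr show dev weave | awk '/inet /{print $2}'
done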

@paphillon

Pod IPs below. I did not find any node/pod IP conflict though.

[user@xy1010050035017 ~]$ kubectl get pods -o wide --all-namespaces | grep 172.16
app-admin           	app-admin-client-0                    1/1     Running   0          4d1h   172.16.88.14    xy1dra658.corp.xy.com         <none>
app-admin           	app-admin-client-1                    1/1     Running   0          4d2h   172.16.144.21   xy1dra655.corp.xy.com         <none>
app-admin           	app-admin-client-2                    1/1     Running   0          4d2h   172.16.104.14   xy1dra656.corp.xy.com         <none>
app-core            	app-kafka-console-admin-0             1/1     Running   0          4d2h   172.16.144.22   xy1dra655.corp.xy.com         <none>
app-core            	app-monitor-6f7d4549bb-6tqhh          2/2     Running   0          4d1h   172.16.88.12    xy1dra658.corp.xy.com         <none>
app-pub            	 	app-rest-pub-0                        2/2     Running   0          4d2h   172.16.104.13   xy1dra656.corp.xy.com         <none>
app-pub             	app-rest-pub-1                        2/2     Running   0          4d1h   172.16.88.13    xy1dra658.corp.xy.com         <none>
app-pub             	app-rest-pub-2                        2/2     Running   0          4d2h   172.16.144.20   xy1dra655.corp.xy.com         <none>
kube-system           heapster-79649856bb-zhwhc               1/1     Running   0          4d3h   172.16.136.4    xy1010050035008.corp.xy.com   <none>
kube-system           kubernetes-dashboard-7b5f4695c4-nn598   1/1     Running   0          4d1h   172.16.88.11    xy1dra658.corp.xy.com         <none>
kube-system           monitoring-influxdb-54594499c5-h8658    1/1     Running   0          4d2h   172.16.16.23    xy1010050035010.corp.xy.com   <none>
kube-system           tiller-deploy-57fdc789bd-zqhwm          1/1     Running   0          4d3h   172.16.48.16    xy1010050035009.corp.xy.com   <none>
monitoring            alertmanager-7bd44bfc96-6v6h9           1/1     Running   0          4d2h   172.16.16.25    xy1010050035010.corp.xy.com   <none>
monitoring            grafana-747f9bf496-stj9s                2/2     Running   0          4d4h   172.16.64.18    xy1010050035007.corp.xy.com   <none>
monitoring            kube-state-metrics-794cbf686b-b84mc     4/4     Running   0          4d3h   172.16.48.17    xy1010050035009.corp.xy.com   <none>
monitoring            monitoring.kubewatch-6bc78f8bb4-m4f7f   1/1     Running   0          4d2h   172.16.16.24    xy1010050035010.corp.xy.com   <none>
monitoring            node-problem-detector-6c68w             1/1     Running   3          124d   172.16.64.17    xy1010050035007.corp.xy.com   <none>
monitoring            node-problem-detector-7qllk             1/1     Running   3          124d   172.16.40.3     xy1010050034200.corp.xy.com   <none>
monitoring            node-problem-detector-c4sjz             1/1     Running   5          124d   172.16.160.6    xy1dra657.corp.xy.com         <none>
monitoring            node-problem-detector-d2pll             1/1     Running   3          124d   172.16.192.6    xy1010050035011.corp.xy.com   <none>
monitoring            node-problem-detector-gwzdv             1/1     Running   4          124d   172.16.72.4     xy1010050034204.corp.xy.com   <none>
monitoring            node-problem-detector-k7whw             1/1     Running   3          124d   172.16.224.7    xy1010050035012.corp.xy.com   <none>
monitoring            node-problem-detector-ks7qd             1/1     Running   3          124d   172.16.232.4    xy1010050034205.corp.xy.com   <none>
monitoring            node-problem-detector-n5k9j             1/1     Running   5          124d   172.16.88.10    xy1dra658.corp.xy.com         <none>
monitoring            node-problem-detector-nmv5c             1/1     Running   6          124d   172.16.104.12   xy1dra656.corp.xy.com         <none>
monitoring            node-problem-detector-nnbdq             1/1     Running   4          124d   172.16.16.22    xy1010050035010.corp.xy.com   <none>
monitoring            node-problem-detector-nx5gj             1/1     Running   3          124d   172.16.48.14    xy1010050035009.corp.xy.com   <none>
monitoring            node-problem-detector-qlj6v             1/1     Running   3          124d   172.16.0.5      xy1010050035014.corp.xy.com   <none>
monitoring            node-problem-detector-qs9g8             1/1     Running   3          124d   172.16.136.3    xy1010050035008.corp.xy.com   <none>
monitoring            node-problem-detector-w6qn5             1/1     Running   3          124d   172.16.184.4    xy1010050035019.corp.xy.com   <none>
monitoring            node-problem-detector-whr9h             1/1     Running   5          124d   172.16.144.18   xy1dra655.corp.xy.com         <none>
monitoring            prometheus-7ffbf96956-d22tx             2/2     Running   0          4d3h   172.16.144.19   xy1dra655.corp.xy.com         <none>

Node IPs (dev weave)
The only odd ones are four that end with .1 and one with .2; is that expected?

xy1dra655.corp.xy.com 172.16.248.0/16
xy1dra656.corp.xy.com 172.16.120.0/16
xy1dra657.corp.xy.com 172.16.160.0/16
xy1dra658.corp.xy.com 172.16.216.0/16
xy1010050034200.corp.xy.com 172.16.80.0/16
xy1010050034204.corp.xy.com 172.16.72.0/16
xy1010050034205.corp.xy.com 172.16.232.0/16
xy1010050035007.corp.xy.com 172.16.64.1/16
xy1010050035008.corp.xy.com 172.16.192.0/16
xy1010050035009.corp.xy.com 172.16.32.0/16
xy1010050035010.corp.xy.com 172.16.224.1/16
xy1010050035019.corp.xy.com 172.16.0.2/16
xy1010050035011.corp.xy.com 172.16.192.1/16
xy1010050035012.corp.xy.com 172.16.224.0/16
xy1010050035014.corp.xy.com 172.16.64.0/16

I do see them contributing to the martian errors:

Apr  6 21:03:30 xy1010050035011 kernel: IPv4: martian source 172.16.64.1 from 172.16.64.19, on dev datapath
Apr  6 21:06:16 xy1010050035008 kernel: IPv4: martian source 172.16.64.19 from 172.16.192.1, on dev datapath

@paphillon

@bboreham - We had a brief network outage today, about 10 minutes, where some nodes were not able to communicate with each other or were really slow, and the result was the same: martian errors from pods running on two nodes with Weave addresses. The only way to resolve it was to drain those nodes and reboot.

@YanzheL

YanzheL commented Mar 13, 2020

Same issue here.

We have a node that cannot communicate with pods on other nodes, and TCP connections also fail with "No route to host".

This issue happens 30-60+ minutes after the host boots, and it's unpredictable. The only way to resolve it is a reboot.

Weave Net version: 2.6.1
Kubernetes version: 1.17.3
Host: Ubuntu 18.04 LTS
Kernel: 4.15.0-88-generic
The Kubernetes serviceSubnet is 10.0.0.0/12; the Weave Net subnet is the default 10.32.0.0/12.

Pinging another pod's IP:

PING 10.42.0.10 (10.42.0.10) 56(84) bytes of data.
From 10.37.0.1 icmp_seq=1 Destination Host Unreachable
From 10.37.0.1 icmp_seq=2 Destination Host Unreachable
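("Destination Host Unreachable" reported from the local bridge address 10.37.0.1 usually means ARP resolution for the target never completed. A quick check on the affected node, using the pod IP from the ping above:)

# Neighbour (ARP) entries learned on the weave bridge; a FAILED or INCOMPLETE
# entry for 10.42.0.10 would confirm that ARP replies are not coming back
ip neigh show dev weave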

Lots of martian logs:

[13225.481477] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13225.485221] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13226.492261] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13226.495105] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13227.516292] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13227.520102] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13229.116534] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13229.118467] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13230.140568] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13230.144627] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13231.168156] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13231.170576] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13232.077072] IPv4: martian source 10.42.0.4 from 10.37.0.13, on dev eth0
[13232.077234] IPv4: martian source 10.32.0.58 from 10.37.0.13, on dev eth0
[13232.077631] IPv4: martian source 10.38.0.2 from 10.37.0.13, on dev eth0
[13232.077635] IPv4: martian source 10.41.0.23 from 10.37.0.13, on dev eth0
[13232.077639] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.077644] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.078089] IPv4: martian source 10.44.128.6 from 10.37.0.13, on dev eth0
[13232.078094] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.078484] IPv4: martian source 10.44.0.3 from 10.37.0.13, on dev eth0
[13232.078488] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.078883] IPv4: martian source 10.46.0.4 from 10.37.0.13, on dev eth0
[13232.078887] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.080465] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.106788] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13232.767115] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13232.771176] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13233.084547] IPv4: martian source 10.42.0.4 from 10.37.0.13, on dev eth0
[13233.084588] IPv4: martian source 10.46.0.4 from 10.37.0.13, on dev eth0
[13233.088171] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13233.089261] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.120759] net_ratelimit: 29 callbacks suppressed
[13237.120763] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13237.125024] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13237.180616] IPv4: martian source 10.42.0.4 from 10.37.0.13, on dev eth0
[13237.184236] IPv4: martian source 10.46.0.4 from 10.37.0.13, on dev eth0
[13237.185068] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.189588] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.189682] IPv4: martian source 10.44.128.6 from 10.37.0.13, on dev eth0
[13237.194244] IPv4: martian source 10.44.0.3 from 10.37.0.13, on dev eth0
[13237.197311] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.201721] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.203166] IPv4: martian source 10.41.0.23 from 10.37.0.13, on dev eth0
[13237.204637] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.206051] IPv4: martian source 10.38.0.2 from 10.37.0.13, on dev eth0
[13237.207453] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13237.209091] IPv4: martian source 10.32.0.58 from 10.37.0.13, on dev eth0
[13237.210725] ll header: 00000000: ff ff ff ff ff ff 72 8d 36 e8 84 fd 08 06        ......r.6.....
[13238.140459] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13238.143368] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
[13240.032970] IPv4: martian source 10.32.0.68 from 10.37.0.1, on dev datapath
[13240.036274] ll header: 00000000: ff ff ff ff ff ff a2 a9 9c 3c 7d b8 08 06        .........<}...
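(Side note on the "net_ratelimit: 29 callbacks suppressed" line above: these martian messages go through the kernel's network message rate limiter, which is also part of why they only show up some of the time. The limiter can be inspected, and relaxed while debugging, via sysctl; the value below is only an illustration:)

# Current rate-limit settings for network warning messages
sysctl net.core.message_cost net.core.message_burst

# Example: allow a larger burst of messages while debugging
sysctl -w net.core.message_burst=50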

Full ip addr output:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:ff:30:cf brd ff:ff:ff:ff:ff:ff
    inet 10.245.146.207/24 brd 10.245.146.255 scope global dynamic ens3
       valid_lft 309sec preferred_lft 309sec
    inet6 fe80::5054:ff:feff:30cf/64 scope link 
       valid_lft forever preferred_lft forever
3: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default 
    link/ether 3a:b6:9b:6c:2d:4f brd ff:ff:ff:ff:ff:ff
    inet 10.6.101.153/32 brd 10.6.101.153 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.2.58.191/32 brd 10.2.58.191 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.217.234/32 brd 10.3.217.234 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.14.152.106/32 brd 10.14.152.106 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.6.206.54/32 brd 10.6.206.54 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.13.140.2/32 brd 10.13.140.2 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.4.39.249/32 brd 10.4.39.249 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.202.155/32 brd 10.8.202.155 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.1.153.71/32 brd 10.1.153.71 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.166.143/32 brd 10.10.166.143 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.85.6/32 brd 10.3.85.6 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.160.252/32 brd 10.0.160.252 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.2.138.242/32 brd 10.2.138.242 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.64.68/32 brd 10.5.64.68 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.12.106.25/32 brd 10.12.106.25 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.7.83.155/32 brd 10.7.83.155 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.4.15.144/32 brd 10.4.15.144 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.7.72.209/32 brd 10.7.72.209 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.64.80/32 brd 10.5.64.80 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.2.100.200/32 brd 10.2.100.200 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.12.224.11/32 brd 10.12.224.11 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.11.51.149/32 brd 10.11.51.149 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.15.44.169/32 brd 10.15.44.169 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.7.193.249/32 brd 10.7.193.249 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.253.200/32 brd 10.10.253.200 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.6.107.58/32 brd 10.6.107.58 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.173.80/32 brd 10.5.173.80 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.4.187.121/32 brd 10.4.187.121 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.4.2.55/32 brd 10.4.2.55 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.207.7/32 brd 10.5.207.7 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.1.253.133/32 brd 10.1.253.133 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.10/32 brd 10.0.0.10 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.13.37.201/32 brd 10.13.37.201 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.7.103.123/32 brd 10.7.103.123 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.13.1.179/32 brd 10.13.1.179 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.178.151/32 brd 10.10.178.151 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.12.164.220/32 brd 10.12.164.220 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.15.51.86/32 brd 10.15.51.86 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.15.0.41/32 brd 10.15.0.41 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.11.235.61/32 brd 10.11.235.61 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.2.186.122/32 brd 10.2.186.122 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.8.183.82/32 brd 10.8.183.82 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.0.1/32 brd 10.0.0.1 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.2.140.95/32 brd 10.2.140.95 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.14.37.97/32 brd 10.14.37.97 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.14.36.6/32 brd 10.14.36.6 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.14.244.186/32 brd 10.14.244.186 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.172.56/32 brd 10.10.172.56 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.117.144/32 brd 10.0.117.144 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.226.178/32 brd 10.5.226.178 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.6.156.9/32 brd 10.6.156.9 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.232.115/32 brd 10.0.232.115 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.7.95.46/32 brd 10.7.95.46 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.5.111.37/32 brd 10.5.111.37 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.3.120.120/32 brd 10.3.120.120 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.1.229.71/32 brd 10.1.229.71 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.1.150.145/32 brd 10.1.150.145 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.13.121.0/32 brd 10.13.121.0 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.0.124.154/32 brd 10.0.124.154 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.16.142/32 brd 10.10.16.142 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.6.56.245/32 brd 10.6.56.245 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
    inet 10.10.240.173/32 brd 10.10.240.173 scope global kube-ipvs0
       valid_lft forever preferred_lft forever
4: datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether be:be:1e:ed:ca:ef brd ff:ff:ff:ff:ff:ff
    inet6 fe80::bcbe:1eff:feed:caef/64 scope link 
       valid_lft forever preferred_lft forever
6: weave: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue state UP group default qlen 1000
    link/ether a2:a9:9c:3c:7d:b8 brd ff:ff:ff:ff:ff:ff
    inet 10.37.0.1/12 brd 10.47.255.255 scope global weave
       valid_lft forever preferred_lft forever
    inet6 fe80::a0a9:9cff:fe3c:7db8/64 scope link 
       valid_lft forever preferred_lft forever
8: vethwe-datapath@vethwe-bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master datapath state UP group default 
    link/ether b2:bc:14:8f:c7:54 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::b0bc:14ff:fe8f:c754/64 scope link 
       valid_lft forever preferred_lft forever
9: vethwe-bridge@vethwe-datapath: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 36:96:ad:ae:12:17 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::3496:adff:feae:1217/64 scope link 
       valid_lft forever preferred_lft forever
10: vxlan-6784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65535 qdisc noqueue master datapath state UNKNOWN group default qlen 1000
    link/ether fe:a9:2e:36:96:5b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fca9:2eff:fe36:965b/64 scope link 
       valid_lft forever preferred_lft forever
12: vethwepl18713f4@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether f6:4f:18:a0:e0:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::f44f:18ff:fea0:e0e3/64 scope link 
       valid_lft forever preferred_lft forever
16: vethwepl2a4b84d@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 92:16:33:47:8c:20 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::9016:33ff:fe47:8c20/64 scope link 
       valid_lft forever preferred_lft forever
18: vethwepl8f96a10@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether ba:9d:49:e8:e7:ec brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::b89d:49ff:fee8:e7ec/64 scope link 
       valid_lft forever preferred_lft forever
20: vethwepl7189f3c@if19: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether ae:84:51:c2:9e:3e brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::ac84:51ff:fec2:9e3e/64 scope link 
       valid_lft forever preferred_lft forever
22: vethwepl933406f@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 3a:75:57:aa:b1:63 brd ff:ff:ff:ff:ff:ff link-netnsid 5
    inet6 fe80::3875:57ff:feaa:b163/64 scope link 
       valid_lft forever preferred_lft forever
24: vethwepl0dd8eaf@if23: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 9a:23:34:c6:60:66 brd ff:ff:ff:ff:ff:ff link-netnsid 6
    inet6 fe80::9823:34ff:fec6:6066/64 scope link 
       valid_lft forever preferred_lft forever
26: vethweple1de9cf@if25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 16:be:df:fd:1e:90 brd ff:ff:ff:ff:ff:ff link-netnsid 7
    inet6 fe80::14be:dfff:fefd:1e90/64 scope link 
       valid_lft forever preferred_lft forever
30: vethwepl253e15b@if29: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 76:16:3b:2a:95:fd brd ff:ff:ff:ff:ff:ff link-netnsid 9
    inet6 fe80::7416:3bff:fe2a:95fd/64 scope link 
       valid_lft forever preferred_lft forever
32: vethweplcbbd076@if31: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether 76:65:9b:12:13:75 brd ff:ff:ff:ff:ff:ff link-netnsid 10
    inet6 fe80::7465:9bff:fe12:1375/64 scope link 
       valid_lft forever preferred_lft forever
34: vethwepl994227f@if33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether d6:1b:21:2e:4c:5a brd ff:ff:ff:ff:ff:ff link-netnsid 11
    inet6 fe80::d41b:21ff:fe2e:4c5a/64 scope link 
       valid_lft forever preferred_lft forever
36: vethwepld444910@if35: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1376 qdisc noqueue master weave state UP group default 
    link/ether d2:a7:1c:67:d6:c0 brd ff:ff:ff:ff:ff:ff link-netnsid 12
    inet6 fe80::d0a7:1cff:fe67:d6c0/64 scope link 
       valid_lft forever preferred_lft forever
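
In case it helps anyone reading the martian logs against this output: in each "ll header" line the first six bytes are the destination MAC (ff ff ff ff ff ff, i.e. broadcast), the next six are the sender MAC, and the trailing "08 06" is the ARP EtherType. A quick way to map that sender MAC back to one of the interfaces above (the MAC below is just a placeholder, substitute the one from your own log; this only finds it if the sender is on the same node):

ip -br link | grep -i 'aa:bb:cc:dd:ee:ff'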

sysctl --system shows:

* Applying /etc/sysctl.d/10-console-messages.conf ...
kernel.printk = 4 4 1 7
* Applying /etc/sysctl.d/10-ipv6-privacy.conf ...
* Applying /etc/sysctl.d/10-kernel-hardening.conf ...
kernel.kptr_restrict = 1
* Applying /etc/sysctl.d/10-link-restrictions.conf ...
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
* Applying /etc/sysctl.d/10-lxd-inotify.conf ...
fs.inotify.max_user_instances = 1024
* Applying /etc/sysctl.d/10-magic-sysrq.conf ...
kernel.sysrq = 176
* Applying /etc/sysctl.d/10-network-security.conf ...
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.tcp_syncookies = 1
* Applying /etc/sysctl.d/10-ptrace.conf ...
kernel.yama.ptrace_scope = 1
* Applying /etc/sysctl.d/10-zeropage.conf ...
vm.mmap_min_addr = 65536
* Applying /usr/lib/sysctl.d/50-default.conf ...
net.ipv4.conf.all.promote_secondaries = 1
net.core.default_qdisc = fq_codel
* Applying /etc/sysctl.d/60-k8s.conf ...
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
* Applying /etc/sysctl.d/60-kernel-hardening.conf ...
kernel.kptr_restrict = 1
kernel.yama.ptrace_scope = 1
kernel.perf_event_paranoid = 2
kernel.randomize_va_space = 2
vm.mmap_min_addr = 65536
kernel.panic = 10
kernel.sysrq = 176
* Applying /etc/sysctl.d/60-net-mem-tune.conf ...
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
* Applying /etc/sysctl.d/60-net-misc.conf ...
net.ipv4.ip_forward = 1
net.ipv4.neigh.default.gc_stale_time = 120
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_announce = 2
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.log_martians = 1
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 5
net.core.somaxconn = 16384
net.core.netdev_max_backlog = 4096
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_mtu_probing = 1
net.ipv4.ip_no_pmtu_disc = 1
net.ipv6.conf.all.forwarding = 1
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.default.disable_ipv6 = 0
net.ipv6.conf.all.proxy_ndp = 1
net.ipv6.conf.all.use_tempaddr = 0
net.ipv6.conf.default.use_tempaddr = 0
* Applying /etc/sysctl.d/60-sys-tune.conf ...
fs.inotify.max_user_watches = 524288
vm.swappiness = 0
vm.overcommit_memory = 1
fs.file-max = 51200
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches = 524288
kernel.printk = 4 4 1 7
fs.protected_hardlinks = 1
fs.protected_symlinks = 1
* Applying /etc/sysctl.d/99-sysctl.conf ...
* Applying /etc/sysctl.conf ...
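
Since later files in sysctl.d override earlier ones, the values that end up in effect can differ from what any single file sets (here 10-network-security.conf turns rp_filter on and 60-net-misc.conf turns it back off). A quick way to check the effective settings for the devices that show up in the martian logs (interface names taken from the ip addr output above; as far as I understand, the kernel logs a martian when log_martians is enabled on either "all" or the receiving device):

sysctl net.ipv4.conf.all.rp_filter \
       net.ipv4.conf.datapath.rp_filter \
       net.ipv4.conf.weave.rp_filter \
       net.ipv4.conf.all.log_martians \
       net.ipv4.conf.datapath.log_martians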

@ghost
Copy link

ghost commented Sep 24, 2020

I'm dealing with this issue as well. We have Kubernetes clusters on AWS, Azure, and GCP, and we only see the martian source errors on GCP. These are also the only clusters that occasionally develop network trouble on a single node, which requires restarting weave to fix. The logs are similar to the previous posts.

[181603.734056] IPv4: martian source 172.16.160.12 from 172.16.248.0, on dev datapath
[181603.734059] ll header: 00000000: ff ff ff ff ff ff 6a 1c 4e 92 9b 93 08 06
[181603.734818] IPv4: martian source 172.16.248.12 from 172.16.160.12, on dev datapath
[181603.734848] ll header: 00000000: ff ff ff ff ff ff 3e d7 bd 6c 07 2d 08 06
[181603.820470] IPv4: martian source 172.16.160.0 from 172.16.160.2, on dev datapath
[181603.820474] ll header: 00000000: ff ff ff ff ff ff 5a bb 1e 31 5f 48 08 06
[181603.820568] IPv4: martian source 172.16.160.2 from 172.16.0.4, on dev datapath
[181603.820571] ll header: 00000000: ff ff ff ff ff ff 52 84 98 04 d9 11 08 06
[181609.911887] IPv4: martian source 172.16.72.2 from 172.16.0.4, on dev datapath
[181609.911925] ll header: 00000000: ff ff ff ff ff ff 52 84 98 04 d9 11 08 06
[181611.563568] IPv4: martian source 172.16.160.0 from 172.16.160.3, on dev datapath
[181611.563571] ll header: 00000000: ff ff ff ff ff ff 22 47 32 8c 5e f3 08 06
[181616.535650] IPv4: martian source 172.16.136.7 from 172.16.248.0, on dev datapath
[181616.535654] ll header: 00000000: ff ff ff ff ff ff 6a 1c 4e 92 9b 93 08 06
[181616.535749] IPv4: martian source 172.16.248.12 from 172.16.248.17, on dev datapath
[181616.535750] ll header: 00000000: ff ff ff ff ff ff da 6c 5c 7d 6e 40 08 06
[181622.566883] IPv4: martian source 172.16.16.0 from 172.16.248.17, on dev datapath
[181622.566913] ll header: 00000000: ff ff ff ff ff ff da 6c 5c 7d 6e 40 08 06
[181623.294772] IPv4: martian source 172.16.32.1 from 172.16.0.4, on dev datapath
[181623.294808] ll header: 00000000: ff ff ff ff ff ff 52 84 98 04 d9 11 08 06
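
For what it's worth, a way to confirm on an affected node that these really are broadcast ARP frames (the trailing "08 06" in each ll header is the ARP EtherType) is to capture on the datapath device while the messages are being logged, assuming tcpdump is available on the host:

tcpdump -ni datapath -e arp and ether dst ff:ff:ff:ff:ff:ff

The -e flag prints the link-level header, so the source MACs can be matched against the ll header lines above.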
