This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

[SCALING] weaver connections to peers result in various errors at larger cluster sizes #3595

Closed
murali-reddy opened this issue Feb 6, 2019 · 3 comments

Comments

@murali-reddy
Contributor

murali-reddy commented Feb 6, 2019

What you expected to happen?

On an empty cluster with no workload (data plane) traffic, only Weave Net control plane traffic, the weave-net pods should scale to hundreds and even thousands of nodes.

What happened?

At a cluster size of 200 nodes, with sufficient memory and CPU requested for the weave-net pods (to avoid #3593), the weaver process does not crash, but it fails to connect to peers for various reasons.

/weave --local status connections
-> 172.20.33.99:6783     pending     none   8a:cd:59:87:41:6c(ip-172-20-33-99.us-west-2.compute.internal)
-> 172.20.40.56:6783     pending     fastdp e2:c2:65:63:da:3a(ip-172-20-40-56.us-west-2.compute.internal) mtu=8912
-> 172.20.81.32:6783     pending     fastdp 56:71:f2:cc:2c:86(ip-172-20-81-32.us-west-2.compute.internal) mtu=8912
-> 172.20.50.215:6783    pending     fastdp fa:7d:34:69:28:a7(ip-172-20-50-215.us-west-2.compute.internal) mtu=8912
<- 172.20.40.229:33689   pending     fastdp 5e:37:6f:54:38:49(ip-172-20-40-229.us-west-2.compute.internal) mtu=8912
-> 172.20.65.70:6783     pending     fastdp d6:a4:f5:d1:30:c9(ip-172-20-65-70.us-west-2.compute.internal) mtu=8912
-> 172.20.75.16:6783     pending     none   aa:8d:ef:6d:29:92(ip-172-20-75-16.us-west-2.compute.internal)
<- 172.20.69.126:36809   pending     fastdp 7e:3b:25:46:24:04(ip-172-20-69-126.us-west-2.compute.internal) mtu=8912
-> 172.20.59.101:6783    pending     none   d6:7d:7e:c7:44:74(ip-172-20-59-101.us-west-2.compute.internal)
-> 172.20.93.123:6783    pending     none   8a:0c:c3:37:d0:6f(ip-172-20-93-123.us-west-2.compute.internal)
-> 172.20.74.170:6783    pending     fastdp 4a:94:72:d7:30:58(ip-172-20-74-170.us-west-2.compute.internal) mtu=8912
<- 172.20.45.166:43416   pending     none   1a:dc:ba:6a:42:26(ip-172-20-45-166.us-west-2.compute.internal)
-> 172.20.69.190:6783    pending     fastdp 76:e0:b4:17:d5:22(ip-172-20-69-190.us-west-2.compute.internal) mtu=8912
-> 172.20.88.20:6783     pending     fastdp 12:9b:1a:7e:6a:71(ip-172-20-88-20.us-west-2.compute.internal) mtu=8912
-> 172.20.45.208:6783    pending     none   4a:75:9f:b7:84:7e(ip-172-20-45-208.us-west-2.compute.internal)
-> 172.20.72.144:6783    pending     fastdp 56:6d:a8:7a:af:03(ip-172-20-72-144.us-west-2.compute.internal) mtu=8912
-> 172.20.36.159:6783    pending     none   c2:d3:01:28:80:72(ip-172-20-36-159.us-west-2.compute.internal)
-> 172.20.64.32:6783     pending     fastdp a6:6d:d9:2a:b5:79(ip-172-20-64-32.us-west-2.compute.internal) mtu=8912
-> 172.20.85.79:6783     pending     fastdp e6:36:fd:17:aa:b7(ip-172-20-85-79.us-west-2.compute.internal) mtu=8912
-> 172.20.56.230:6783    pending     fastdp 56:f8:f4:12:0b:38(ip-172-20-56-230.us-west-2.compute.internal) mtu=8912
-> 172.20.64.121:6783    pending     none   22:ef:c3:40:98:58(ip-172-20-64-121.us-west-2.compute.internal)
-> 172.20.83.5:6783      pending     fastdp e2:04:e3:3f:13:fc(ip-172-20-83-5.us-west-2.compute.internal) mtu=8912
-> 172.20.43.56:6783     pending     fastdp ca:cc:05:e3:aa:87(ip-172-20-43-56.us-west-2.compute.internal) mtu=8912
-> 172.20.88.57:6783     pending     none   f6:21:b0:40:35:a9(ip-172-20-88-57.us-west-2.compute.internal)
-> 172.20.52.121:6783    pending     fastdp 2e:92:d8:f1:fc:5b(ip-172-20-52-121.us-west-2.compute.internal) mtu=8912
-> 172.20.58.243:6783    pending     fastdp a2:50:a5:8a:03:dc(ip-172-20-58-243.us-west-2.compute.internal) mtu=8912
<- 172.20.48.123:27466   established sleeve 9e:a2:5f:dc:a1:6d(ip-172-20-48-123.us-west-2.compute.internal) mtu=1437
-> 172.20.48.130:6783    pending     fastdp de:92:89:48:4d:52(ip-172-20-48-130.us-west-2.compute.internal) mtu=8912
-> 172.20.45.127:6783    pending     fastdp 42:05:d1:06:28:1e(ip-172-20-45-127.us-west-2.compute.internal) mtu=8912
-> 172.20.79.29:6783     pending     fastdp ca:40:cb:58:5a:f0(ip-172-20-79-29.us-west-2.compute.internal) mtu=8912
-> 172.20.69.171:6783    pending     none   2e:40:0b:84:0e:8c(ip-172-20-69-171.us-west-2.compute.internal)
-> 172.20.61.156:6783    pending     fastdp 0a:47:05:10:4f:70(ip-172-20-61-156.us-west-2.compute.internal) mtu=8912
-> 172.20.81.74:6783     pending     none   5e:23:ab:aa:d5:3c(ip-172-20-81-74.us-west-2.compute.internal)
-> 172.20.46.216:6783    pending     none   42:5b:69:b4:24:2a(ip-172-20-46-216.us-west-2.compute.internal)
-> 172.20.75.225:6783    pending     fastdp ce:96:a9:f7:15:88(ip-172-20-75-225.us-west-2.compute.internal) mtu=8912
-> 172.20.89.254:6783    pending     fastdp 82:33:3d:60:1a:2b(ip-172-20-89-254.us-west-2.compute.internal) mtu=8912
-> 172.20.75.121:6783    pending     fastdp 96:84:dc:c3:d1:f0(ip-172-20-75-121.us-west-2.compute.internal) mtu=8912
-> 172.20.75.26:6783     pending     none   4e:0b:a9:76:b3:1d(ip-172-20-75-26.us-west-2.compute.internal)
-> 172.20.41.21:6783     pending     none   4a:e2:c0:b7:5d:88(ip-172-20-41-21.us-west-2.compute.internal)
-> 172.20.65.249:6783    pending     fastdp de:82:f3:f3:ab:f6(ip-172-20-65-249.us-west-2.compute.internal) mtu=8912
-> 172.20.46.204:6783    pending     none   8e:86:e0:dd:07:18(ip-172-20-46-204.us-west-2.compute.internal)
-> 172.20.90.248:6783    established sleeve 42:a8:69:50:44:ca(ip-172-20-90-248.us-west-2.compute.internal) mtu=1438
-> 172.20.40.132:6783    pending     fastdp d6:f2:39:10:c6:0e(ip-172-20-40-132.us-west-2.compute.internal) mtu=8912
-> 172.20.95.154:6783    pending     fastdp 4e:61:6d:18:12:da(ip-172-20-95-154.us-west-2.compute.internal) mtu=8912
-> 172.20.46.243:6783    pending     fastdp 86:1a:15:d8:8f:54(ip-172-20-46-243.us-west-2.compute.internal) mtu=8912
<- 172.20.51.54:10407    pending     fastdp 9e:bf:f4:6b:0c:0d(ip-172-20-51-54.us-west-2.compute.internal) mtu=8912
-> 172.20.83.88:6783     pending     fastdp 56:57:ec:b4:60:12(ip-172-20-83-88.us-west-2.compute.internal) mtu=8912
-> 172.20.63.88:6783     pending     none   b6:f4:35:d1:b8:19(ip-172-20-63-88.us-west-2.compute.internal)
-> 172.20.68.113:6783    pending     fastdp 06:19:80:55:f3:ba(ip-172-20-68-113.us-west-2.compute.internal) mtu=8912
<- 172.20.88.40:30522    pending     fastdp 4a:11:bf:1d:a0:a6(ip-172-20-88-40.us-west-2.compute.internal) mtu=8912
-> 172.20.54.141:6783    pending     fastdp 32:ba:e2:0e:aa:59(ip-172-20-54-141.us-west-2.compute.internal) mtu=8912
-> 172.20.73.221:6783    established sleeve d6:0d:28:bb:9a:49(ip-172-20-73-221.us-west-2.compute.internal) mtu=1438
-> 172.20.38.39:6783     pending     none   ee:bf:a1:fd:91:de(ip-172-20-38-39.us-west-2.compute.internal)
<- 172.20.51.128:38633   pending     fastdp 62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal) mtu=8912
-> 172.20.84.9:6783      pending     none   ce:dd:52:c9:83:5d(ip-172-20-84-9.us-west-2.compute.internal)
-> 172.20.70.164:6783    pending     fastdp 1e:01:c6:d7:cc:c7(ip-172-20-70-164.us-west-2.compute.internal) mtu=8912
-> 172.20.82.214:6783    pending     fastdp 42:53:72:c1:f2:3c(ip-172-20-82-214.us-west-2.compute.internal) mtu=8912
-> 172.20.57.153:6783    pending     fastdp 3a:c5:86:76:4d:94(ip-172-20-57-153.us-west-2.compute.internal) mtu=8912
<- 172.20.95.36:37232    pending     fastdp 72:3d:d3:67:52:9d(ip-172-20-95-36.us-west-2.compute.internal) mtu=8912
-> 172.20.74.165:6783    pending     none   9e:21:58:fb:c8:25(ip-172-20-74-165.us-west-2.compute.internal)
-> 172.20.62.156:6783    pending     fastdp 26:bb:0f:03:a7:8e(ip-172-20-62-156.us-west-2.compute.internal) mtu=8912
<- 172.20.39.207:20149   pending     fastdp 5e:26:06:43:af:f2(ip-172-20-39-207.us-west-2.compute.internal) mtu=8912
<- 172.20.55.102:28813   established sleeve fa:ab:a8:92:f1:5e(ip-172-20-55-102.us-west-2.compute.internal) mtu=1438
-> 172.20.93.234:6783    pending     none   12:72:9e:fc:e2:b9(ip-172-20-93-234.us-west-2.compute.internal)
-> 172.20.65.177:6783    pending     none   72:b6:f5:2a:14:46(ip-172-20-65-177.us-west-2.compute.internal)
-> 172.20.46.208:6783    pending     none   5e:d6:97:db:64:49(ip-172-20-46-208.us-west-2.compute.internal)
-> 172.20.89.63:6783     pending     none   6e:3c:09:6a:b7:86(ip-172-20-89-63.us-west-2.compute.internal)
-> 172.20.73.2:6783      established sleeve f2:51:2a:fc:5d:0c(ip-172-20-73-2.us-west-2.compute.internal) mtu=1438
-> 172.20.70.91:6783     pending     fastdp fe:cb:15:e1:f2:e0(ip-172-20-70-91.us-west-2.compute.internal) mtu=8912
-> 172.20.88.40:6783     pending     none   4a:11:bf:1d:a0:a6(ip-172-20-88-40.us-west-2.compute.internal)
<- 172.20.33.134:59783   pending     none   92:bc:13:9e:7e:ad(ip-172-20-33-134.us-west-2.compute.internal)
-> 172.20.34.244:6783    pending     none   ce:e9:ac:27:93:3d(ip-172-20-34-244.us-west-2.compute.internal)
<- 172.20.84.9:36887     pending     fastdp ce:dd:52:c9:83:5d(ip-172-20-84-9.us-west-2.compute.internal) mtu=8912
-> 172.20.56.9:6783      pending     fastdp da:35:de:fc:40:00(ip-172-20-56-9.us-west-2.compute.internal) mtu=8912
<- 172.20.56.9:17141     pending     fastdp da:35:de:fc:40:00(ip-172-20-56-9.us-west-2.compute.internal) mtu=8912
-> 172.20.69.231:6783    pending     none   e6:58:ad:01:f1:12(ip-172-20-69-231.us-west-2.compute.internal)
<- 172.20.43.11:49382    established sleeve 1a:5d:b4:d6:0f:c0(ip-172-20-43-11.us-west-2.compute.internal) mtu=1438
-> 172.20.72.9:6783      pending     none   7a:0a:3d:a8:30:6b(ip-172-20-72-9.us-west-2.compute.internal)
-> 172.20.49.14:6783     pending     fastdp c6:dd:8f:bb:13:e9(ip-172-20-49-14.us-west-2.compute.internal) mtu=8912
<- 172.20.85.147:27084   pending     none   da:b6:41:b8:30:a3(ip-172-20-85-147.us-west-2.compute.internal)
<- 172.20.74.165:41335   pending     none   9e:21:58:fb:c8:25(ip-172-20-74-165.us-west-2.compute.internal)
<- 172.20.85.126:28186   established sleeve 2a:b6:1d:fc:83:77(ip-172-20-85-126.us-west-2.compute.internal) mtu=1438
-> 172.20.86.109:6783    pending     fastdp 0a:f8:96:9e:41:e2(ip-172-20-86-109.us-west-2.compute.internal) mtu=8912
-> 172.20.43.55:6783     pending     none   5a:74:54:93:4d:54(ip-172-20-43-55.us-west-2.compute.internal)
<- 172.20.72.14:43666    pending     none   2a:13:fc:30:47:70(ip-172-20-72-14.us-west-2.compute.internal)
-> 172.20.67.185:6783    pending     none   be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal)
-> 172.20.58.161:6783    pending     none   a6:c2:14:b8:5c:16(ip-172-20-58-161.us-west-2.compute.internal)
-> 172.20.38.65:6783     pending     fastdp 2a:bf:31:9e:5f:f8(ip-172-20-38-65.us-west-2.compute.internal) mtu=8912
-> 172.20.91.217:6783    pending     none   d2:94:b3:b8:26:0f(ip-172-20-91-217.us-west-2.compute.internal)
-> 172.20.92.29:6783     pending     fastdp d6:cf:a2:05:a9:36(ip-172-20-92-29.us-west-2.compute.internal) mtu=8912
-> 172.20.80.54:6783     pending     none   ae:35:8b:48:fa:66(ip-172-20-80-54.us-west-2.compute.internal)
<- 172.20.56.168:53665   established sleeve 36:df:1b:90:26:73(ip-172-20-56-168.us-west-2.compute.internal) mtu=772
-> 172.20.34.53:6783     pending     none   c6:c2:20:d3:ed:31(ip-172-20-34-53.us-west-2.compute.internal)
-> 172.20.66.194:6783    pending     fastdp be:4e:67:bc:a4:f7(ip-172-20-66-194.us-west-2.compute.internal) mtu=8912
-> 172.20.51.54:6783     pending     fastdp 9e:bf:f4:6b:0c:0d(ip-172-20-51-54.us-west-2.compute.internal) mtu=8912
-> 172.20.69.1:6783      pending     fastdp 06:fa:8c:d2:4b:f8(ip-172-20-69-1.us-west-2.compute.internal) mtu=8912
<- 172.20.51.50:64848    established sleeve ce:1d:8b:ed:22:fd(ip-172-20-51-50.us-west-2.compute.internal) mtu=1431
<- 172.20.68.115:14661   pending     fastdp 7a:78:54:4d:6f:e4(ip-172-20-68-115.us-west-2.compute.internal) mtu=8912
<- 172.20.33.174:27282   pending     fastdp f2:55:2e:ae:16:7c(ip-172-20-33-174.us-west-2.compute.internal) mtu=8912
-> 172.20.40.229:6783    pending     fastdp 5e:37:6f:54:38:49(ip-172-20-40-229.us-west-2.compute.internal) mtu=8912
-> 172.20.59.28:6783     pending     none   3a:39:14:c8:37:d7(ip-172-20-59-28.us-west-2.compute.internal)
-> 172.20.37.249:6783    pending     fastdp 12:5c:30:66:8d:04(ip-172-20-37-249.us-west-2.compute.internal) mtu=8912
-> 172.20.68.115:6783    pending     none   7a:78:54:4d:6f:e4(ip-172-20-68-115.us-west-2.compute.internal)
-> 172.20.71.159:6783    pending     none   b6:af:8e:99:8a:40(ip-172-20-71-159.us-west-2.compute.internal)
<- 172.20.67.198:23705   pending     fastdp 0a:49:a7:d4:a8:c0(ip-172-20-67-198.us-west-2.compute.internal) mtu=8912
-> 172.20.93.111:6783    pending     none   2a:8f:3f:98:65:0a(ip-172-20-93-111.us-west-2.compute.internal)
-> 172.20.52.91:6783     pending     none   82:c7:cd:95:ff:04(ip-172-20-52-91.us-west-2.compute.internal)
-> 172.20.32.244:6783    pending     fastdp 4e:b1:b9:09:91:f9(ip-172-20-32-244.us-west-2.compute.internal) mtu=8912
-> 172.20.72.201:6783    pending     fastdp 62:ed:5a:87:25:41(ip-172-20-72-201.us-west-2.compute.internal) mtu=8912
-> 172.20.45.20:6783     pending     fastdp de:88:b1:a6:64:55(ip-172-20-45-20.us-west-2.compute.internal) mtu=8912
-> 172.20.37.58:6783     pending     fastdp fa:80:e3:32:d1:4b(ip-172-20-37-58.us-west-2.compute.internal) mtu=8912
<- 172.20.73.125:33893   established sleeve c6:43:d5:5d:dd:06(ip-172-20-73-125.us-west-2.compute.internal) mtu=1438
-> 172.20.79.83:6783     pending     none   96:f7:2d:fb:c7:06(ip-172-20-79-83.us-west-2.compute.internal)
-> 172.20.67.171:6783    pending     fastdp 12:ed:9f:34:22:a2(ip-172-20-67-171.us-west-2.compute.internal) mtu=8912
-> 172.20.59.82:6783     pending     fastdp fe:0f:f8:5d:57:b8(ip-172-20-59-82.us-west-2.compute.internal) mtu=8912
-> 172.20.65.153:6783    pending     fastdp 96:88:8c:a1:9f:e7(ip-172-20-65-153.us-west-2.compute.internal) mtu=8912
-> 172.20.35.115:6783    pending     fastdp 16:b8:84:52:3f:bf(ip-172-20-35-115.us-west-2.compute.internal) mtu=8912
-> 172.20.46.102:6783    pending     fastdp ae:aa:1d:11:73:fa(ip-172-20-46-102.us-west-2.compute.internal) mtu=8912
-> 172.20.86.131:6783    pending     none   32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal)
-> 172.20.77.20:6783     pending     none   22:6f:ff:85:f3:f0(ip-172-20-77-20.us-west-2.compute.internal)
-> 172.20.43.222:6783    pending     fastdp 12:02:67:09:34:4b(ip-172-20-43-222.us-west-2.compute.internal) mtu=8912
-> 172.20.39.201:6783    pending     fastdp 7e:3a:93:14:5a:53(ip-172-20-39-201.us-west-2.compute.internal) mtu=8912
<- 172.20.37.42:59907    pending     none   1a:ab:4d:3f:69:ae(ip-172-20-37-42.us-west-2.compute.internal)
-> 172.20.58.244:6783    pending     none   fe:09:43:ae:22:7f(ip-172-20-58-244.us-west-2.compute.internal)
<- 172.20.67.191:25665   pending     none   ea:98:66:1d:ec:4a(ip-172-20-67-191.us-west-2.compute.internal)
-> 172.20.64.9:6783      pending     none   8e:ce:b7:0c:d7:20(ip-172-20-64-9.us-west-2.compute.internal)
-> 172.20.38.31:6783     established sleeve ea:02:5d:c9:ab:d6(ip-172-20-38-31.us-west-2.compute.internal) mtu=1438
-> 172.20.84.251:6783    pending     none   e6:f6:63:ff:68:8e(ip-172-20-84-251.us-west-2.compute.internal)
-> 172.20.36.181:6783    pending     fastdp 4e:ed:6a:b6:29:4d(ip-172-20-36-181.us-west-2.compute.internal) mtu=8912
-> 172.20.35.131:6783    established sleeve 4e:4b:69:ef:6a:2c(ip-172-20-35-131.us-west-2.compute.internal) mtu=1438
-> 172.20.40.58:6783     pending     fastdp 96:ea:bf:a7:58:1e(ip-172-20-40-58.us-west-2.compute.internal) mtu=8912
-> 172.20.62.224:6783    pending     none   ba:55:63:2d:fd:9d(ip-172-20-62-224.us-west-2.compute.internal)
<- 172.20.85.253:39844   pending     fastdp ee:5a:bf:7b:67:40(ip-172-20-85-253.us-west-2.compute.internal) mtu=8912
-> 172.20.57.191:6783    pending     fastdp a2:8a:5d:f0:1f:8b(ip-172-20-57-191.us-west-2.compute.internal) mtu=8912
-> 172.20.47.37:6783     pending     fastdp f6:b2:35:a2:ab:82(ip-172-20-47-37.us-west-2.compute.internal) mtu=8912
-> 172.20.66.137:6783    pending     fastdp 0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal) mtu=8912
-> 172.20.77.93:6783     pending     fastdp 6a:46:98:98:cd:06(ip-172-20-77-93.us-west-2.compute.internal) mtu=8912
<- 172.20.74.85:41471    established sleeve d2:1e:e9:12:2e:32(ip-172-20-74-85.us-west-2.compute.internal) mtu=1438
-> 172.20.92.116:6783    pending     fastdp da:ff:b7:d2:c4:b2(ip-172-20-92-116.us-west-2.compute.internal) mtu=8912
-> 172.20.79.12:6783     pending     fastdp ba:c1:4f:69:a5:a8(ip-172-20-79-12.us-west-2.compute.internal) mtu=8912
-> 172.20.36.4:6783      pending     none   12:68:e0:d3:c3:5a(ip-172-20-36-4.us-west-2.compute.internal)
-> 172.20.78.127:6783    pending     fastdp ca:d0:fb:c5:a9:bd(ip-172-20-78-127.us-west-2.compute.internal) mtu=8912
-> 172.20.59.223:6783    pending     fastdp da:91:c1:04:0a:a0(ip-172-20-59-223.us-west-2.compute.internal) mtu=8912
<- 172.20.74.20:31849    pending     none   a6:85:f5:51:66:75(ip-172-20-74-20.us-west-2.compute.internal)
-> 172.20.53.199:6783    pending     none   c2:63:02:67:cf:8b(ip-172-20-53-199.us-west-2.compute.internal)
-> 172.20.45.113:6783    pending     fastdp 22:b6:8e:b1:a0:e7(ip-172-20-45-113.us-west-2.compute.internal) mtu=8912
-> 172.20.32.39:6783     pending     fastdp 3e:24:61:11:d3:f8(ip-172-20-32-39.us-west-2.compute.internal) mtu=8912
-> 172.20.52.46:6783     pending     none   22:a6:78:7c:84:cc(ip-172-20-52-46.us-west-2.compute.internal)
-> 172.20.92.218:6783    pending     fastdp da:c3:5d:e1:cf:9f(ip-172-20-92-218.us-west-2.compute.internal) mtu=8912
-> 172.20.45.164:6783    pending     none   1e:e5:2a:25:7e:90(ip-172-20-45-164.us-west-2.compute.internal)
-> 172.20.73.47:6783     pending     fastdp 0e:76:c7:72:d6:3d(ip-172-20-73-47.us-west-2.compute.internal) mtu=8912
<- 172.20.46.216:50066   pending     fastdp 42:5b:69:b4:24:2a(ip-172-20-46-216.us-west-2.compute.internal) mtu=8912
-> 172.20.32.55:6783     pending     none   3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)
-> 172.20.66.15:6783     pending     none   7a:b6:f3:0a:f5:ba(ip-172-20-66-15.us-west-2.compute.internal)
-> 172.20.86.182:6783    pending     fastdp 06:8c:3e:6f:7d:5f(ip-172-20-86-182.us-west-2.compute.internal) mtu=8912
-> 172.20.47.161:6783    pending     none   aa:4a:49:4f:31:31(ip-172-20-47-161.us-west-2.compute.internal)
-> 172.20.48.250:6783    pending     none   9e:f1:26:40:ef:10(ip-172-20-48-250.us-west-2.compute.internal)
-> 172.20.33.174:6783    pending     none   f2:55:2e:ae:16:7c(ip-172-20-33-174.us-west-2.compute.internal)
-> 172.20.43.192:6783    pending     none   ca:37:ff:0d:bf:b3(ip-172-20-43-192.us-west-2.compute.internal)
-> 172.20.55.52:6783     pending     none   46:a7:fc:4b:df:ff(ip-172-20-55-52.us-west-2.compute.internal)
-> 172.20.82.117:6783    pending     none   c2:9e:94:1a:90:fe(ip-172-20-82-117.us-west-2.compute.internal)
<- 172.20.47.226:58163   established sleeve 3a:f4:1d:b1:89:3f(ip-172-20-47-226.us-west-2.compute.internal) mtu=1438
-> 172.20.77.34:6783     pending     fastdp 52:e5:80:09:dc:45(ip-172-20-77-34.us-west-2.compute.internal) mtu=8912
-> 172.20.70.25:6783     pending     fastdp be:db:3c:81:7e:95(ip-172-20-70-25.us-west-2.compute.internal) mtu=8912
-> 172.20.95.63:6783     pending     none   32:1e:9d:18:ec:f3(ip-172-20-95-63.us-west-2.compute.internal)
<- 172.20.48.130:23691   pending     fastdp de:92:89:48:4d:52(ip-172-20-48-130.us-west-2.compute.internal) mtu=8912
-> 172.20.41.189:6783    pending     none   e6:45:59:5d:37:c7(ip-172-20-41-189.us-west-2.compute.internal)
-> 172.20.92.225:6783    pending     none   56:4f:6f:fd:45:60(ip-172-20-92-225.us-west-2.compute.internal)
-> 172.20.84.48:6783     pending     none   ba:51:49:77:7f:6f(ip-172-20-84-48.us-west-2.compute.internal)
-> 172.20.90.60:6783     pending     fastdp aa:fe:a6:2c:97:d5(ip-172-20-90-60.us-west-2.compute.internal) mtu=8912
-> 172.20.50.147:6783    pending     fastdp be:0c:9e:f9:b1:ac(ip-172-20-50-147.us-west-2.compute.internal) mtu=8912
<- 172.20.41.74:53829    established sleeve 26:e0:af:cc:c9:b4(ip-172-20-41-74.us-west-2.compute.internal) mtu=1438
-> 172.20.57.134:6783    retrying    no working forwarders to 26:27:46:6d:7f:05(ip-172-20-57-134.us-west-2.compute.internal)
-> 172.20.49.98:6783     retrying    Multiple connections to 56:f2:7a:ba:82:ea(ip-172-20-49-98.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.77.159:6783    retrying    no working forwarders to ae:7e:5a:46:f9:94(ip-172-20-77-159.us-west-2.compute.internal)
-> 172.20.84.170:6783    retrying    read tcp4 172.20.83.113:36800->172.20.84.170:6783: i/o timeout
-> 172.20.57.88:6783     retrying    read tcp4 172.20.83.113:44699->172.20.57.88:6783: i/o timeout
-> 172.20.61.185:6783    retrying    no working forwarders to 26:db:c4:08:b4:b8(ip-172-20-61-185.us-west-2.compute.internal)
-> 172.20.74.37:6783     retrying    no working forwarders to b6:58:9d:e7:c3:69(ip-172-20-74-37.us-west-2.compute.internal)
-> 172.20.46.140:6783    retrying    no working forwarders to ca:10:cb:b1:bb:96(ip-172-20-46-140.us-west-2.compute.internal)
-> 172.20.42.113:6783    retrying    read tcp4 172.20.83.113:24184->172.20.42.113:6783: i/o timeout
-> 172.20.92.199:6783    retrying    no working forwarders to ca:1d:05:e3:52:31(ip-172-20-92-199.us-west-2.compute.internal)
-> 172.20.48.136:6783    retrying    read tcp4 172.20.83.113:28665->172.20.48.136:6783: i/o timeout
-> 172.20.74.252:6783    retrying    read tcp4 172.20.83.113:42920->172.20.74.252:6783: i/o timeout
-> 172.20.54.131:6783    retrying    Multiple connections to ba:7b:10:ba:8e:96(ip-172-20-54-131.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.32.57:6783     retrying    write tcp4 172.20.83.113:43773->172.20.32.57:6783: write: broken pipe
-> 172.20.39.242:6783    retrying    Multiple connections to 36:a7:de:7e:10:31(ip-172-20-39-242.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.71.163:6783    retrying    read tcp4 172.20.83.113:49690->172.20.71.163:6783: i/o timeout
-> 172.20.85.181:6783    retrying    write tcp4 172.20.83.113:41822->172.20.85.181:6783: write: broken pipe
-> 172.20.94.126:6783    retrying    no working forwarders to a2:45:ef:e4:8a:82(ip-172-20-94-126.us-west-2.compute.internal)
-> 172.20.63.252:6783    retrying    write tcp4 172.20.83.113:36174->172.20.63.252:6783: write: broken pipe
-> 172.20.81.26:6783     retrying    read tcp4 172.20.83.113:45127->172.20.81.26:6783: i/o timeout
-> 172.20.79.17:6783     retrying    no working forwarders to de:65:f6:50:a6:40(ip-172-20-79-17.us-west-2.compute.internal)
-> 172.20.68.233:6783    retrying    read tcp4 172.20.83.113:45218->172.20.68.233:6783: i/o timeout
-> 172.20.57.207:6783    failed      dial tcp4 :0->172.20.57.207:6783: connect: connection refused, retry: 2019-02-06 11:56:40.990686544 +0000 UTC m=+496.669563883
-> 172.20.53.188:6783    retrying    read tcp4 172.20.83.113:49380->172.20.53.188:6783: i/o timeout
-> 172.20.89.87:6783     retrying    write tcp4 172.20.83.113:12262->172.20.89.87:6783: write: broken pipe
-> 172.20.56.52:6783     retrying    no working forwarders to f2:cd:83:66:79:b4(ip-172-20-56-52.us-west-2.compute.internal)
-> 172.20.59.181:6783    retrying    no working forwarders to 8e:13:07:ef:d3:2d(ip-172-20-59-181.us-west-2.compute.internal)
-> 172.20.56.168:6783    retrying    write tcp4 172.20.83.113:44407->172.20.56.168:6783: write: broken pipe
-> 172.20.64.79:6783     retrying    write tcp4 172.20.83.113:24116->172.20.64.79:6783: write: broken pipe
-> 172.20.62.110:6783    retrying    read tcp4 172.20.83.113:64287->172.20.62.110:6783: i/o timeout
-> 172.20.49.107:6783    failed      dial tcp4 :0->172.20.49.107:6783: connect: connection refused, retry: 2019-02-06 11:53:26.414333267 +0000 UTC m=+302.093210632
-> 172.20.56.187:6783    retrying    read tcp4 172.20.83.113:53265->172.20.56.187:6783: i/o timeout
-> 172.20.35.55:6783     retrying    no working forwarders to be:01:39:2b:26:65(ip-172-20-35-55.us-west-2.compute.internal)
-> 172.20.57.50:6783     retrying    read tcp4 172.20.83.113:47644->172.20.57.50:6783: i/o timeout
-> 172.20.71.190:6783    retrying    read tcp4 172.20.83.113:61972->172.20.71.190:6783: i/o timeout
-> 172.20.42.79:6783     retrying    no working forwarders to 8a:4c:cb:ce:46:50(ip-172-20-42-79.us-west-2.compute.internal)
-> 172.20.64.212:6783    retrying    read tcp4 172.20.83.113:25892->172.20.64.212:6783: i/o timeout
-> 172.20.80.76:6783     retrying    no working forwarders to 4e:f7:73:dd:ed:3a(ip-172-20-80-76.us-west-2.compute.internal)
-> 172.20.32.84:6783     retrying    no working forwarders to fe:29:2c:f9:ed:66(ip-172-20-32-84.us-west-2.compute.internal)
-> 172.20.95.36:6783     retrying    read tcp4 172.20.83.113:31947->172.20.95.36:6783: read: connection reset by peer
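
A listing this long is easier to reason about as counts. The following is a minimal sketch, assuming the output above has been saved as text; it tallies the state column and, for retrying or failed connections, groups the error text by substrings that appear in this listing. The function name and the reason list are illustrative and not part of Weave Net.

```python
from collections import Counter

# Substrings copied from the error text in the listing above (illustrative grouping).
REASONS = [
    "no working forwarders",
    "Multiple connections",
    "i/o timeout",
    "broken pipe",
    "connection reset by peer",
    "connection refused",
]

def summarize_connections(text):
    """Return (states, reasons) counters for `weave status connections` output."""
    states, reasons = Counter(), Counter()
    for line in text.splitlines():
        # Columns: direction, address, state, remainder of the line.
        parts = line.split(None, 3)
        if len(parts) < 3 or parts[0] not in ("->", "<-"):
            continue
        state = parts[2]
        states[state] += 1
        if state in ("retrying", "failed") and len(parts) == 4:
            detail = parts[3]
            reason = next((r for r in REASONS if r in detail), "other")
            reasons[reason] += 1
    return states, reasons

if __name__ == "__main__":
    sample = "\n".join([
        "-> 172.20.33.99:6783     pending     none   8a:cd:59:87:41:6c(ip-172-20-33-99.us-west-2.compute.internal)",
        "<- 172.20.48.123:27466   established sleeve 9e:a2:5f:dc:a1:6d(ip-172-20-48-123.us-west-2.compute.internal) mtu=1437",
        "-> 172.20.57.134:6783    retrying    no working forwarders to 26:27:46:6d:7f:05(ip-172-20-57-134.us-west-2.compute.internal)",
        "-> 172.20.84.170:6783    retrying    read tcp4 172.20.83.113:36800->172.20.84.170:6783: i/o timeout",
    ])
    states, reasons = summarize_connections(sample)
    print(dict(states))
    print(dict(reasons))
```

Run against the full listing, this would show how many connections are stuck in pending versus established and which error dominates among the retries.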

Logs indicate various sorts of errors as well.

As I narrow down the root cause based on the symptoms I will refine the bug accordingly.

How to reproduce it?

Increase the CPU and memory requests to 500m and 500 MB so that the pods do not crash or get OOMKilled, increase the connection limit (CONN_LIMIT), and provision more than 150 nodes.

Versions:

$ weave version
2.5.1
$ kubectl version
v1.10.8

Logs:

$ kubectl logs -n kube-system <weave-net-pod> weave
019/02/06 11:59:04.996723 ->[172.20.89.63:11787] connection accepted
INFO: 2019/02/06 11:59:04.997252 Removed unreachable peer 6a:de:31:cf:24:2f(ip-172-20-74-252.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.208747 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
gc 436 @640.676s 14%: 0.11+499+0.068 ms clock, 0.22+403/197/0+0.13 ms cpu, 165->172->85 MB, 174 MB goal, 2 P
INFO: 2019/02/06 11:59:05.509295 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.601341 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.611366 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.612293 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.612741 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.617464 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.618458 ->[172.20.89.254:6783|82:33:3d:60:1a:2b(ip-172-20-89-254.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/02/06 11:59:05.619166 ->[172.20.54.141:6783|32:ba:e2:0e:aa:59(ip-172-20-54-141.us-west-2.compute.internal)]: connection deleted
INFO: 2019/02/06 11:59:05.696022 ->[172.20.81.32:41081] connection accepted
INFO: 2019/02/06 11:59:05.697484 ->[172.20.67.185:27325|be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.697554 ->[172.20.86.131:27966|32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.702059 ->[172.20.32.55:6783|3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/02/06 11:59:05.702147 overlay_switch ->[3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)] using fastdp
INFO: 2019/02/06 11:59:05.707288 ->[172.20.74.252:6783] attempting connection
INFO: 2019/02/06 11:59:05.710945 overlay_switch ->[62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)] using sleeve
INFO: 2019/02/06 11:59:05.711187 ->[172.20.51.128:62804|62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.83.113:6783->172.20.51.128:62804: i/o timeout
INFO: 2019/02/06 11:59:05.803459 overlay_switch ->[0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)] using sleeve
INFO: 2019/02/06 11:59:05.803517 ->[172.20.66.137:6783|0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.83.113:28846->172.20.66.137:6783: i/o timeout
INFO: 2019/02/06 11:59:05.803721 overlay_switch ->[62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/02/06 11:59:05.803819 overlay_switch ->[0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/02/06 11:59:05.904841 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.906845 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.907700 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.996835 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.006184 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.007340 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.016334 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.016894 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.099141 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.104001 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.104854 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.105366 Removed unreachable peer 6a:de:31:cf:24:2f()
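
The excerpt is dominated by repeated messages, so a per-message tally can make the pattern clearer. This is a rough sketch, assuming the pod log has been saved as text; it classifies lines by substrings seen in the excerpt above. The labels and the function name are illustrative only.

```python
from collections import Counter

# (substring from the log excerpt, illustrative label)
PATTERNS = [
    ("Removed unreachable peer", "peer_removed"),
    ("connection shutting down due to error", "connection_error"),
    ("timed out waiting for UDP heartbeat", "heartbeat_timeout"),
    ("connection accepted", "accepted"),
    ("connection added", "added"),
    ("connection deleted", "deleted"),
]

def tally_log(text):
    """Count log lines per message kind; unmatched non-empty lines count as 'other'."""
    counts = Counter()
    for line in text.splitlines():
        if not line.strip():
            continue
        for needle, label in PATTERNS:
            if needle in line:
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts

if __name__ == "__main__":
    sample = "\n".join([
        "INFO: 2019/02/06 11:59:05.509295 Removed unreachable peer 6a:de:31:cf:24:2f()",
        "INFO: 2019/02/06 11:59:05.601341 Removed unreachable peer 6a:de:31:cf:24:2f()",
        "INFO: 2019/02/06 11:59:05.696022 ->[172.20.81.32:41081] connection accepted",
    ])
    print(dict(tally_log(sample)))
```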
@veeshall

@murali-reddy We saw similar behavior after around 175 nodes. Are there any options to address this scenario?

@murali-reddy
Contributor Author

Unfortunately, there are no workarounds. Some of the scaling issues are being addressed in the 2.6 release.

@bboreham
Contributor

bboreham commented Jul 7, 2020

I'm going to close this as addressed in the many improvements between the date of filing and version 2.6.5.

@bboreham bboreham closed this as completed Jul 7, 2020