
aws-node CNI unstably curl nodeport #591

Closed
williamyao1982 opened this issue Aug 20, 2019 · 7 comments
@williamyao1982

Dear Team,

We used kops to deploy a Kubernetes cluster on AWS with the aws-node CNI. The strange thing is that we have two worker nodes: one works fine, but on the other one curl against the NodePort does not succeed every time.

The only manual changes I made on both nodes were disabling SELinux and installing the CloudWatch agent.

For example (usually there is no response until the timeout, but sometimes it comes back with a 200):
/ # curl http://172.31.73.171:30093
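
To get a rough failure rate, a loop like the one below can be run from the same client container (just a sketch; the address and request count are this report's values, adjust as needed):

/ # fail=0; for i in $(seq 1 20); do curl -s -o /dev/null --max-time 5 http://172.31.73.171:30093 || fail=$((fail+1)); done; echo "failed $fail of 20"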

Below is the content of sysctls.out.
[root@ip-172-31-73-171 aws-routed-eni]# less sysctls.out
================== sysctls ==================
/proc/sys/net/ipv4/conf/all/rp_filter = 1
/proc/sys/net/ipv4/conf/default/rp_filter = 1
/proc/sys/net/ipv4/conf/eth0/rp_filter = 1
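
(For context: rp_filter = 1 is strict reverse-path filtering. With the aws-node CNI, pod traffic can arrive and leave over different ENIs, so strict filtering is one possible cause of dropped packets; whether that matters here is only a guess. The setting can be listed per interface, and switched to loose mode for a test, like this:)

[root@ip-172-31-73-171 aws-routed-eni]# sysctl -a 2>/dev/null | grep '\.rp_filter'
[root@ip-172-31-73-171 aws-routed-eni]# sysctl -w net.ipv4.conf.eth0.rp_filter=2   # loose mode, test only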

Regards,
William Yao

@williamyao1982
Author

Sorry, I forgot to mention the cluster version.
k8s version: 1.12.9
CNI version: amazon-k8s-cni:1.3.3

@williamyao1982
Author

Attached are the sysctl settings for the k8s worker node.
[root@ip-172-31-52-186 sysctl.d]# vi 99-k8s-general.conf

# Kubernetes Settings

vm.max_map_count = 262144

kernel.softlockup_panic = 1
kernel.softlockup_all_cpu_backtrace = 1

net.ipv4.ip_local_reserved_ports = 30000-32767

# Increase the number of connections
net.core.somaxconn = 32768

# Maximum Socket Receive Buffer
net.core.rmem_max = 16777216

# Default Socket Send Buffer
net.core.wmem_max = 16777216

# Increase the maximum total buffer-space allocatable
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216

# Increase the number of outstanding syn requests allowed
net.ipv4.tcp_max_syn_backlog = 8096

# For persistent HTTP connections
net.ipv4.tcp_slow_start_after_idle = 0

# Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks
net.ipv4.tcp_tw_reuse = 1

# Max number of packets that can be queued on interface input
# If kernel is receiving packets faster than can be processed
# this queue increases
net.core.netdev_max_backlog = 16384

# Increase size of file handles and inode cache
fs.file-max = 2097152

# Max number of inotify instances and watches for a user
# Since dockerd runs as a single user, the default instances value of 128 per user is too low
# e.g. uses of inotify: nginx ingress controller, kubectl logs -f
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 524288

# AWS settings
# Issue #23395
net.ipv4.neigh.default.gc_thresh1=0

# Prevent docker from changing iptables: kubernetes/kubernetes#40182
net.ipv4.ip_forward=1
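
(Standard sysctl usage, nothing specific to this issue: the file is picked up at boot, and can be reloaded immediately with either of the commands below.)

[root@ip-172-31-52-186 sysctl.d]# sysctl --system
[root@ip-172-31-52-186 sysctl.d]# sysctl -p /etc/sysctl.d/99-k8s-general.conf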

@williamyao1982
Author

I used tcpdump to capture the network packets. As shown below, the SYN is retransmitted repeatedly, but the connection finally times out.

17 4.310309 172.31.127.10 172.31.73.171 TCP 66 45072 > 30093 [ACK] Seq=84 Ack=240 Win=28032 Len=0 TSval=1179362178 TSecr=411319205
18 4.310353 172.31.127.10 172.31.73.171 TCP 66 45072 > 30093 [ACK] Seq=84 Ack=2212 Win=32000 Len=0 TSval=1179362178 TSecr=411319205
19 4.310561 172.31.127.10 172.31.73.171 TCP 66 45072 > 30093 [FIN, ACK] Seq=84 Ack=2212 Win=32000 Len=0 TSval=1179362178 TSecr=411319205
20 4.311943 172.31.127.10 172.31.73.171 TCP 66 45072 > 30093 [ACK] Seq=85 Ack=2213 Win=32000 Len=0 TSval=1179362180 TSecr=411319207
21 5.078943 172.31.127.10 172.31.73.171 TCP 74 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179362947 TSecr=0 WS=128
22 6.079667 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179363948 TSecr=0 WS=128
23 8.083661 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179365952 TSecr=0 WS=128
24 12.091657 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179369960 TSecr=0 WS=128
25 20.115661 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179377984 TSecr=0 WS=128
32 36.147661 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179394016 TSecr=0 WS=128
39 68.179665 172.31.127.10 172.31.73.171 TCP 74 [TCP Retransmission] 45074 > 30093 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=1179426048 TSecr=0 WS=128
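
It may also be worth confirming on the receiving node that kube-proxy has programmed the NodePort rules (generic checks; KUBE-NODEPORTS is the default chain name in kube-proxy's iptables mode):

[root@ip-172-31-73-171 aws-routed-eni]# iptables-save -t nat | grep 30093
[root@ip-172-31-73-171 aws-routed-eni]# iptables -t nat -L KUBE-NODEPORTS -n | grep 30093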

Any idea?

Regards,
William Yao

@mogren
Contributor

mogren commented Aug 21, 2019

Hi @williamyao1982, thanks for reporting the issue. What instance type are you using? Also, could you try a newer version of the CNI? Preferably v1.5.3, but if that shows the same issue, could you try v1.4.1?
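
To see which image is currently deployed, something like this works (the daemonset is normally named aws-node in kube-system):

kubectl -n kube-system get daemonset aws-node -o jsonpath='{.spec.template.spec.containers[0].image}'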

@williamyao1982
Author

Thanks @mogren. I am using the AWS China cloud, and the instance type is r5.xlarge. I can't try a new version, because it would be a risk for us if it failed. Any idea how to fix this kind of issue?

@mogren
Contributor

mogren commented Sep 27, 2019

Is this still an issue? Please try with the latest CNI.

Also, this could be related to https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
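
That article describes a conntrack insertion race during DNAT that shows up exactly as dropped SYNs with retransmissions. If conntrack-tools is installed on the affected node, a non-zero insert_failed counter is a quick indicator (generic check, not verified on this cluster):

conntrack -S | grep -o 'insert_failed=[0-9]*'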

@mogren closed this as completed Sep 27, 2019
@williamyao1982
Author

@mogren Thanks for your support. We have already switched the CNI to flannel.
