Cannot access application running on a pod of another Node #5164

Closed
csokafor opened this issue Jan 1, 2024 · 7 comments

@csokafor

csokafor commented Jan 1, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.28.5+rke2r1 (adcd936)
go version go1.20.12 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
Linux 5.4.17-2136.324.5.3.el8uek.x86_64 #2 SMP Tue Oct 10 12:43:39 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
Oracle Linux Server 8.8

Cluster Configuration:
3 server nodes, CNI: Canal

Describe the bug:
I deployed a pod on an RKE2 cluster. I can connect to the pod from the node that hosts it, but not from the other nodes. The pod runs on node app007 and its IP is 10.42.2.3.
Output of kubectl get pods -o wide:
prometheus-68dfb8ff68-vc2tg 1/1 Running 0 13h 10.42.2.3 app007 <none> <none>

From node app007, I can connect to the application running in the pod on port 9090:

$ nc -zv 10.42.2.3 9090
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.42.2.3:9090.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

From the other nodes, I can't connect to the application on port 9090:

$ nc -zv 10.42.2.3 9090
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.

I can ping the pod IP 10.42.2.3 from all the nodes.

Steps To Reproduce:

  • Install RKE2 air-gapped using the tarball method with the default Canal CNI.
  • Create a deployment with a single pod instance.

Expected behavior:

The application running on a pod should be accessible from any node on the cluster.

Additional context / logs:

UDP port 8472 is open on all nodes

$ nc -uzv 10.1.154.107 8472
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.1.154.107:8472.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 0 bytes received in 2.01 seconds.

I can also ping the pod IP 10.42.2.3 from other nodes and see the request and reply in tcpdump.

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
11:52:26.324251 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 98: 10.42.0.0 > 10.42.2.3: ICMP echo request, id 9, seq 1, length 64
11:52:26.324422 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 98: 10.42.2.3 > 10.42.0.0: ICMP echo reply, id 9, seq 1, length 64
@manuelbuil
Contributor

You can ping the pod from all nodes but you can't access its server on port 9090 from other nodes, right? Is it possible that a network policy is blocking the traffic? Network policies don't apply to traffic between a node and its own local pods, which could explain what you see.

Are you able to track where the TCP packet with dest-port 9090 gets dropped?

@csokafor
Author

csokafor commented Jan 2, 2024

I ran curl http://10.42.2.3:9090/ from one of the nodes and captured the traffic with the tcpdump command below.
sudo tcpdump -i any -n host 10.42.2.3 and port 9090

dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
13:12:14.724275 IP 10.42.0.0.46870 > 10.42.2.3.websm: Flags [S], seq 601263222, win 64860, options [mss 1410,sackOK,TS val 2029058051 ecr 0,nop,wscale 7], length 0
13:12:14.724666 IP 10.42.2.3.websm > 10.42.0.0.46870: Flags [S.], seq 2479595326, ack 601263223, win 64308, options [mss 1410,sackOK,TS val 3185599473 ecr 2029058051,nop,wscale 7], length 0
13:12:15.728310 IP 10.42.0.0.46870 > 10.42.2.3.websm: Flags [S], seq 601263222, win 64860, options [mss 1410,sackOK,TS val 2029059055 ecr 0,nop,wscale 7], length 0
13:12:15.728704 IP 10.42.2.3.websm > 10.42.0.0.46870: Flags [S.], seq 2479595326, ack 601263223, win 64308, options [mss 1410,sackOK,TS val 3185600477 ecr 2029058051,nop,wscale 7], length 0
13:12:16.776827 IP 10.42.2.3.websm > 10.42.0.0.46870: Flags [S.], seq 2479595326, ack 601263223, win 64308, options [mss 1410,sackOK,TS val 3185601526 ecr 2029058051,nop,wscale 7], length 0
13:12:17.776307 IP 10.42.0.0.46870 > 10.42.2.3.websm: Flags [S], seq 601263222, win 64860, options [mss 1410,sackOK,TS val 2029061103 ecr 0,nop,wscale 7], length 0
13:12:17.776693 IP 10.42.2.3.websm > 10.42.0.0.46870: Flags [S.], seq 2479595326, ack 601263223, win 64308, options [mss 1410,sackOK,TS val 3185602525 ecr 2029058051,nop,wscale 7], length 0
13:12:19.784739 IP 10.42.2.3.websm > 10.42.0.0.46870: Flags [S.], seq 2479595326, ack 601263223, win 64308, options [mss 1410,sackOK,TS val 3185604534 ecr 2029058051,nop,wscale 7], length 0
13:12:21.808301 IP 10.42.0.0.46870 > 10.42.2.3.websm: Flags [S], seq 601263222, win 64860, options [mss 1410,sackOK,TS val 2029065135 ecr 0,nop,wscale 7], length 0

The client sends a SYN to the pod IP 10.42.2.3 and a SYN-ACK comes back, but the connection is never established: the SYN and SYN-ACK keep being retransmitted and no final ACK appears.

How can I determine why the TCP connection is dropped or never established?
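
One rough way to summarize a capture like the one above is to count the handshake flags. The sketch below is not part of the original thread: `handshake_summary` is a hypothetical helper name, and it assumes default tcpdump text output on stdin.

```shell
# Count SYN, SYN-ACK and bare ACK packets in tcpdump text read from stdin.
# Hypothetical helper for illustration only.
# Example: sudo tcpdump -n -c 10 -i any host 10.42.2.3 and port 9090 | handshake_summary
handshake_summary() {
  awk '
    /Flags \[S\],/   { syn++ }      # client SYN
    /Flags \[S\.\],/ { synack++ }   # server SYN-ACK
    /Flags \[\.\],/  { ack++ }      # bare ACK, e.g. the third handshake step
    END {
      printf "syn=%d synack=%d ack=%d\n", syn, synack, ack
      if (synack > 0 && ack == 0)
        print "verdict: SYN-ACK seen but never acknowledged"
    }'
}
```

Fed the nine packets above, this should count four SYNs, five SYN-ACKs, and no bare ACK, which suggests the SYN-ACK reaches the client node but is never accepted by its TCP stack.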

@manuelbuil
Contributor

Can you verify whether the curl client is responding to the SYN-ACK? Run tcpdump on the client's calico interface.

Can you please show the output of kubectl get netpol -A?

@csokafor
Author

csokafor commented Jan 3, 2024

No network policies are defined. This is the output of kubectl get netpol -A, which is the same on all nodes:

$ kubectl get netpol -A
No resources found

This is the ip r output from the 3 nodes:

app005 $ ip r
default via 10.1.154.254 dev bond0 proto static metric 300
10.1.154.0/24 dev bond0 proto kernel scope link src 10.1.154.105 metric 300
10.42.0.3 dev calic18d2c00f6d scope link
10.42.0.8 dev calia0fc86af208 scope link
10.42.0.9 dev cali752247fb2e7 scope link
10.42.0.10 dev calif72e8e10641 scope link
10.42.0.12 dev cali025731e3507 scope link
10.42.0.13 dev cali82e3f3595d8 scope link
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.0/24 via 10.42.2.0 dev flannel.1 onlink
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
app006 $ ip r
default via 10.1.154.254 dev bond0 proto static metric 300
10.1.154.0/24 dev bond0 proto kernel scope link src 10.1.154.106 metric 300
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.1.2 dev cali7df80df9864 scope link
10.42.1.3 dev cali8817480c965 scope link
10.42.2.0/24 via 10.42.2.0 dev flannel.1 onlink
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
app007 $ ip r
default via 10.1.154.254 dev bond0 proto static metric 300
10.1.154.0/24 dev bond0 proto kernel scope link src 10.1.154.107 metric 300
10.42.0.0/24 via 10.42.0.0 dev flannel.1 onlink
10.42.1.0/24 via 10.42.1.0 dev flannel.1 onlink
10.42.2.2 dev cali68cf6a76fe0 scope link
10.42.2.3 dev calic6a1e5d0215 scope link
10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1
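
The routing tables show that each node reaches the other nodes' pod subnets through flannel.1, while local pods sit behind cali* interfaces. A minimal sketch for pulling out just the overlay routes follows; `overlay_routes` is a hypothetical helper, not an RKE2 or Canal tool, and it only parses `ip route` text from stdin.

```shell
# Print "destination gateway" for every route that leaves via the given device.
# Reads `ip route` text from stdin; the device name defaults to flannel.1.
# Hypothetical helper for illustration only.
# Example: ip route | overlay_routes flannel.1
overlay_routes() {
  awk -v dev="${1:-flannel.1}" '
    {
      gw = "-"; out = ""
      for (i = 1; i < NF; i++) {
        if ($i == "via") gw  = $(i + 1)   # next hop, if any
        if ($i == "dev") out = $(i + 1)   # outgoing interface
      }
      if (out == dev) print $1, gw
    }'
}
```

On app005 this would list 10.42.1.0/24 and 10.42.2.0/24, so traffic to 10.42.2.3 leaves through flannel.1 and that is the interface to capture on.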

The verbose curl output is shown below.

app005 $ curl -vvv http://10.42.2.3:9090/
*   Trying 10.42.2.3...
* TCP_NODELAY set
* connect to 10.42.2.3 port 9090 failed: Connection timed out
* Failed to connect to 10.42.2.3 port 9090: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 10.42.2.3 port 9090: Connection timed out

The request was sent from node app005 to pod 10.42.2.3 through the flannel interface; tcpdump on the cali interfaces captured nothing for this request.
This is the tcpdump on the flannel interface of app005.

app005 $ sudo tcpdump -vv -eni flannel.1 host 10.42.2.3 and port 9090
dropped privs to tcpdump
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:03:54.761573 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51446, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x1685 (incorrect -> 0x147c), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125758088 ecr 0,nop,wscale 7], length 0
16:03:54.762245 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x6b7b), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282299511 ecr 2125758088,nop,wscale 7], length 0
16:03:55.784894 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x677c), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282300534 ecr 2125758088,nop,wscale 7], length 0
16:03:55.824303 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51447, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x1685 (incorrect -> 0x1055), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125759151 ecr 0,nop,wscale 7], length 0
16:03:55.824699 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x6755), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282300573 ecr 2125758088,nop,wscale 7], length 0
16:03:57.832900 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x5f7c), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282302582 ecr 2125758088,nop,wscale 7], length 0
16:03:57.872303 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51448, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x1685 (incorrect -> 0x0855), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125761199 ecr 0,nop,wscale 7], length 0
16:03:57.872672 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x5f55), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282302621 ecr 2125758088,nop,wscale 7], length 0

This is the flannel interface tcpdump of the node hosting the pod.

app007 $ sudo tcpdump -vv -eni flannel.1 host 10.42.2.3 and port 9090
dropped privs to tcpdump
tcpdump: listening on flannel.1, link-type EN10MB (Ethernet), capture size 262144 bytes
16:03:54.761758 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51446, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x147c (correct), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125758088 ecr 0,nop,wscale 7], length 0
16:03:54.762238 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x6b7b), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282299511 ecr 2125758088,nop,wscale 7], length 0
16:03:55.784816 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x677c), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282300534 ecr 2125758088,nop,wscale 7], length 0
16:03:55.824528 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51447, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x1055 (correct), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125759151 ecr 0,nop,wscale 7], length 0
16:03:55.824663 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x6755), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282300573 ecr 2125758088,nop,wscale 7], length 0
16:03:57.832797 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x5f7c), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282302582 ecr 2125758088,nop,wscale 7], length 0
16:03:57.872520 3e:c3:a1:5e:c7:d7 > b6:8d:22:bf:cd:54, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 51448, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.0.0.35904 > 10.42.2.3.websm: Flags [S], cksum 0x0855 (correct), seq 1606685503, win 64860, options [mss 1410,sackOK,TS val 2125761199 ecr 0,nop,wscale 7], length 0
16:03:57.872678 b6:8d:22:bf:cd:54 > 3e:c3:a1:5e:c7:d7, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.2.3.websm > 10.42.0.0.35904: Flags [S.], cksum 0x1685 (incorrect -> 0x5f55), seq 214691888, ack 1606685504, win 64308, options [mss 1410,sackOK,TS val 3282302621 ecr 2125758088,nop,wscale 7], length 0

@manuelbuil
Contributor

That incorrect checksum makes me suspect you are hitting this issue: #1541 (comment). Try disabling tx-checksum-ip-generic on the flannel VXLAN interface.
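
To see at a glance which side of the conversation carries bad checksums, the -vv captures can be tallied by source address. This is a sketch under assumptions: `cksum_tally` is a hypothetical helper, and it expects `tcpdump -vv` text on stdin. Keep in mind that packets captured on the sending host often show an incorrect checksum simply because offload fills the checksum in later, so the capture on the receiving side is usually the more telling one.

```shell
# Tally checksum verdicts per source address from `tcpdump -vv` text on stdin.
# Hypothetical helper for illustration only.
# Example: sudo tcpdump -vv -n -c 10 -i flannel.1 host 10.42.2.3 and port 9090 | cksum_tally
cksum_tally() {
  awk '
    /cksum/ {
      src = $1                      # e.g. 10.42.2.3.websm
      sub(/\.[^.]*$/, "", src)      # strip the trailing port or service name
      state = ($0 ~ /\(incorrect/) ? "incorrect" : "correct"
      count[src " " state]++
    }
    END { for (k in count) print k, count[k] }' | sort
}
```

In the app005 capture, the SYN-ACKs arriving from 10.42.2.3 are still marked incorrect, whereas the SYNs arriving on app007 are marked correct.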

@csokafor
Author

csokafor commented Jan 4, 2024

I disabled tx-checksum-ip-generic and the issue was resolved.

sudo ethtool --offload flannel.1 tx-checksum-ip-generic off

Thanks @manuelbuil
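
To confirm the setting took effect, the current offload state can be read back with `ethtool -k` (the show counterpart of `--offload`). The helper below is a hypothetical sketch that extracts one feature's state from that output; it assumes the usual `name: on|off` line format.

```shell
# Print the state (on/off) of one offload feature from `ethtool -k` text on stdin.
# Returns non-zero if the feature is not listed.
# Hypothetical helper for illustration only.
# Example: sudo ethtool -k flannel.1 | offload_state tx-checksum-ip-generic
offload_state() {
  awk -v feat="${1:-tx-checksum-ip-generic}" '
    { name = $1; sub(/:$/, "", name) }     # drop the trailing colon
    name == feat { print $2; found = 1 }
    END { if (!found) exit 1 }'
}
```

After the fix, piping `sudo ethtool -k flannel.1` into `offload_state` should print off. The ethtool change is a runtime setting, so it may need to be reapplied after a reboot or if the flannel.1 interface is recreated.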

@caroline-suse-rancher
Contributor

Closing :)
