
[Flaky] Services should create endpoints for unready pods - Netfilter Bug #12

Open
aojea opened this issue Apr 22, 2024 · 27 comments · Fixed by kubernetes/kubernetes#125681

@aojea
Contributor

aojea commented Apr 22, 2024

It flakes on the network policy jobs https://testgrid.k8s.io/sig-network-gce#network-policies,%20google-gce

Analyzing one occurrence https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-network-policies/1782274345247182848

I can see DNS requests in the trace but never an attempt to connect to the Service, which may indicate that something is interfering with the DNS request or response

@aojea
Contributor Author

aojea commented Apr 23, 2024

I can reproduce the problem. It seems that when using curl it sometimes takes too long to answer and times out ... surprisingly, if I use ping or nslookup it replies fast ...

@aojea aojea self-assigned this Apr 23, 2024
@aojea
Contributor Author

aojea commented Apr 23, 2024

time nslookup tolerate-unready.services-5834.svc.cluster.local
Server:         10.0.0.10
Address:        10.0.0.10#53

Name:   tolerate-unready.services-5834.svc.cluster.local
Address: 10.0.35.116


real    0m0.389s
user    0m0.009s
sys     0m0.011s
bash-5.0# curl -w "time_namelookup: %{time_namelookup}\n" http://tolerate-unready.services-5834.svc.cluster.local:80/
curl: (7) Failed to connect to tolerate-unready.services-5834.svc.cluster.local port 80 after 5060 ms: Connection refused
time_namelookup: 5.059978
bash-5.0# curl -w "time_namelookup: %{time_namelookup}\n" http://tolerate-unready.services-5834.svc.cluster.local:80/
curl: (7) Failed to connect to tolerate-unready.services-5834.svc.cluster.local port 80 after 15025 ms: Connection refused
time_namelookup: 15.025116

It only happens with curl; nslookup works perfectly and fast
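
A couple of checks that help split name resolution from the connection (diagnostic sketches, not taken from the job logs; the address comes from the nslookup above):

# skip the search-list expansion by querying the fully qualified name (note the trailing dot)
curl -w "time_namelookup: %{time_namelookup}\n" http://tolerate-unready.services-5834.svc.cluster.local./
# take DNS out of the picture entirely by pinning the name to the resolved address
curl --resolve tolerate-unready.services-5834.svc.cluster.local:80:10.0.35.116 http://tolerate-unready.services-5834.svc.cluster.local/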

 curl --version
curl 7.79.1 (x86_64-alpine-linux-musl) libcurl/7.79.1 OpenSSL/1.1.1n zlib/1.2.12 nghttp2/1.41.0
Release-Date: 2021-09-22
Protocols: dict file ftp ftps gopher gophers http https imap imaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS HSTS HTTP2 HTTPS-proxy IPv6 Largefile libz NTLM NTLM_WB SSL TLS-SRP UnixSockets
cat /etc/resolv.conf
search services-5834.svc.cluster.local svc.cluster.local cluster.local us-central1-b.c.ipv6-project-379110.internal c.ipv6-project-379110.internal google.internal
nameserver 10.0.0.10
options ndots:5

cc: @thockin @BenTheElder

hmm, is it everything in this specific container that is failing DNS queries??

tcpdump inside the container captures the resolved DNS response, but curl does not seem to use it

    10.0.0.10.53 > 10.64.3.24.54585: 45327 NXDomain 0/1/0 (187)
13:47:05.465786 IP (tos 0x0, ttl 62, id 36466, offset 0, flags [DF], proto UDP (17), length 215)
    10.0.0.10.53 > 10.64.3.24.54585: 44862 NXDomain 0/1/0 (187)
13:47:05.466004 IP (tos 0x0, ttl 64, id 55265, offset 0, flags [DF], proto UDP (17), length 94)
    10.64.3.24.48233 > 10.0.0.10.53: 63599+ A? tolerate-unready.services-5834.svc.cluster.local. (66)
13:47:05.466463 IP (tos 0x0, ttl 64, id 55266, offset 0, flags [DF], proto UDP (17), length 94)
    10.64.3.24.48233 > 10.0.0.10.53: 63984+ AAAA? tolerate-unready.services-5834.svc.cluster.local. (66)
13:47:05.467909 IP (tos 0x0, ttl 62, id 36468, offset 0, flags [DF], proto UDP (17), length 158)
    10.0.0.10.53 > 10.64.3.24.48233: 63599*- 1/0/0 tolerate-unready.services-5834.svc.cluster.local. A 10.0.35.116 (130)
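
For reproduction, a capture like the one above can be taken from inside the client pod, assuming a tcpdump binary is available in the image (command is a sketch, not copied from the job; pod and namespace names are the ones from this run):

kubectl -n services-5834 exec execpod-wtzs2 -- tcpdump -nvvi eth0 udp port 53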

@aojea
Contributor Author

aojea commented Apr 23, 2024

hmm

[Tue Apr 23 13:51:19 2024] IPv4: martian source 10.64.3.4 from 10.64.3.24, on dev eth0
[Tue Apr 23 13:51:19 2024] ll header: 00000000: 42 01 0a 28 00 05 42 01 0a 28 00 01 08 00
default         dnsutils                                     1/1     Running   0               92m     10.64.3.16   kubernetes-minion-group-7kcf   <none>           <none>
kube-system     coredns-58cd89f5d-hq42m                      1/1     Running   0               5h3m    10.64.3.4    kubernetes-minion-group-7kcf   <none>           <none>
kube-system     konnectivity-agent-bqj72                     1/1     Running   0               5h3m    10.64.3.3    kubernetes-minion-group-7kcf   <none>           <none>
kube-system     node-problem-detector-bv8n7                  1/1     Running   0               6h47m   10.64.3.2    kubernetes-minion-group-7kcf   <none>           <none>
services-5834   execpod-wtzs2                                1/1     Running   0               20m     10.64.3.24   kubernetes-minion-group-7kcf   <none>           <none>

@aojea
Contributor Author

aojea commented Apr 23, 2024

heh, so my theory of being conservative and supporting iptables-legacy was a disaster; @danwinship asked me about it just yesterday ... it turns out that if I remove the iptables rule, DNS starts working perfectly

-A FORWARD -m conntrack --ctstate NEW -j NFQUEUE --queue-num 100
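
For anyone following along, locating and removing that rule looks roughly like this (a sketch; which binary applies depends on whether the rule lives in the legacy or nft backend):

iptables -L FORWARD -n --line-numbers                                       # find the NFQUEUE rule
iptables -D FORWARD -m conntrack --ctstate NEW -j NFQUEUE --queue-num 100   # delete it by specification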

There is a bug somehow with nfqueue that causes this behavior

setting net.ipv4.conf.all.rp_filter to 2 stops the martian errors but it still keeps failing ... better to ditch iptables-legacy
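
For reference, the sysctl tweak mentioned above looks like this (a sketch; the kernel uses the maximum of the "all" and per-interface values, so both may need changing):

sysctl -w net.ipv4.conf.all.rp_filter=2
sysctl -w net.ipv4.conf.eth0.rp_filter=2    # 0 = off, 1 = strict, 2 = loose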

@aojea
Contributor Author

aojea commented Apr 25, 2024

/close

as not planned, we'll use only nftables

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

/close

as not planned, we'll use only nftables

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Contributor Author

aojea commented Apr 25, 2024

/reopen

kubernetes/kubernetes#124532

@k8s-ci-robot k8s-ci-robot reopened this Apr 25, 2024
@k8s-ci-robot
Contributor

@aojea: Reopened this issue.

In response to this:

/reopen

kubernetes/kubernetes#124532

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aojea
Contributor Author

aojea commented Apr 30, 2024

it seems to be a netfilter problem unrelated to iptables/nftables

@aojea
Contributor Author

aojea commented May 4, 2024

I hate this flake, it is pretty weird

  1. it seems it only affects one test, the pod unready one ...
  2. when reproduced, it only affects the curl DNS request; other userspace DNS requests work fine
  3. last 12 runs in the periodic are completely green https://testgrid.k8s.io/sig-network-gce#network-policies,%20google-gce
  4. the presubmit is worse https://testgrid.k8s.io/sig-network-gce#presubmit-network-policies,%20google-gce&width=20

@danwinship

[Tue Apr 23 13:51:19 2024] IPv4: martian source 10.64.3.4 from 10.64.3.24, on dev eth0

so that suggests that nfqueue is somehow causing a packet to get moved to the wrong interface?

it seems it only affects one test, the pod unready one ...

Do any other tests in this job rely on service DNS? Most of them use IPs I think...

@aojea
Contributor Author

aojea commented May 5, 2024

It runs all the SIG Network DNS tests.
When I reproduced it, I connected to the pod and DNS worked well; it's only the curl-specific request.

The martian thing seems worth exploring; both pods are on the same host ... maybe it's how GCE implements pod aliases? I need to check that

@thockin

thockin commented May 6, 2024

IPv4: martian source 10.64.3.4 from 10.64.3.24, on dev eth0

Can we turn that error into comprehensible english?

For context: Kernel log line is https://github.com/torvalds/linux/blob/ee5b455b0adae9ecafb38b174c648c48f2a3c1a5/net/ipv4/route.c#L1775

IP 10.64.3.4 is daddr (coredns)
IP 10.64.3.24 is saddr (execpod)

Device name "eth0" has IP 10.64.3.4 inside the CoreDNS pod, so that must be where the error is triggering? But why would it think that is martian? eth0 should be set up as a /24 (or similar) subnet.

@aojea
Copy link
Contributor Author

aojea commented May 14, 2024

Captured a pcap of one failure; I'd appreciate it if someone could take a look at it ... the symptom is that DNS queries are too slow, so when the query goes through the whole search-domains dance it is practically impossible for it to resolve within 2 seconds (the curl timeout in the test)

traffic.pcap.zip

@thockin

thockin commented May 15, 2024

Not super helpful but a few notes:

It looks like it is issuing a lot of duplicate queries very quickly, or else the pcap is "seeing" things twice. Maybe multiple threads?

It does get a response @ 436, which is un-DNAT'ed @ 437

It looks like there are at least 2 DNS replicas in play: 10.64.1.10 and 10.64.3.4

@aojea
Contributor Author

aojea commented May 17, 2024

It looks like it is issuing a lot of duplicate queries very quickly, or else the pcap is "seeing" things twice. Maybe multiple threads?

it sees packets twice; note that some of them are the same packet before and after the DNAT (svcIP/podIP)

Interesting, a similar issue also involving CoreDNS pods: weaveworks/weave#3327

The resolv.conf here is like this

cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-central1-b.c.ipv6-project-379110.internal c.ipv6-project-379110.internal google.internal
nameserver 10.0.0.10
options ndots:5
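
For context on why the timeouts hurt so much here: the service name has only 4 dots, so with ndots:5 the resolver walks the whole search list (an A and AAAA query each time) before trying the name as-is, roughly like this (an illustration, not taken from the pcap):

tolerate-unready.services-6513.svc.cluster.local.default.svc.cluster.local   -> NXDOMAIN
tolerate-unready.services-6513.svc.cluster.local.svc.cluster.local           -> NXDOMAIN
tolerate-unready.services-6513.svc.cluster.local.cluster.local               -> NXDOMAIN
... (the GCE internal search domains follow) ...
tolerate-unready.services-6513.svc.cluster.local                             -> A record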

The repetitive queries are from the test; it tries to connect to the service and the DNS resolution continuously times out:

I0516 23:57:01.567511 10314 resource.go:361] Creating new exec pod
I0516 23:57:07.833367 10314 builder.go:121] Running '/workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.60.155 --kubeconfig=/workspace/.kube/config --namespace=services-6513 exec execpod-2g8f9 -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/'
I0516 23:57:10.375990 10314 builder.go:135] rc: 28
I0516 23:57:10.376053 10314 service.go:1804] expected un-ready endpoint for Service slow-terminating-unready-pod, stdout: , err error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.60.155 --kubeconfig=/workspace/.kube/config --namespace=services-6513 exec execpod-2g8f9 -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/:
Command stdout:

stderr:
+ curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/
command terminated with exit code 28

error:
exit status 28
I0516 23:57:10.376107 10314 builder.go:121] Running '/workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.60.155 --kubeconfig=/workspace/.kube/config --namespace=services-6513 exec execpod-2g8f9 -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/'
I0516 23:57:12.820856 10314 builder.go:135] rc: 28
I0516 23:57:12.820916 10314 service.go:1804] expected un-ready endpoint for Service slow-terminating-unready-pod, stdout: , err error running /workspace/kubernetes/platforms/linux/amd64/kubectl --server=https://34.82.60.155 --kubeconfig=/workspace/.kube/config --namespace=services-6513 exec execpod-2g8f9 -- /bin/sh -x -c curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/:
Command stdout:

stderr:
+ curl -q -s --connect-timeout 2 http://tolerate-unready.services-6513.svc.cluster.local:80/
command terminated with exit code 28

As you can see in the pcap, the one that succeeded was me connecting to the pod and performing an nslookup ... increasing the curl timeout would reduce flakiness, but taking more than 2 seconds to resolve a service name worries me

@aojea
Contributor Author

aojea commented May 17, 2024

hmm, that sounds like a red herring

The pcaps show there is no recursive query through the DNS search domains 🤔 Is it that the response is not being received? Or something with musl libc? Or ...

[screenshot of the pcap]

@aojea
Contributor Author

aojea commented May 17, 2024

The A and AAAA requests for the same record share the same source port; a possible reason could be that this interacts weirdly with the nftables logic, see opnsense/core#4708
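
One way to confirm that both queries really leave on the same source port (a diagnostic sketch; the interface name is an assumption):

# print the source address.port of outgoing DNS queries; repeated ports show the A/AAAA pair sharing a socket
tcpdump -lnni eth0 -c 10 'udp dst port 53' | awk '{print $3}'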

@thockin

thockin commented May 17, 2024

it sees packets twice; note that some of them are the same packet before and after the DNAT (svcIP/podIP)

The DNAT ones are easy because the IP changes. But there are a lot of what look like identical packets. I wonder if the capture is seeing it on both sides of the veth?

@aojea
Contributor Author

aojea commented May 18, 2024

There are also packets that are lost or duplicated

[screenshot of the pcap showing lost/duplicated packets]

Checking the captured headers shows that the packet appears to go out of and come back in through the eth0 device

 ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 42:01:0a:28:00:05 brd ff:ff:ff:ff:ff:ff

[screenshot of the captured packet headers]
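
A way to double-check the link-layer side (a sketch; run wherever the earlier captures were taken) is to have tcpdump print the Ethernet headers, so the source and destination MACs of each DNS packet are visible alongside the IPs:

# -e prints link-level headers for every captured frame
tcpdump -enni eth0 udp port 53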


@thockin

thockin commented May 20, 2024

How did you capture the data? The second image suggests two different contexts - Left is "to us" (i.e. inside the DNS pod) and right is "from us" (i.e. inside the client pod). Maybe the MAC changes because of the root netns routing? I've never looked at that behavior in detail.

@aojea
Contributor Author

aojea commented May 21, 2024

How did you capture the data? The second image suggests two different contexts - Left is "to us" (i.e. inside the DNS pod) and right is "from us" (i.e. inside the client pod). Maybe the MAC changes because of the root netns routing? I've never looked at that behavior in detail.

I have an internal cluster created with cluster-up.sh, and if you run the e2e test several times it eventually triggers the behavior. I modified the test timeouts to hold for hours so the environment is not wiped out ... ping me offline for credentials and more detailed instructions if you want to experiment and reproduce

@aojea
Contributor Author

aojea commented Jul 2, 2024

No more flakes; most probably a race between packets with the same tuple in the nfqueue/netfilter logic ... smells like a kernel bug

https://testgrid.k8s.io/sig-network-gce#network-policies,%20google-gce&width=20

@aojea
Contributor Author

aojea commented Aug 24, 2024

Reopening and filing a bug with netfilter; their bugzilla is down, so copying it here for reference

Title: nfqueue random output behavior with packets with same tuple

Description:

I was puzzled by this problem for a long time, first reported in #12 and now reported in kubernetes-sigs/kind#3713.

It seems I was able to narrow down the scenario. I will try to translate the Kubernetes constructs into namespaces and nodes to describe the scenario better.

2 nodes: N1 and N2

N1 contains two containers:

  • client C1 (10.244.1.3)
  • DNS server D1 (10.244.1.5)

N2 contains the second DNS server D2 (10.244.2.4)

One rule sends the packets to nfqueue in postrouting (the same has happened with the queue in other hooks before). We can assume the set matches the packet and the nfqueue userspace program always accepts the packet:

  chain postrouting {
           type filter hook postrouting priority srcnat - 5; policy accept;
           icmpv6 type { nd-neighbor-solicit, nd-neighbor-advert } accept
           meta skuid 0 accept
           ct state established,related accept
           ip saddr @podips-v4 queue flags bypass to 100 comment "process IPv4 traffic with network policy enforcement"
           ip daddr @podips-v4 queue flags bypass to 100 comment "process IPv4 traffic with network policy enforcement"
           ip6 saddr @podips-v6 queue flags bypass to 100 comment "process IPv6 traffic with network policy enforcement"
           ip6 daddr @podips-v6 queue flags bypass to 100 comment "process IPv6 traffic with network policy enforcement"
   }

The containerized DNS servers are abstracted via DNAT behind the virtual IP 10.96.0.10

meta l4proto udp ip daddr 10.96.0.10 udp dport 53 counter packets 0 bytes 0 jump KUBE-SVC-TCOU7JCQXEZGVUNU

  chain KUBE-SVC-TCOU7JCQXEZGVUNU {
           meta l4proto udp ip saddr != 10.244.0.0/16 ip daddr 10.96.0.10  udp dport 53 counter packets 0 bytes 0 jump KUBE-MARK-MASQ
            meta random & 2147483647 < 1073741824 counter packets 38 bytes 2280 jump KUBE-SEP-CEYPGFB7VCORONY3
            counter packets 32 bytes 1920 jump KUBE-SEP-RJHMR3QLYGJVBWVL
   }
   chain KUBE-SEP-CEYPGFB7VCORONY3 {
           ip saddr 10.244.1.5  counter packets 0 bytes 0 jump KUBE-MARK-MASQ
           meta l4proto udp   counter packets 38 bytes 2280 dnat to 10.244.1.5:53
   }

C1 sends a DNS request to the virtual IP 10.96.0.10; because of the happy-eyeballs behavior, it sends two packets with the same tuple, one for each record type (A and AAAA).
The symptom is that one of the packets does not come back ... see the tcpdump trace below: the packets go out at 22:49:07 but only the A answer comes back; the client retries the AAAA at 22:49:10 and this time it comes back.

22:49:07.632846 vetha5c90841 In  IP (tos 0x0, ttl 64, id 52468, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.96.0.10.53: 60169+ A? www.google.com. (32)
22:49:07.632909 vetha5c90841 In  IP (tos 0x0, ttl 64, id 52469, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.96.0.10.53: 60459+ AAAA? www.google.com. (32)
22:49:07.633080 veth271ea3e0 Out IP (tos 0x0, ttl 63, id 52468, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.244.1.5.53: 60169+ A? www.google.com. (32)
22:49:07.633210 eth0  Out IP (tos 0x0, ttl 63, id 52469, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.244.1.5.53: 60459+ AAAA? www.google.com. (32)
22:49:07.633352 eth0  In  IP (tos 0x0, ttl 62, id 52469, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.244.1.5.53: 60459+ AAAA? www.google.com. (32)
22:49:07.653981 veth271ea3e0 In  IP (tos 0x0, ttl 64, id 28750, offset 0, flags [DF], proto UDP (17), length 240)
    10.244.1.5.53 > 10.244.1.3.48199: 60169 6/0/0 www.google.com. A 172.217.218.104, www.google.com. A 172.217.218.99, www.google.com. A 172.217.218.106, www.google.com. A 172.217.218.147, www.google.com. A 172.217.218.105, www.google.com. A 172.217.218.103 (212)
22:49:07.654012 vetha5c90841 Out IP (tos 0x0, ttl 63, id 28750, offset 0, flags [DF], proto UDP (17), length 240)
    10.96.0.10.53 > 10.244.1.3.48199: 60169 6/0/0 www.google.com. A 172.217.218.104, www.google.com. A 172.217.218.99, www.google.com. A 172.217.218.106, www.google.com. A 172.217.218.147, www.google.com. A 172.217.218.105, www.google.com. A 172.217.218.103 (212)
22:49:10.135710 vetha5c90841 In  IP (tos 0x0, ttl 64, id 52470, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.96.0.10.53: 60459+ AAAA? www.google.com. (32)
22:49:10.135740 veth271ea3e0 Out IP (tos 0x0, ttl 63, id 52470, offset 0, flags [DF], proto UDP (17), length 60)
    10.244.1.3.48199 > 10.244.1.5.53: 60459+ AAAA? www.google.com. (32)
22:49:10.136635 veth271ea3e0 In  IP (tos 0x0, ttl 64, id 28842, offset 0, flags [DF], proto UDP (17), length 228)
    10.244.1.5.53 > 10.244.1.3.48199: 60459 4/0/0 www.google.com. AAAA 2a00:1450:4013:c08::6a, www.google.com. AAAA 2a00:1450:4013:c08::67, www.google.com. AAAA 2a00:1450:4013:c08::63, www.google.com. AAAA 2a00:1450:4013:c08::68 (200)
22:49:10.136669 vetha5c90841 Out IP (tos 0x0, ttl 63, id 28842, offset 0, flags [DF], proto UDP (17), length 228)
    10.96.0.10.53 > 10.244.1.3.48199: 60459 4/0/0 www.google.com. AAAA 2a00:1450:4013:c08::6a, www.google.com. AAAA 2a00:1450:4013:c08::67, www.google.com. AAAA 2a00:1450:4013:c08::63, www.google.com. AAAA 2a00:1450:4013:c08::68 (200)
^C
23 packets captured

When tracing the packets I could observe two different drop reasons, depending on the destination of the DNAT rule: if it is local, the packet is dropped with SKB_DROP_REASON_IP_RPFILTER; if it is on the other node, it is dropped with SKB_DROP_REASON_NEIGH_FAILED

0xffff9527290acb00 3 [(3178406)] kfree_skb_reason(SKB_DROP_REASON_IP_RPFILTER) 1289 netns=4026533244 mark=0x0 iface=52(eth0) proto=0x0800 mtu=1500 len=60 10.244.1.3:48199->10.244.1.5:53(udp)

and

3:24:37.411 0xffff9534c19a3d00 7 [(0)] kfree_skb_reason(SKB_DROP_REASON_NEIGH_FAILED) 1583194220087332 netns=4026533244 mark=0x0 iface=5(veth271ea3e0) proto=0x0800 mtu=1500 len=60 10.244.1.3:58611->10.244.2.4:53(udp)
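
(The drop traces above are in pwru-style format; a command roughly like the following produces them. The exact flags are an assumption, check pwru --help:)

# trace kernel drop reasons for DNS packets, printing skb metadata and the tuple
pwru --output-meta --output-tuple 'udp port 53'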

If I enable martian logging (net.ipv4.conf.all.log_martians=1), it also reports these packets as martians when the destination is on the same node:

[1581593.716839] IPv4: martian source 10.244.1.5 from 10.244.1.3, on dev eth0
[1581593.723848] ll header: 00000000: 02 42 c0 a8 08 05 02 42 c0 a8 08 03 08 00

An interesting detail is that this only seems to happen with DNS (2 packets with the same tuple) and when there is more than 1 replica behind the virtual IP (> 1 DNAT rule). When there is only 1 DNAT rule it does not happen; this is a fact.

Since the behavior is not deterministic but reproducible, it makes me think that there is some kind of race where the nfqueue system is not able to correctly handle the two packets with the same tuple on the return path, and one of them gets dropped ...

I would like some help on two fronts:

  • advice on the next steps to debug this further, or on how I can provide more information that can help the maintainers
  • advice on how to work around this problem temporarily
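
One temporary mitigation sketch (not something verified in this thread) is to never queue DNS at all, by accepting UDP/53 before the queue rules; the family and table below are placeholders that would need to match the real ruleset:

# sketch only: accept DNS traffic before it reaches the NFQUEUE verdict in the postrouting chain quoted above
nft insert rule <family> <table> postrouting udp dport 53 accept
nft insert rule <family> <table> postrouting udp sport 53 accept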

@aojea aojea changed the title [Flaky] Services should create endpoints for unready pods [Flaky] Services should create endpoints for unready pods - Netfilter Bug Sep 5, 2024