ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

nonsense · 2020-01-13T14:34:24Z

I am trying to access a Kubernetes service through its ClusterIP, from a pod that is attached to its host's network and has access to DNS, with:

  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

However the host machine has no ip routes setup for the service CIDR, for example

➜  ~ k get services
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes       ClusterIP   100.64.0.1      <none>        443/TCP    25m
redis-headless   ClusterIP   None            <none>        6379/TCP   19m
redis-master     ClusterIP   100.64.63.204   <none>        6379/TCP   19m

➜  ~ k get pods -o wide
NAME                       READY   STATUS      RESTARTS   AGE   IP              NODE                                             NOMINATED NODE   READINESS GATES
redis-master-0             1/1     Running     0          18m   100.96.1.3      ip-172-20-39-241.eu-central-1.compute.internal   <none>           <none>

root@ip-172-20-39-241:/home/admin# ip route
default via 172.20.32.1 dev eth0
10.32.0.0/12 dev weave proto kernel scope link src 10.46.0.0
100.96.0.0/24 via 100.96.0.0 dev flannel.11 onlink
100.96.1.0/24 dev cni0 proto kernel scope link src 100.96.1.1
100.96.2.0/24 via 100.96.2.0 dev flannel.11 onlink
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.20.32.0/19 dev eth0 proto kernel scope link src 172.20.39.241

Expected Behavior

I expect that I should be able to reach services running on Kubernetes from the host machines, but I can only access headless services - those that return a pod IP.

The pod CIDR has ip routes setup, but the services CIDR doesn't.

Current Behavior

Can't access services through their ClusterIPs from host network.

Possible Solution

If I manually add an ip route to 100.64.0.0/16 via 100.96.1.1, ClusterIP are accessible. But this route is not there by default.

Your Environment

Flannel version: v0.11.0
kops version: Version 1.17.0-alpha.1 (git-501baf7e5)
Backend used (e.g. vxlan or udp): vxlan
Kubernetes version (if used):
Operating System and version:
Link to your project (optional):

The text was updated successfully, but these errors were encountered:

choryuidentify · 2020-02-06T08:13:39Z

Exactly same as my experience. My setup is Kubernetes 1.17.2 + Flannel.
when 'hostNetwork: true' is set, this behavior be appearing.

nonsense · 2020-02-06T10:43:47Z

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

MansM · 2020-02-15T12:01:18Z

issue on kubernetes/kubernetes:
kubernetes/kubernetes#87852

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense have an example?

MansM · 2020-02-15T12:31:59Z

using @mikebryant 's workaround did the trick for me now:
#1245 (comment)

rdxmb · 2020-02-21T08:02:33Z

Just changed to host-gw and realized then that the problem was much bigger than I supposed: There is a big routing problem with k8 1.17 and flannel with vxlan , which affects ClusterIP, NodePorts and even LoadBalancerIPs managed by metallb.

Changing to host-gw fixes all of them. I wonder why this is not fixed or at least documented in a very prominent way.

Here ist my report of response time of a minio-Service (in seconds) before and after changing. The checks run on the nodes itself.

rdxmb · 2020-02-21T08:10:24Z

On a second datacenter, the response time was even more than a minute. I had to increase the monitoring-timeout to get these values.

nonsense · 2020-02-21T14:37:07Z

issue on kubernetes/kubernetes:
kubernetes/kubernetes#87852

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense have an example?

Yes, here it is: https://github.com/ipfs/testground/blob/master/infra/k8s/sidecar.yaml#L23

Note that this won't work, unless you have one pod on every host (i.e. another DaemonSet), so that cni0 exists. I know this is a hack, but I don't have a better solution.

In our case the first pod we expect on every host is s3fs - https://github.com/ipfs/testground/blob/master/infra/k8s/kops-weave/s3bucket.yml

MansM · 2020-02-21T16:49:14Z

@nonsense I fixed it by changing the backend of flannel to host-gw instead of vxlan:

kubectl edit cm -n kube-system kube-flannel-cfg

replace vxlan with host-gw
save
not sure if needed, but I did it anyway: kubectl delete pods -l app=flannel -n kube-system

maybe this works for you as well

MansM · 2020-03-09T06:02:28Z

Setting up a new cluster with flannel and not able to get any communication to work. I tried the host-gw change
kubectl edit cm -n kube-system kube-flannel-cfg
replace vxlan with host-gw

save

not sure if needed, but I did it anyway: kubectl delete pods -l app=flannel -n kube-system

maybe this works for you as well
but the issue persists. Would there be additional changes required? This is just a basic cluster setup and flannel configuration, all from scratch.

If you have issues with all network traffic and not just reaching services from pods hostnetwork: true, you have some other issues

archever · 2020-03-09T07:42:08Z

the same problem to me.
try add a route to cni0 fixed for me:

ip r add 10.96.0.0/16 dev cni0

tobiasvdp · 2020-03-11T08:51:38Z

the 'host-gw' option is only possible to infrastructures that support layer2 interaction.
most cloud providers don't.

davesargrad · 2020-03-17T14:20:58Z

Hi. It turns out that host-gw fixed my problem as well: #1268. To me this is a critical bug somewhere in the vxlan based pipline.

Capitrium · 2020-03-23T20:48:22Z

I had similar issues after upgrading our cluster from 1.16.x to 1.17.x (specifically uswitch/kiam#378). Using host-gw is not an option for me as our cluster runs on AWS, but I was able to fix it by reverting kube-proxy back to 1.16.8.

I also can't reproduce this issue on our dev cluster after replacing kube-proxy with kube-router running in service-proxy mode (tested with v1.0.0-rc1).

Could this issue be caused by changes in kube-proxy?

mariusgrigoriu · 2020-03-31T19:15:55Z

Just curious, how many folks running into this issue are using hyperkube?

tkislan · 2020-03-31T19:18:50Z

I tried reverting from 1.17.3 to 1.16.8, but I was still experiencing the same problem.
Only way how to fix this is to have DaemonSet running, and call ip r add 10.96.0.0/12 dev cni0 on every Node to fix the routing .. after that, it starts to route correctly

LuckySB · 2020-04-11T19:38:07Z

try on node and on pod with hostnetwork:true (podnet 10.244.2.0/24)
coredns running on another node with podnet 10.244.1.0/24

without ip route add 10.96.0.0/16
ip packet sends to coredns pod with src ip 10.244.2.0
IP 10.244.2.0.31782 > 10.244.1.3.domain: 38996+ [1au] A? kubernetes.default. (59)

and tcpdump not show this packet on another side on vxlan tunnel
tcpdump -ni flanel.1

With route

10.96.0.0/16 dev cni0 scope link

src ip changed to addres from cni0, not a flanel.1 interface

4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 0a:af:85:2e:82:f5 brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
    link/ether ee:39:df:66:22:f3 brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.1/24 scope global cni0
       valid_lft forever preferred_lft forever

and acces to service ipnet works fine.

well
direct access to dns pod working
dig @10.244.1.8 kubernetes.default.svc.cluster.local
and tcpdump show udp request with 10.244.2.0 src address
but acces to cluster 10.96.0.10 ip not!

i try to remove iptables rule created by kube-proxe

iptables -t nat -D POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING

and i get answer from coredns


;; ANSWER SECTION:
kubernetes.default.svc.cluster.local. 30 IN A   10.96.0.1

also work with
iptables -t nat -I POSTROUTING 1 -o eth0 -j ACCEPT

Gacko · 2020-05-06T21:56:07Z

Sorry for being late to the party... I just installed a clean v1.17 cluster, no duplicate iptables routes in there. So it seems like they only occur after upgrading. Anyways the issue persists. I'll continue investigating...

zhangguanzhang · 2020-05-29T01:39:04Z

see this
kubernetes/kubernetes#88986 (comment)

kubealex · 2020-06-06T11:29:00Z

Just a side note, the issue doesn't happen on the node where the pods balanced by the service are deployed:

NAME                                        READY   STATUS      RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
ingress-nginx-admission-create-fppsm        0/1     Completed   0          26m   10.244.2.2   k8s-worker-0.k8s.lab   <none>           <none>
ingress-nginx-admission-patch-xnfcw         0/1     Completed   0          26m   10.244.2.3   k8s-worker-0.k8s.lab   <none>           <none>
ingress-nginx-controller-69fb496d7d-2k594   1/1     Running     0          26m   10.244.2.6   k8s-worker-0.k8s.lab   <none>           <none>

[kube@k8s-worker-0 ~]$ curl 10.100.76.252
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.17.10</center>
</body>
</html>

[kube@k8s-master-0 ~]$ curl 10.100.76.252
^C

malikbenkirane · 2020-06-25T19:23:20Z

Our workaround is to manually add the route to DNS through a DaemonSet as soon as there is at least one pod running on all workers (so that the cni0 interface appears).

@nonsense could you please provide another example manifest for this

Yes, here it is: https://github.com/ipfs/testground/blob/master/infra/k8s/sidecar.yaml#L23

Ends on 404

nonsense · 2020-06-25T19:44:55Z

@malikbenkirane change ipfs/testground to testground/infra - repo moved - https://github.com/testground/infra/blob/master/k8s/sidecar.yaml

malikbenkirane · 2020-06-25T22:23:18Z

@malikbenkirane change ipfs/testground to testground/infra - repo moved - https://github.com/testground/infra/blob/master/k8s/sidecar.yaml

Thanks, I like the idea. Though I've found using calico rather than flannel working for me. I just had set --flannel-backend=none and followed calico k3s steps changing pod cidr accordingly.

mohideen · 2020-06-29T22:48:58Z

I had the same issue on a HA cluster provisioned by kubeadm with RHEL7 nodes. Both the options (turning of tx-checksum-ip-generic / switching to host-gw from vxlan) worked. Settled with the host-gw option.

This did not affect a RHEL8 cluster provisioned by kubeadm (also that was not a HA cluster).

Gacko · 2020-07-23T12:16:28Z

I guess this can be closed since the related issues have been fixed in Kubernetes.

rdxmb · 2020-07-27T15:59:02Z

@Gacko could you link the issue/PR for that, please?

rafzei · 2020-07-30T23:24:23Z

@rdxmb this one: #92035 and changelog

rdxmb · 2020-07-31T07:48:02Z

@rafzei thanks 👍

muthu31kumar · 2020-09-23T14:54:02Z

+1

immanuelfodor · 2020-11-11T07:17:04Z

I've bumped into the same issue with an RKE v1.19.3 k8s cluster running on CentOS 8 with firewalld completely disabled. The CNI plugin is Canal which uses both Flannel and Calico. Only pods running with hostNetwork: true and ClusterFirstWithHostNet were affected, they couldn't get DNS resolution on nodes that weren't running a CoreDNS pod. As I had 3 nodes and my CoreDNS replica count was set to 2 by the autoscaler, only pods on the 3rd node were affected. As RKE doesn't support manual CoreDNS autoscaling parameters (open issue here: rancher/rke#2247), my solution was to explicitly set the Flannel backend to host-gw from an implicit vxlan in the RKE cluster.yml file. See the docs here: https://rancher.com/docs/rke/latest/en/config-options/add-ons/network-plugins/#canal-network-plug-in-options After that, I did an rke up to apply the changes but it did not have any effect at first, therefore I also needed to reboot all nodes to fix the issue. Now all pods with hostNetwork: true and ClusterFirstWithHostNet on all nodes are working fine.

 network:
   plugin: canal
-  options: {}
+  options:
+    # workaround to get hostnetworked pods DNS resolution working on nodes that don't have a CoreDNS replica running
+    # do the rke up then reboot all nodes to apply
+    # @see: https://github.com/coreos/flannel/issues/1243#issuecomment-589542796
+    # @see: https://rancher.com/docs/rke/latest/en/config-options/add-ons/network-plugins/
+    canal_flannel_backend_type: host-gw
   mtu: 0
   node_selector: {}
   update_strategy: null

Hitendraverma · 2021-07-06T10:41:02Z

I am also getting intermittent issue while running stateful sets in k8 with hostNetwork.
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

I got it resolved by following below.

Migrated my Kube-dns to Core-dns.
Adding Envoy for permanent fix of intermittent issue of DNS lookup

Also , you can temporarily fix this by running your DNS pod on the same node on which your Application pod is running.
Schedule DNS pod by node selector or by making your other nodes SchedulingDisabled.

legoguy1000 · 2022-11-06T22:08:20Z

I just upgraded our K8S bare metal cluster running on physical servers v1.23.13 to flannel v0.20.1 from v0.17.0 and are having this issue. My pods with hostNetwork: true can't connect to any other service via ClusterIPs. I fixed by adding the static route via the cni0 interface as suggested #1243 (comment).

stale · 2023-05-05T23:59:33Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

nonsense changed the title ~~ClusterIP services not accessible when using flannel from host machines~~ ClusterIP services not accessible when using flannel from host machines in Kubernetes Jan 13, 2020

nonsense changed the title ~~ClusterIP services not accessible when using flannel from host machines in Kubernetes~~ ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes Jan 13, 2020

This was referenced Jan 14, 2020

README.md for kops/weave/flannel/cnigenie testground/testground#323

Merged

sidecar on Kubernetes testground/testground#336

Merged

mariusgrigoriu mentioned this issue Feb 4, 2020

Super slow access to service IP from host (& host-networked pods) with Flannel CNI #1245

Closed

aojea mentioned this issue Feb 18, 2020

dnsPolicy in hostNetwork not working as expected kubernetes/kubernetes#87852

Closed

This was referenced Feb 18, 2020

Issue on k8s/kops 1.17 clusters uswitch/kiam#378

Closed

1.17 alpha versions causing regression for kiam? kubernetes/kops#8562

Closed

MansM mentioned this issue Feb 22, 2020

Document supported CNI onedr0p/k3s-homeops-ansible#12

Closed

johngmyers mentioned this issue Feb 22, 2020

Remove support for Canal and the vxlan Flannel backend kubernetes/kops#8614

Closed

gamer22026 mentioned this issue Mar 16, 2020

60 second delayed delivery of packet to pod #1268

Closed

davesargrad mentioned this issue Mar 17, 2020

Bare Metal K8S 63 Second Service Routing Delay - when accessing service via ClusterIP, or ExternalIP kubernetes/kubernetes#88986

Closed

nonsense mentioned this issue Apr 1, 2020

Redis starts with THP enabled. testground/testground#767

Closed

Miouge1 mentioned this issue Apr 11, 2020

Update Flannel manifests, install script and version (0.12) + fix tests scripts kubernetes-sigs/kubespray#5937

Merged

jawabuu mentioned this issue May 9, 2020

Getting Real Client IP with k3s k3s-io/k3s#1652

Closed

kuramal mentioned this issue May 15, 2020

Bugfix iptables --random-fully must supported by kernel kubernetes/kubernetes#91137

Closed

oskapt mentioned this issue May 30, 2020

DNS resolution fails with dnsPolicy: ClusterFirstWithHostNet and hostNetwork: true k3s-io/k3s#1827

Closed

heyjared mentioned this issue Jun 8, 2020

Nodes unable to connect to services whose pods are scheduled on other nodes k3s-io/k3s#1266

Closed

david-enli mentioned this issue Jul 6, 2020

problem encountered while using HostNetworkDNSPolicy kubernetes/kubernetes#92276

Closed

sboschman mentioned this issue Jul 31, 2020

Internal error occurred: failed calling webhook cert-manager/cert-manager#2918

Closed

dsupure mentioned this issue Sep 1, 2020

Not working in Openshift 4.5 purestorage/pso-csi#80

Closed

toschneck mentioned this issue Nov 30, 2020

DNS resolution of hosNetwork pods (e.g. Restric Backup Addon) kubermatic/kubeone#1178

Closed

pinfort mentioned this issue May 20, 2021

DNS not configured correctly on a Raspberry Pi cluster #1375

Closed

tokers mentioned this issue May 21, 2021

request help: hostNetwork apache/apisix-ingress-controller#478

Closed

leffeloket mentioned this issue Sep 7, 2022

[BUG] Cannot deploy Longhorn because metrics-server fails longhorn/longhorn#4503

Closed

stale bot added the wontfix label May 5, 2023

stale bot closed this as completed May 27, 2023

matofeder mentioned this issue Jul 19, 2024

[bug] unable to deploy K8s /K3s behind proxy and "BGP on the host" networks osism/issues#1067

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

nonsense commented Jan 13, 2020 •

edited

Loading

choryuidentify commented Feb 6, 2020

nonsense commented Feb 6, 2020

MansM commented Feb 15, 2020 •

edited

Loading

MansM commented Feb 15, 2020

rdxmb commented Feb 21, 2020 •

edited

Loading

rdxmb commented Feb 21, 2020

nonsense commented Feb 21, 2020

MansM commented Feb 21, 2020

MansM commented Mar 9, 2020

archever commented Mar 9, 2020

tobiasvdp commented Mar 11, 2020

davesargrad commented Mar 17, 2020

Capitrium commented Mar 23, 2020

mariusgrigoriu commented Mar 31, 2020

tkislan commented Mar 31, 2020

LuckySB commented Apr 11, 2020 •

edited

Loading

Gacko commented May 6, 2020

zhangguanzhang commented May 29, 2020

kubealex commented Jun 6, 2020 •

edited

Loading

malikbenkirane commented Jun 25, 2020 •

edited

Loading

nonsense commented Jun 25, 2020 •

edited

Loading

malikbenkirane commented Jun 25, 2020

mohideen commented Jun 29, 2020

Gacko commented Jul 23, 2020

rdxmb commented Jul 27, 2020

rafzei commented Jul 30, 2020

rdxmb commented Jul 31, 2020

muthu31kumar commented Sep 23, 2020

immanuelfodor commented Nov 11, 2020

Hitendraverma commented Jul 6, 2021

legoguy1000 commented Nov 6, 2022

stale bot commented May 5, 2023

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

ClusterIP services not accessible when using flannel CNI from host machines in Kubernetes #1243

Comments

nonsense commented Jan 13, 2020 • edited Loading

Expected Behavior

Current Behavior

Possible Solution

Your Environment

choryuidentify commented Feb 6, 2020

nonsense commented Feb 6, 2020

MansM commented Feb 15, 2020 • edited Loading

MansM commented Feb 15, 2020

rdxmb commented Feb 21, 2020 • edited Loading

rdxmb commented Feb 21, 2020

nonsense commented Feb 21, 2020

MansM commented Feb 21, 2020

MansM commented Mar 9, 2020

archever commented Mar 9, 2020

tobiasvdp commented Mar 11, 2020

davesargrad commented Mar 17, 2020

Capitrium commented Mar 23, 2020

mariusgrigoriu commented Mar 31, 2020

tkislan commented Mar 31, 2020

LuckySB commented Apr 11, 2020 • edited Loading

Gacko commented May 6, 2020

zhangguanzhang commented May 29, 2020

kubealex commented Jun 6, 2020 • edited Loading

malikbenkirane commented Jun 25, 2020 • edited Loading

nonsense commented Jun 25, 2020 • edited Loading

malikbenkirane commented Jun 25, 2020

mohideen commented Jun 29, 2020

Gacko commented Jul 23, 2020

rdxmb commented Jul 27, 2020

rafzei commented Jul 30, 2020

rdxmb commented Jul 31, 2020

muthu31kumar commented Sep 23, 2020

immanuelfodor commented Nov 11, 2020

Hitendraverma commented Jul 6, 2021

legoguy1000 commented Nov 6, 2022

stale bot commented May 5, 2023

nonsense commented Jan 13, 2020 •

edited

Loading

MansM commented Feb 15, 2020 •

edited

Loading

rdxmb commented Feb 21, 2020 •

edited

Loading

LuckySB commented Apr 11, 2020 •

edited

Loading

kubealex commented Jun 6, 2020 •

edited

Loading

malikbenkirane commented Jun 25, 2020 •

edited

Loading

nonsense commented Jun 25, 2020 •

edited

Loading