[Backport release-1.26] rke-canal pod is not running due to incompatible ipset protocol version #4215

Closed
rancherbot opened this issue May 12, 2023 · 1 comment
@rancherbot (Collaborator)

This is a backport issue for #4145, automatically created via rancherbot by @rbrtbnfgl

Original issue description:

Environmental Info:
RKE2 Version:
v1.26.4+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux k8s-agent16 5.15.0-70-generic #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

2 servers and 16 agents, all running Ubuntu 22.04

Describe the bug:

rke2-canal pods on some agents are not starting. The pod logs contain the following.

2023-04-27 12:01:53.719 [WARNING][2437501] felix/ipsets.go 319: Failed to resync with dataplane error=exit status 1 family="inet"
2023-04-27 12:01:53.752 [INFO][2437501] felix/ipsets.go 309: Retrying after an ipsets update failure... family="inet"
2023-04-27 12:01:53.753 [ERROR][2437501] felix/ipsets.go 569: Bad return code from 'ipset list'. error=exit status 1 family="inet" stderr="ipset v6.36: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6\nKernel and userspace incompatible: settype hash:net with revision 7 not supported by userspace.\n"

A similar issue was reported at projectcalico/calico#5011, but that one is said to occur only when kube-proxy runs in ipvs mode and not to affect clusters using the iptables proxy mode. I have confirmed that the proxy mode here is iptables; the kube-proxy pod logs below (and the check after them) show this.

I0424 19:10:54.248089       1 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
I0424 19:10:54.257660       1 node.go:163] Successfully retrieved node IP: 192.168.39.77
I0424 19:10:54.257687       1 server_others.go:109] "Detected node IP" address="192.168.39.77"
I0424 19:10:54.294553       1 server_others.go:176] "Using iptables Proxier"
I0424 19:10:54.294622       1 server_others.go:183] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0424 19:10:54.294646       1 server_others.go:184] "Creating dualStackProxier for iptables"
I0424 19:10:54.294680       1 server_others.go:465] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0424 19:10:54.294748       1 proxier.go:242] "Setting route_localnet=1 to allow node-ports on localhost; to change this either disable iptables.localhostNodePorts (--iptables-localhost-nodeports) or set nodePortAddresses (--nodeport-addresses) to filter loopback addresses"
I0424 19:10:54.295311       1 server.go:655] "Version info" version="v1.26.4+rke2r1"
I0424 19:10:54.295341       1 server.go:657] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0424 19:10:54.296172       1 config.go:226] "Starting endpoint slice config controller"
I0424 19:10:54.296195       1 shared_informer.go:270] Waiting for caches to sync for endpoint slice config
I0424 19:10:54.296235       1 config.go:444] "Starting node config controller"
I0424 19:10:54.296255       1 shared_informer.go:270] Waiting for caches to sync for node config
I0424 19:10:54.296254       1 config.go:317] "Starting service config controller"
I0424 19:10:54.296275       1 shared_informer.go:270] Waiting for caches to sync for service config
I0424 19:10:54.397015       1 shared_informer.go:277] Caches are synced for node config
I0424 19:10:54.397063       1 shared_informer.go:277] Caches are synced for endpoint slice config
I0424 19:10:54.397175       1 shared_informer.go:277] Caches are synced for service config
E0425 16:30:24.197986       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32666: use of closed network connection" service="istio-system/istio-ingressgateway"
E0425 16:30:24.198068       1 service_health.go:187] "Healthcheck closed" err="accept tcp [::]:32675: use of closed network connection" service="istio-system/istio-internal-ingressgateway"
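For reference, the proxy mode can also be confirmed without reading the full log; the pod name below is illustrative (kube-proxy static pods are named kube-proxy-<node-name>, as in the validation output further down).

# confirm which proxier kube-proxy selected (pod name is illustrative)
kubectl -n kube-system logs kube-proxy-k8s-agent16 | grep -i proxier
# expect a line such as: server_others.go:176] "Using iptables Proxier"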

Steps To Reproduce:

RKE2 was installed using the following steps:

sudo swapoff -a

hostnamectl set-hostname k8s-master01

# add the master node's host entry on every node
vi /etc/hosts
192.168.39.5 k8s-master01


# install kubectl on Debian-based distributions
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubectl

# network bridge sysctls
sudo tee -a /etc/sysctl.d/99-kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1 
net.bridge.bridge-nf-call-iptables = 1
EOF

cat >>/etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF

sysctl --system

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=latest sh -

# first server node 
systemctl enable rke2-server
systemctl start rke2-server

export KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin

cat /var/lib/rancher/rke2/server/node-token

# second server node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml
server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-server
systemctl start rke2-server

# agent node
mkdir -p /etc/rancher/rke2
vi /etc/rancher/rke2/config.yaml

server: https://192.168.39.2:9345
token: <TOKEN_FROM_THE_ABOVE_CAT_COMMAND>

systemctl enable rke2-agent
systemctl start rke2-agent

Expected behavior:
Running kubectl get pod -n kube-system should show all rke2-canal pods as Running.

Actual behavior:

Some of the rke2-canal pods are stuck at 1/2 Ready.
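A quick way to list the affected pods and the nodes they run on (filtering on the pod-name prefix; exact pod names vary per node):

kubectl -n kube-system get pods -o wide | grep rke2-canal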

Additional context / logs:

On the host:

# ipset version
ipset v7.15, protocol version: 7

In the calico-node container of the rke2-canal pod running on the same host:

sh-4.4# ipset version
ipset v6.36, protocol version: 6

Note that this behavior is observed on only one server node and one agent node; all other nodes are working fine. One thing both nodes have in common is that the output of ipset list contains sets with Revision: 7.
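A quick way to spot those sets from the host (Name and Type precede Revision in the ipset list output), and to compare the userspace ipset versions, is sketched below; the rke2-canal pod name is illustrative, and calico-node is the container named above.

# sets created with the newer revision, as seen from the host
ipset list | grep -B2 'Revision: 7'

# compare the host ipset with the one inside the calico-node container
ipset version
kubectl -n kube-system exec rke2-canal-xxxxx -c calico-node -- ipset version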

Output of ipset list from the problematic agent node:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x54a33cf9
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0xdd3fa3ae
Size in memory: 504
References: 0
Number of entries: 1
Members:
10.42.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 5
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x2d605c42
Size in memory: 360
References: 0
Number of entries: 4
Members:
192.168.39.77
10.42.180.64
127.0.0.1
127.0.0.0

Name: cali40all-vxlan-net
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x17e7d094
Size in memory: 1320
References: 0
Number of entries: 18
Members:
192.168.39.99
192.168.39.91
192.168.39.154
192.168.39.72
192.168.39.3
192.168.39.79
192.168.39.98
192.168.39.74
192.168.39.78
192.168.39.96
192.168.39.92
192.168.39.94
192.168.39.151
192.168.39.93
192.168.39.2
192.168.39.1
192.168.39.97
192.168.39.95

Output of ipset list from the node that is working fine:

Name: cali40all-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40masq-ipam-pools
Type: hash:net
Revision: 6
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16

Name: cali40this-host
Type: hash:ip
Revision: 4
Header: family inet hashsize 1024 maxelem 1048576
Size in memory: 360
References: 0
Number of entries: 4
Members:
10.42.30.0
127.0.0.1
192.168.39.96
127.0.0.0
@fmoral2 (Contributor) commented May 19, 2023

Validated on Version:

- rke2 version v1.26.4+dev.29430443 (2943044380981b5acdd53ed44e0136ae9f14199a)

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu

Cluster Configuration:
1 node

Config.yaml:

cni: canal
write-kubeconfig-mode: 644
token: test


Steps to validate the fix

  1. Install RKE2 at the commit version listed above
  2. Check the Calico image version (see the command sketch below and the describe output further down)
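The image can also be read from the daemonset spec instead of describing a pod; this assumes the canal daemonset is named rke2-canal (the pod names below suggest so).

kubectl -n kube-system get ds rke2-canal -o jsonpath='{.spec.template.spec.containers[*].image}'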

Validation Results:

rke2 version v1.26.4+dev.29430443 (2943044380981b5acdd53ed44e0136ae9f14199a)
~$ k get pods -A
NAMESPACE     NAME                                                                  READY   STATUS      RESTARTS   AGE
kube-system   cloud-controller-manager-ip-172-31-5-109.us-east-2.compute.internal   1/1     Running     0          2m59s
kube-system   etcd-ip-172-31-5-109.us-east-2.compute.internal                       1/1     Running     0          3m5s
kube-system   helm-install-rke2-canal-rzc7d                                         0/1     Completed   0          2m42s
kube-system   helm-install-rke2-coredns-k9x9w                                       0/1     Completed   0          2m42s
kube-system   helm-install-rke2-ingress-nginx-n59g4                                 1/1     Running     0          2m42s
kube-system   helm-install-rke2-metrics-server-g4k6f                                1/1     Running     0          2m42s
kube-system   helm-install-rke2-snapshot-controller-crd-bpbdp                       1/1     Running     0          2m42s
kube-system   helm-install-rke2-snapshot-controller-hxpzk                           1/1     Running     0          2m42s
kube-system   helm-install-rke2-snapshot-validation-webhook-zvbk4                   1/1     Running     0          2m42s
kube-system   kube-apiserver-ip-172-31-5-109.us-east-2.compute.internal             1/1     Running     0          3m3s
kube-system   kube-controller-manager-ip-172-31-5-109.us-east-2.compute.internal    1/1     Running     0          3m1s
kube-system   kube-proxy-ip-172-31-5-109.us-east-2.compute.internal                 1/1     Running     0          2m55s
kube-system   kube-scheduler-ip-172-31-5-109.us-east-2.compute.internal             1/1     Running     0          3m1s
kube-system   rke2-canal-n8rdk                                                      2/2     Running     0          2m17s
kube-system   rke2-coredns-rke2-coredns-autoscaler-597fb897d7-x9wvk                 1/1     Running     0          2m5s
kube-system   rke2-coredns-rke2-coredns-f6f4ff467-cpj2p                             0/1     Running     0          2m5s




~$ kubectl describe pod rke2-canal-n8rdk -n kube-system | grep "Image"

    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
    Image:         rancher/hardened-calico:v3.25.1-build20230512
    Image ID:      docker.io/rancher/hardened-calico@sha256:1f53576f9d9cd64887e3bda5714eb5e0b3bf082590549926c000f8faa8260fe8
    Image:         rancher/hardened-flannel:v0.21.3-build20230308
    Image ID:      docker.io/rancher/hardened-flannel@sha256:8f550617f90d4e7914873d3253b56fac047cbb05b6409410220d7cc0ba9af70c




fmoral2 closed this as completed May 19, 2023