RKE2 Canal Pod Issue: Timeout in Creating Service Account Token #5328

Closed
klalafaryan opened this issue Jan 28, 2024 · 7 comments

@klalafaryan

klalafaryan commented Jan 28, 2024

Environmental Info:
RKE2 Version:
v1.26.12+rke2r1

Node(s) CPU architecture, OS, and Version:
Linux worker-1 6.5.0-15-generic #15~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 12 18:54:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
1 master and 4 workers

Describe the bug:
Canal pod is failing with the following error:

2024-01-25 19:40:31.616 [ERROR][1] cni-installer/ : Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout
2024-01-25 19:40:31.616 [FATAL][1] cni-installer/ : Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout

Steps To Reproduce:
We have installed RKE2 using the Ansible role available at https://github.com/lablabs/ansible-role-rke2. During this process, we did not apply any customizations to systemd, nor did we override any environment variables.

Unfortunately, reproducing the issue is not straightforward. However, I managed to replicate it by repeatedly restarting the worker node, as well as by restarting both the master and worker nodes simultaneously.

Expected behavior:
The Canal pod should create its virtual interfaces through the CNI plugin and then generate the canal service account token.

Actual behavior:
It appears that, in certain cases, Canal is unable to create the virtual interfaces that it needs before it can generate the canal service account token.

Additional context / logs:
The entire log of the failed canal pod:

kubectl logs -f pods/rke2-canal-4wj6m -n kube-system -c install-cni
2024-01-28 09:39:48.947 [INFO][1] cni-installer/<nil> <nil>: Running as a Kubernetes pod
2024-01-28 09:39:49.146 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
2024-01-28 09:39:49.146 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
2024-01-28 09:39:49.149 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bridge"
2024-01-28 09:39:49.149 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bridge
2024-01-28 09:39:49.170 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico"
2024-01-28 09:39:49.170 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
2024-01-28 09:39:49.191 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico-ipam"
2024-01-28 09:39:49.191 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
2024-01-28 09:39:49.196 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/dhcp"
2024-01-28 09:39:49.196 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/dhcp
2024-01-28 09:39:49.198 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/dummy"
2024-01-28 09:39:49.198 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/dummy
2024-01-28 09:39:49.200 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/firewall"
2024-01-28 09:39:49.200 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/firewall
2024-01-28 09:39:49.201 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/flannel"
2024-01-28 09:39:49.201 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
2024-01-28 09:39:49.203 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-device"
2024-01-28 09:39:49.203 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-device
2024-01-28 09:39:49.206 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-local"
2024-01-28 09:39:49.206 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
2024-01-28 09:39:49.227 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/install"
2024-01-28 09:39:49.227 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/install
2024-01-28 09:39:49.229 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/ipvlan"
2024-01-28 09:39:49.229 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/ipvlan
2024-01-28 09:39:49.230 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/loopback"
2024-01-28 09:39:49.230 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
2024-01-28 09:39:49.232 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/macvlan"
2024-01-28 09:39:49.232 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/macvlan
2024-01-28 09:39:49.235 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/portmap"
2024-01-28 09:39:49.235 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
2024-01-28 09:39:49.237 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/ptp"
2024-01-28 09:39:49.237 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/ptp
2024-01-28 09:39:49.238 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/sbr"
2024-01-28 09:39:49.238 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/sbr
2024-01-28 09:39:49.239 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/static"
2024-01-28 09:39:49.239 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/static
2024-01-28 09:39:49.241 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/tuning"
2024-01-28 09:39:49.241 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
2024-01-28 09:39:49.243 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/vlan"
2024-01-28 09:39:49.243 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/vlan
2024-01-28 09:39:49.244 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/vrf"
2024-01-28 09:39:49.244 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/vrf
2024-01-28 09:39:49.244 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-01-28 09:39:49.267 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.26.3

2024-01-28 09:39:49.267 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
W0128 09:39:49.267437       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-01-28 09:40:19.297 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout
2024-01-28 09:40:19.297 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout
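
Most of the log above is routine installer output, and only the last two lines describe the failure. A minimal sketch for filtering them, assuming the bracketed level format seen above (for example "[FATAL][1]"); the function name cni_errors is hypothetical:

```shell
# Hypothetical helper: keep only ERROR and FATAL lines from install-cni logs.
# Assumes the bracketed level format seen in the log above, e.g. "[FATAL][1]".
cni_errors() {
    grep -E '\[(ERROR|FATAL)\]\[[0-9]+\]'
}

# Usage (pod name taken from this report; substitute your own):
#   kubectl logs rke2-canal-4wj6m -n kube-system -c install-cni | cni_errors
```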

Important note:

The issue occurs on Ubuntu 22.04 hosts that have the following network adapter:
BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller

We have several other production servers using the same RKE2 version, equipped with Intel NICs (Ethernet Controller 10-Gigabit X540-AT2), and they are functioning properly.

@brandond
Member

2024-01-28 09:40:19.297 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout
2024-01-28 09:40:19.297 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout

When you see this happen, is there a kube-proxy pod running on the affected node? Are you able to curl -vks https://10.2.0.1:443 from the node?
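
The two checks asked for here can be wrapped so that the curl call gives up instead of hanging. This is only a sketch: the helper names vip_reachable and pods_on_node are hypothetical, and the 5-second default is an arbitrary assumption. The node name is passed to kubectl through a field selector on spec.nodeName.

```shell
# Hypothetical helper: succeed if the service IP answers at all within N seconds.
# Any HTTP status (even 401/403) counts as reachable; a timeout does not.
vip_reachable() {
    url=$1
    secs=${2:-5}
    curl -ks --connect-timeout "$secs" --max-time "$secs" -o /dev/null "$url"
}

# Hypothetical helper: list kube-system pods scheduled on one node.
pods_on_node() {
    kubectl get pods -n kube-system -o wide --field-selector "spec.nodeName=$1"
}

# Usage, run on the affected node:
#   pods_on_node dev-worker-4
#   vip_reachable https://10.2.0.1:443 || echo "service IP unreachable"
```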

@klalafaryan
Author

Thanks @brandond

I ran the following to check the status of the kube-system pods on the affected node:

kubectl get pods -n kube-system -o wide | grep dev-worker-4

Output:

kube-proxy-dev-worker-4                  1/1     Running      7 (2d18h ago)    2d18h   XXX.XX.XXX.XX   dev-worker-4   <none>           <none>
node-local-dns-8sx9g                     1/1     Running      10 (9m7s ago)    2d18h   XXX.XX.XXX.XX   dev-worker-4   <none>           <none>
rke2-canal-gnlv9                         0/2     Init:Error   6 (3m29s ago)    3h18m   XXX.XX.XXX.XX   dev-worker-4   <none>           <none>
rke2-ingress-nginx-controller-qsd8p      0/1     Unknown      7                2d18h   <none>          dev-worker-4   <none>           <none>

Canal is failing with the same error. The rke2-canal-gnlv9 pod fails to initialize and logs the following:

2024-01-30 19:51:33.506 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout
2024-01-30 19:51:33.506 [FATAL][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.2.0.1:443/api/v1/namespaces/kube-system/serviceaccounts/canal/token": dial tcp 10.2.0.1:443: i/o timeout

Although kube-proxy-dev-worker-4 shows as Running, fetching logs with:

kubectl logs -f pods/kube-proxy-dev-worker-4 -n kube-system

Results in:

Error from server (BadRequest): container "kube-proxy" in pod "kube-proxy-dev-worker-4" is not available
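
When the API server reports Running but the container is not available, it can help to ask the container runtime on the node directly. The sketch below assumes the default RKE2 locations for crictl and its config file; both paths are assumptions and can be overridden through the environment. The function name running_containers is hypothetical.

```shell
# Assumed default RKE2 locations; override via environment if yours differ.
CRICTL=${CRICTL:-/var/lib/rancher/rke2/bin/crictl}
CRI_CONFIG_FILE=${CRI_CONFIG_FILE:-/var/lib/rancher/rke2/agent/etc/crictl.yaml}
export CRI_CONFIG_FILE

# Hypothetical helper: print how many containers named like $1 are running
# according to the container runtime (crictl ps lists running containers only).
running_containers() {
    "$CRICTL" ps --name "$1" --quiet | grep -c .
}

# Usage, as root on the affected node:
#   running_containers kube-proxy
```

A count of 0 would be consistent with the stale Running status shown by kubectl, though it does not explain why the container failed to start.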

Steps to Reproduce

  • Rebooted both the master (single instance) and a worker node.
  • Post-reboot, node-local-dns and canal pods restarted as expected.
  • However, kube-proxy did not restart correctly, despite being rebooted around the same time (9m7s ago).

Any thoughts?

@klalafaryan
Author

klalafaryan commented Jan 30, 2024

I was not able to curl -vks https://10.2.0.1:443

curl -vks https://10.2.0.1:443
*   Trying 10.2.0.1:443...

@brandond
Member

This sounds like the same underlying issue as #4864

@klalafaryan
Author

klalafaryan commented Jan 30, 2024

Is kube-proxy used by canal?

I checked on the host, and the virtual interfaces are missing. I suspected that this could be related to the CNI plugin.

I deleted /var/lib/rancher/rke2/agent/pod-manifests/kube-proxy.yaml and then restarted the agent. kube-proxy recovered, and canal started to work as well.
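
The workaround described above can be sketched as a small function. The manifest path comes from this report; the systemd unit name rke2-agent is an assumption about a standard worker install, and the function name reset_kube_proxy is hypothetical. Presumably the agent writes a fresh manifest when it starts, which would explain why kube-proxy recovered.

```shell
# Hypothetical helper wrapping the workaround described above.
# $1: static pod manifest directory (default is the path from this report).
reset_kube_proxy() {
    dir=${1:-/var/lib/rancher/rke2/agent/pod-manifests}
    manifest=$dir/kube-proxy.yaml
    if [ -f "$manifest" ]; then
        rm -- "$manifest" || return 1
    fi
    # Assumption: the agent runs as the rke2-agent systemd unit on worker nodes.
    systemctl restart rke2-agent
}

# Usage, as root on the affected worker:
#   reset_kube_proxy
```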

However, I then tried to make the kube-proxy pod crash intentionally. I replaced the certificate in the following file with a random string so that it would be invalid:

/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig

and then sent SIGTERM to the kube-proxy process:

kill -15 $KUBE_PROXY_PID

Subsequently, kube-proxy began to fail, which was the desired outcome. I then terminated the canal pod, and it restarted successfully.

I also deleted the network interface with sudo ip link delete flannel.1. After I restarted canal, the interface was recreated. kube-proxy was failing during this entire process.
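
To confirm whether an interface such as flannel.1 is present before and after restarting canal, a check against sysfs avoids parsing ip link output. This is a minimal sketch; the function name iface_exists is hypothetical, and its second argument exists only so the function can be exercised without touching the real /sys tree.

```shell
# Hypothetical helper: succeed if a network interface exists on this host.
# Reads sysfs instead of parsing `ip link` output; $2 overrides the sysfs
# root and exists only to make the function easy to test.
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

# Usage:
#   iface_exists flannel.1 && echo present || echo missing
```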

Any thoughts?

@brandond
Member

brandond commented Jan 30, 2024

Is kube-proxy used by canal?

The logs indicate that the cni-installer failure was due to an inability to reach the in-cluster kubernetes service endpoint to create a token. Access to cluster service endpoints is handled by kube-proxy. So yes.
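
One way to see this dependency on a node is to look for the NAT rules that route the service IP. The sketch below assumes kube-proxy runs in iptables mode, where service rules live in the KUBE-SERVICES chain of the nat table; IPVS mode would need a different check. The function name service_rules is hypothetical.

```shell
# Hypothetical helper: filter `iptables-save -t nat` output (on stdin) for
# KUBE-SERVICES rules whose destination is the given ClusterIP.
# Assumes kube-proxy runs in iptables mode; IPVS mode needs a different check.
service_rules() {
    grep -F -- "-A KUBE-SERVICES" | grep -F -- "-d $1/32"
}

# Usage (as root on the affected node):
#   iptables-save -t nat | service_rules 10.2.0.1
```

Empty output would be consistent with kube-proxy not having programmed the node, in which case connections to 10.2.0.1:443 have nowhere to go and time out.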

@caroline-suse-rancher
Contributor

I'm closing this one since it's a duplicate of #4864

@caroline-suse-rancher closed this as not planned on Feb 5, 2024