I'm trying to upgrade an old cluster. I started with v2.13 and got it as far as v2.17 with no downtime, i.e. drain_nodes: false, but finally got stuck at 2.18. I think the problem is #8107, where kubespray tries to delete the runc binary while it is still in use by running containers. upgrade-cluster.yml completes without any issue, but leaves some pods in Pending state. Containerd says:
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.008434930Z" level=info msg="Kill container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\""
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.009612513Z" level=error msg="StopPodSandbox for \"04592e8a2f7d82d2a236a71b2d8742d550ece1a4667ec069c0a3d819a1595e7f\" failed" error="failed to stop container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\": failed to kill container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\": unknown error after kill: fork/exec /usr/bin/runc: no such file or directory: : unknown"
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.648414159Z" level=error msg="ExecSync for \"6faa2c1f1a3a19f131b12566363d845a8f0b536465ac81bad193338e5d39cb39\" failed" error="failed to exec in container: failed to start exec \"2c6df93ef30715f128118f3b0963162c6df70a55d2336711e70bddb7c9b49248\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown"
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.658364559Z" level=error msg="ExecSync for \"6faa2c1f1a3a19f131b12566363d845a8f0b536465ac81bad193338e5d39cb39\" failed" error="failed to exec in container: failed to start exec \"a3d92071ed3baf6104d0cc97e364427e3dc1fa6330f209c8a0f5cb8a3d2ff801\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown"
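Before digging further it helps to confirm which runc binaries actually exist on an affected node. A minimal check, using the two paths discussed in this issue (the "present/missing" labels are just this sketch's output, not from any tool):

```shell
# Report whether a runc binary exists at each of the two candidate
# locations: /usr/bin/runc (old containerd.io deb) and
# /usr/local/bin/runc (kubespray 2.18 tarball install).
for p in /usr/bin/runc /usr/local/bin/runc; do
  if [ -x "$p" ]; then
    echo "present: $p"
  else
    echo "missing: $p"
  fi
done
```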
I saw that /usr/bin/runc was missing and found out that 2.18 deletes it, as it now uses /usr/local/bin/runc instead. I tried to override this by setting runc_bin_dir to /usr/bin to keep the binary in place. That prevents kubespray from deleting it, but it didn't help: upgrade-cluster.yml now fails with the kubeadm upgrade task timing out.
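For reference, that override is a single variable; I put it in the cluster group vars (the file path below follows the sample inventory layout and may differ in your setup):

```yaml
# Hypothetical placement: inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
# Keep runc at the path the apt-installed containerd recorded for
# already-running containers, instead of the 2.18 default /usr/local/bin.
runc_bin_dir: /usr/bin
```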
TASK [kubernetes/control-plane : kubeadm | Check api is up] **********************************************************************************************************************************************************************************
ok: [node2]
Friday 18 November 2022 20:02:30 +0000 (0:00:00.676) 0:11:15.855 *******
Friday 18 November 2022 20:02:30 +0000 (0:00:00.048) 0:11:15.903 *******
TASK [kubernetes/control-plane : kubeadm | Upgrade other masters] ****************************************************************************************************************************************************************************
fatal: [node2]: FAILED! => {"changed": true, "cmd": [
"timeout",
"-k",
"600s",
"600s",
"/usr/local/bin/kubeadm",
"upgrade",
"apply",
"-y",
"v1.22.8",
"--certificate-renewal=True",
"--config=/etc/kubernetes/kubeadm-config.yaml",
"--ignore-preflight-errors=all",
"--allow-experimental-upgrades",
"--etcd-upgrade=false",
"--force"], "delta": "0:05:09.724747",
"end": "2022-11-18 20:07:51.093915",
"failed_when_result": true,
"msg": "non-zero return code",
"rc": 1,
"start": "2022-11-18 20:02:41.369168",
"stderr": "
W1118 20:02:41.418768 9606 common.go:95] WARNING: Usage of the --config flag with kubeadm config types for reconfiguring the cluster during upgrade is not recommended!
W1118 20:02:41.427069 9606 utils.go:69] The recommended value for \"clusterDNS\" in \"KubeletConfiguration\" is: [10.233.0.10]; the provided value is: [169.254.25.10]
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher",
"stdout": "
[upgrade/config] Making sure the configuration is correct:
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to \"v1.22.8\"
[upgrade/versions] Cluster version: v1.21.6
[upgrade/versions] kubeadm version: v1.22.8
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version \"v1.22.8\"...
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-controller-manager-node2 hash: ae9eeffc347a33af612783c4a9a77935
Static pod: kube-scheduler-node2 hash: a01659fd4cce53f573cb0e0ee9099108
[upgrade/staticpods] Writing new Static Pod manifests to \"/etc/kubernetes/tmp/kubeadm-upgraded-manifests264605725\"
[upgrade/staticpods] Preparing for \"kube-apiserver\" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to \"/etc/kubernetes/manifests/kube-apiserver.yaml\" and backed up old manifest to \"/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-18-20-02-50/kube-apiserver.yaml\"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
...
kubelet complains that it cannot kill the older apiserver, again because runc is missing:
Nov 18 20:02:51 node2 kubelet[8572]: I1118 20:02:51.069458 8572 kubelet.go:2063] "SyncLoop REMOVE" source="file" pods=[kube-system/kube-apiserver-node2]
Nov 18 20:02:51 node2 kubelet[8572]: I1118 20:02:51.069906 8572 kuberuntime_container.go:723] "Killing container with a grace period" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID="containerd://b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5" gracePeriod=30
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.070355 8572 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec \"737614d33fb283968b43e8134b2138a88dd1ebc9dd9fea465a93140fdc593f97\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown" containerID="7931dc8b641d0aac8ce4e04a3bfcac9ea419872d471de33c8448125f76b6528a" cmd=[env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.151.asok status]
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.071866 8572 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" containerID="b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5"
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.072000 8572 kuberuntime_container.go:728] "Container termination failed with gracePeriod" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID="containerd://b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5" gracePeriod=30
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.072102 8572 kuberuntime_container.go:753] "Kill container failed" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID={Type:containerd ID:b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5}
I know I'm talking about pretty old stuff, but still. I'll keep the orphaned binaries around a bit longer, I guess.
/usr/bin/runc is not orphaned on Ubuntu 18.04: it belongs to the containerd.io package and is removed along with it. When containerd is then restarted from the downloaded tarball instead of the apt package, it still looks for /usr/bin/runc for the already-running containers, cannot find it, and so cannot manage them. Setting runc_bin_dir to /usr/bin should have fixed this, but it doesn't, because the containerd.io apt package is removed after runc is copied. In other words, kubespray would first replace a deb-owned binary and then remove it along with the package.
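The deb ownership described above can be confirmed on the node (Ubuntu-specific; dpkg-query exits non-zero when no package owns the path, hence the fallback message, which is just this sketch's own wording):

```shell
# Ask dpkg which package owns /usr/bin/runc. On a stock Ubuntu 18.04 node
# with containerd.io installed this should report the containerd.io package;
# after the package is purged, no owner is found.
dpkg-query -S /usr/bin/runc 2>/dev/null || echo "no package owns /usr/bin/runc"
```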
I fixed it by symlinking the updated /usr/local/bin/runc to /usr/bin/runc after the containerd apt package was removed. And I had to delete the "remove orphaned runc" step, of course.
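As a sandboxed sketch of that workaround (a temp directory stands in for the node's filesystem root here; on a real node the ln command runs directly against /usr/local/bin/runc and /usr/bin/runc, as root):

```shell
# Demonstrate the symlink fix in a throwaway directory tree.
# On the actual node the paths are /usr/local/bin/runc (tarball install)
# and /usr/bin/runc (where the old containerd.io deb had it).
root=$(mktemp -d)
mkdir -p "$root/usr/local/bin" "$root/usr/bin"
printf 'fake runc\n' > "$root/usr/local/bin/runc"  # stand-in for the real binary
chmod +x "$root/usr/local/bin/runc"
# The workaround itself: -s symlink, -f replace any stale file,
# -n do not dereference an existing link at the destination.
ln -sfn "$root/usr/local/bin/runc" "$root/usr/bin/runc"
readlink "$root/usr/bin/runc"   # prints the link target
```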
Environment:
Cloud provider or hardware configuration: bare metal, dual-core Intel
OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
Linux 4.15.0-74-generic x86_64
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
Version of Ansible (ansible --version): whatever is in the kubespray/kubespray:v2.18.2 docker image
Version of Python (python --version): whatever is in the kubespray/kubespray:v2.18.2 docker image
Kubespray version (commit) (git rev-parse --short HEAD): kubespray/kubespray:v2.18.2 docker image
Network plugin used: cni
Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
Command used to invoke ansible:
Output of ansible run:
Anything else do we need to know: