Upgrading to 2.18 breaks pods #9504

Closed
agaoglu opened this issue Nov 18, 2022 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


agaoglu commented Nov 18, 2022

Environment:

  • Cloud provider or hardware configuration:
    Bare metal, dual core intel

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 4.15.0-74-generic x86_64
    NAME="Ubuntu"
    VERSION="18.04.3 LTS (Bionic Beaver)"

  • Version of Ansible (ansible --version):
    whatever's in the kubespray/kubespray:v2.18.2 docker image

  • Version of Python (python --version):
    whatever's in the kubespray/kubespray:v2.18.2 docker image

Kubespray version (commit) (git rev-parse --short HEAD):
kubespray/kubespray:v2.18.2 docker image

Network plugin used:
cni

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):

Command used to invoke ansible:

docker run --rm -it \
  -v $(pwd)/$VERSION/inventory/$DEPLOYMENT:/inventory/this \
  -v ${HOME}/.ssh/id_rsa:/root/.ssh/id_rsa \
  kubespray/kubespray:$VERSION \
  ansible-playbook upgrade-cluster.yml -b \
    -i /inventory/this/inventory.ini \
    -e '{ "drain_nodes": false }'

Output of ansible run:

Anything else we need to know:

I'm trying to upgrade an old cluster. It started on v2.13 and I got it as far as v2.17 with no downtime, i.e. drain_nodes: false, but I finally got stuck on 2.18. I think the problem is #8107: kubespray deletes the runc binary while it is still used by running containers. upgrade-cluster.yml completes without any issue but leaves some pods in Pending state:

metallb-system         speaker-hlmb6                                     0/1     Pending             0          12h    10.200.2.1    node2   <none>           <none>
monitoring             ipmi-exporter-qgmfd                               0/1     Pending             0          12h    10.200.2.1    node2   <none>           <none>
monitoring             node-exporter-kv8cr                               0/2     Pending             0          12h    10.200.2.1    node2   <none>           <none>
monitoring             prometheus-k8s-1                                  0/3     Pending             1          12h    <none>        node2   <none>           <none>
monitoring             prometheus-node-1                                 0/3     Pending             1          12h    <none>        node2   <none>           <none>

Containerd says:

Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.008434930Z" level=info msg="Kill container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\""
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.009612513Z" level=error msg="StopPodSandbox for \"04592e8a2f7d82d2a236a71b2d8742d550ece1a4667ec069c0a3d819a1595e7f\" failed" error="failed to stop container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\": failed to kill container \"25210581255c8847044ad16eadae98e95889274d7048b5e755a030969252dc6f\": unknown error after kill: fork/exec /usr/bin/runc: no such file or directory: : unknown"
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.648414159Z" level=error msg="ExecSync for \"6faa2c1f1a3a19f131b12566363d845a8f0b536465ac81bad193338e5d39cb39\" failed" error="failed to exec in container: failed to start exec \"2c6df93ef30715f128118f3b0963162c6df70a55d2336711e70bddb7c9b49248\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown"
Nov 18 20:48:35 node2 containerd[5973]: time="2022-11-18T20:48:35.658364559Z" level=error msg="ExecSync for \"6faa2c1f1a3a19f131b12566363d845a8f0b536465ac81bad193338e5d39cb39\" failed" error="failed to exec in container: failed to start exec \"a3d92071ed3baf6104d0cc97e364427e3dc1fa6330f209c8a0f5cb8a3d2ff801\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown"

I saw that I'm missing /usr/bin/runc and found out that 2.18 deletes it, since it now uses /usr/local/bin/runc instead. I tried to override this by setting runc_bin_dir to /usr/bin so the binary stays in place. That stops kubespray from deleting it, but it didn't help: upgrade-cluster.yml now fails with the kubeadm upgrade task timing out (output further below).
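
For reference, the override amounts to a single variable; the group_vars path here is just an illustrative place to put it, any file your inventory already loads will do:

# e.g. inventory/this/group_vars/all/containerd.yml (illustrative path)
runc_bin_dir: /usr/bin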

TASK [kubernetes/control-plane : kubeadm | Check api is up] **********************************************************************************************************************************************************************************
ok: [node2] 
Friday 18 November 2022  20:02:30 +0000 (0:00:00.676)       0:11:15.855 *******                                                                                                                                                               
Friday 18 November 2022  20:02:30 +0000 (0:00:00.048)       0:11:15.903 *******

 
TASK [kubernetes/control-plane : kubeadm | Upgrade other masters] ****************************************************************************************************************************************************************************
fatal: [node2]: FAILED! => {"changed": true, "cmd": [
 "timeout",
 "-k",
 "600s",
 "600s",
 "/usr/local/bin/kubeadm",
 "upgrade",
 "apply",
 "-y",
 "v1.22.8",
 "--certificate-renewal=True",
 "--config=/etc/kubernetes/kubeadm-config.yaml",
 "--ignore-preflight-errors=all",
 "--allow-experimental-upgrades",
 "--etcd-upgrade=false",
 "--force"], "delta": "0:05:09.724747",
 "end": "2022-11-18 20:07:51.093915",
 "failed_when_result": true,
 "msg": "non-zero return code",
 "rc": 1,
 "start": "2022-11-18 20:02:41.369168",
 "stderr": "
W1118 20:02:41.418768    9606 common.go:95] WARNING: Usage of the --config flag with kubeadm config types for reconfiguring the cluster during upgrade is not recommended!
W1118 20:02:41.427069    9606 utils.go:69] The recommended value for \"clusterDNS\" in \"KubeletConfiguration\" is: [10.233.0.10]; the provided value is: [169.254.25.10]
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: timed out waiting for the condition
To see the stack trace of this error execute with --v=5 or higher",
 "stdout": "
[upgrade/config] Making sure the configuration is correct:
[preflight] Running pre-flight checks.
[upgrade] Running cluster health checks
[upgrade/version] You have chosen to change the cluster version to \"v1.22.8\"
[upgrade/versions] Cluster version: v1.21.6
[upgrade/versions] kubeadm version: v1.22.8
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version \"v1.22.8\"...
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-controller-manager-node2 hash: ae9eeffc347a33af612783c4a9a77935
Static pod: kube-scheduler-node2 hash: a01659fd4cce53f573cb0e0ee9099108
[upgrade/staticpods] Writing new Static Pod manifests to \"/etc/kubernetes/tmp/kubeadm-upgraded-manifests264605725\"
[upgrade/staticpods] Preparing for \"kube-apiserver\" upgrade
[upgrade/staticpods] Renewing apiserver certificate
[upgrade/staticpods] Renewing apiserver-kubelet-client certificate
[upgrade/staticpods] Renewing front-proxy-client certificate
[upgrade/staticpods] Moved new manifest to \"/etc/kubernetes/manifests/kube-apiserver.yaml\" and backed up old manifest to \"/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-11-18-20-02-50/kube-apiserver.yaml\"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
Static pod: kube-apiserver-node2 hash: c1e18bf602ea91bb833c1f9e7826823d
...

kubelet complains that it cannot kill the older apiserver, again because runc is missing:

Nov 18 20:02:51 node2 kubelet[8572]: I1118 20:02:51.069458    8572 kubelet.go:2063] "SyncLoop REMOVE" source="file" pods=[kube-system/kube-apiserver-node2]
Nov 18 20:02:51 node2 kubelet[8572]: I1118 20:02:51.069906    8572 kuberuntime_container.go:723] "Killing container with a grace period" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID="containerd://b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5" gracePeriod=30
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.070355    8572 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec \"737614d33fb283968b43e8134b2138a88dd1ebc9dd9fea465a93140fdc593f97\": OCI runtime exec failed: fork/exec /usr/bin/runc: no such file or directory: unknown" containerID="7931dc8b641d0aac8ce4e04a3bfcac9ea419872d471de33c8448125f76b6528a" cmd=[env -i sh -c ceph --admin-daemon /run/ceph/ceph-osd.151.asok status]
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.071866    8572 remote_runtime.go:276] "StopContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" containerID="b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5"
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.072000    8572 kuberuntime_container.go:728] "Container termination failed with gracePeriod" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID="containerd://b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5" gracePeriod=30
Nov 18 20:02:51 node2 kubelet[8572]: E1118 20:02:51.072102    8572 kuberuntime_container.go:753] "Kill container failed" err="rpc error: code = Unknown desc = failed to stop container \"b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5\": unknown error after kill: exec: \"runc\": executable file not found in $PATH: <nil>: unknown" pod="kube-system/kube-apiserver-node2" podUID=c1e18bf602ea91bb833c1f9e7826823d containerName="kube-apiserver" containerID={Type:containerd ID:b2ff61dfa68a702c75204ec893fec564fe0750d8d5934d867436542de8a4e4b5}

I know I'm talking about pretty old stuff, but still. I'll try to keep the orphans around a bit longer, I guess.

agaoglu added the kind/bug label Nov 18, 2022

agaoglu commented Nov 19, 2022

/usr/bin/runc is not orphaned on Ubuntu 18.04. It belongs to the containerd.io package and is removed along with it. When containerd is restarted from the downloaded tarball instead of the apt package, it still looks for /usr/bin/runc, cannot find it, and therefore cannot manage the already-running containers. Setting runc_bin_dir to /usr/bin should have fixed this, but it doesn't, because the containerd.io apt package is removed after runc is copied. In other words, kubespray first replaces a deb-owned binary and then removes it together with the package:

TASK [container-engine/runc : Copy runc binary from download dir] ****************************************************************************************************************************************************************************
changed: [node4]
Saturday 19 November 2022  10:51:46 +0000 (0:00:01.319)       0:12:57.404 ***** 
--
TASK [container-engine/containerd : containerd | Remove any package manager controlled containerd package] ***********************************************************************************************************************************
changed: [node4]
Saturday 19 November 2022  10:52:03 +0000 (0:00:05.165)       0:13:14.059 ***** 
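
For what it's worth, the deb ownership is easy to confirm; on Ubuntu 18.04 with the stock containerd.io package I'd expect something like:

$ dpkg -S /usr/bin/runc
containerd.io: /usr/bin/runc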

I also tried to set the runc BinaryName in the containerd config to /usr/local/bin/runc, but that didn't work either.
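
Roughly what I mean, in /etc/containerd/config.toml (version 2 layout; the exact section names may differ depending on how kubespray templates it):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    # point the CRI runc runtime at the tarball-installed binary
    BinaryName = "/usr/local/bin/runc"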


agaoglu commented Nov 20, 2022

Fixed it by linking the updated /usr/local/bin/runc to /usr/bin/runc after the containerd apt package is removed. And I had to delete the "remove orphaned runc" step, of course.
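
Concretely, something like this on each node once the containerd.io package is gone (a sketch, adjust as needed):

# restore the path that containerd and the already-running shims still look for,
# pointing it at the runc that kubespray installs from the tarball
ln -s /usr/local/bin/runc /usr/bin/runc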

agaoglu closed this as completed Nov 20, 2022