kubeadm stuck updating etcd static pod #2718

Closed
brianmay opened this issue Jun 30, 2022 · 12 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@brianmay

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8", GitCommit:"a12b886b1da059e0190c54d09c5eab5219dd7acf", GitTreeState:"clean", BuildDate:"2022-06-16T05:56:32Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}

Note: Similar results with 1.24.2

Environment:

  • Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:49:13Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:43:11Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 11 (bullseye)
  • Kernel (e.g. uname -a): Linux kube-master 5.10.0-15-amd64 #1 SMP Debian 5.10.120-1 (2022-06-09) x86_64 GNU/Linux
  • Container runtime (CRI) (e.g. containerd, cri-o): cri-o
  • Container networking plugin (CNI) (e.g. Calico, Cilium): kubenet
  • Others:

What happened?

Tried to upgrade to 1.24.2; it failed to restart etcd, with no obvious errors.

Tried to upgrade to 1.23.8, with similar issues.

What you expected to happen?

etcd should upgrade.

How to reproduce it (as minimally and precisely as possible)?

Upgrade kubernetes from 1.23.6 to anything.

Anything else we need to know?

root@kube-master:~# kubeadm upgrade apply -v 5  v1.23.8
I0630 09:42:54.825807    1504 apply.go:104] [upgrade/apply] verifying health of cluster
I0630 09:42:54.825983    1504 apply.go:105] [upgrade/apply] retrieving configuration from cluster
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
I0630 09:42:55.903339    1504 kubelet.go:91] attempting to download the KubeletConfiguration from the new format location (UnversionedKubeletConfigMap=true)
I0630 09:42:56.468056    1504 kubelet.go:94] attempting to download the KubeletConfiguration from the DEPRECATED location (UnversionedKubeletConfigMap=false)
W0630 09:42:56.639211    1504 utils.go:69] The recommended value for "resolvConf" in "KubeletConfiguration" is: /run/systemd/resolve/resolv.conf; the provided value is: /run/systemd/resolve/resolv.conf
I0630 09:42:56.639668    1504 common.go:164] running preflight checks
[preflight] Running pre-flight checks.
I0630 09:42:56.639790    1504 preflight.go:77] validating if there are any unsupported CoreDNS plugins in the Corefile
I0630 09:42:56.679972    1504 preflight.go:105] validating if migration can be done for the current CoreDNS release.
[upgrade] Running cluster health checks
I0630 09:42:56.742437    1504 health.go:162] Creating Job "upgrade-health-check" in the namespace "kube-system"
I0630 09:42:56.889088    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:42:57.913202    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:42:58.907519    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:42:59.918376    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:00.936711    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:01.909032    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:02.915717    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:03.907989    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:04.909430    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:05.914427    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:06.920811    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:07.920979    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:08.905877    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:09.919813    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:10.913589    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:11.914411    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:11.936287    1504 health.go:192] Job "upgrade-health-check" in the namespace "kube-system" is not yet complete, retrying
I0630 09:43:11.936397    1504 health.go:205] Deleting Job "upgrade-health-check" in the namespace "kube-system"
I0630 09:43:12.121248    1504 apply.go:112] [upgrade/apply] validating requested and actual version
I0630 09:43:12.121507    1504 apply.go:128] [upgrade/version] enforcing version skew policies
[upgrade/version] You have chosen to change the cluster version to "v1.23.8"
[upgrade/versions] Cluster version: v1.23.6
[upgrade/versions] kubeadm version: v1.23.8
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Pulling images required for setting up a Kubernetes cluster
[upgrade/prepull] This might take a minute or two, depending on the speed of your internet connection
[upgrade/prepull] You can also perform this action in beforehand using 'kubeadm config images pull'
I0630 09:43:14.515794    1504 checks.go:842] using image pull policy: IfNotPresent
I0630 09:43:14.817643    1504 checks.go:851] image exists: k8s.gcr.io/kube-apiserver:v1.23.8
I0630 09:43:14.867205    1504 checks.go:851] image exists: k8s.gcr.io/kube-controller-manager:v1.23.8
I0630 09:43:14.929965    1504 checks.go:851] image exists: k8s.gcr.io/kube-scheduler:v1.23.8
I0630 09:43:14.980000    1504 checks.go:851] image exists: k8s.gcr.io/kube-proxy:v1.23.8
I0630 09:43:15.068379    1504 checks.go:851] image exists: k8s.gcr.io/pause:3.6
I0630 09:43:15.134115    1504 checks.go:851] image exists: k8s.gcr.io/etcd:3.5.1-0
I0630 09:43:15.193686    1504 checks.go:851] image exists: k8s.gcr.io/coredns/coredns:v1.8.6
I0630 09:43:15.193815    1504 apply.go:154] [upgrade/apply] performing upgrade
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.23.8"...
Static pod: kube-apiserver-kube-master hash: f54de4df3bfe9738777583258756acd4
Static pod: kube-controller-manager-kube-master hash: 7ee7ebbda1c227edb8422cd0cdf0c247
Static pod: kube-scheduler-kube-master hash: d799da44ad4697928599ac5099261415
I0630 09:43:15.331213    1504 etcd.go:168] retrieving etcd endpoints from "kubeadm.kubernetes.io/etcd.advertise-client-urls" annotation in etcd Pods
I0630 09:43:15.363964    1504 etcd.go:104] etcd endpoints read from pods: https://192.168.3.32:2379,https://192.168.3.35:2379,https://192.168.3.40:2379
I0630 09:43:15.452049    1504 etcd.go:224] etcd endpoints read from etcd: https://192.168.3.40:2379,https://192.168.3.35:2379,https://192.168.3.32:2379
I0630 09:43:15.452143    1504 etcd.go:122] update etcd endpoints: https://192.168.3.40:2379,https://192.168.3.35:2379,https://192.168.3.32:2379
[upgrade/etcd] Upgrading to TLS for etcd
Static pod: etcd-kube-master hash: ef6e73fbb4a726fcc486cbe73a657ad1
I0630 09:43:19.547703    1504 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests2595343906/etcd.yaml"
[upgrade/staticpods] Preparing for "etcd" upgrade
[upgrade/staticpods] Renewing etcd-server certificate
I0630 09:43:19.548339    1504 certs.go:522] validating certificate period for etcd CA certificate
I0630 09:43:19.567195    1504 certs.go:522] validating certificate period for etcd/ca certificate
[upgrade/staticpods] Renewing etcd-peer certificate
[upgrade/staticpods] Renewing etcd-healthcheck-client certificate
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/etcd.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests-2022-06-30-09-43-15/etcd.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/staticpods] This might take a minute or longer depending on the component/version gap (timeout 5m0s)
[...]
Static pod: etcd-kube-master hash: ef6e73fbb4a726fcc486cbe73a657ad1
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available
I0630 09:48:23.422549    1504 etcd.go:484] [etcd] attempting to see if all cluster endpoints ([https://192.168.3.40:2379 https://192.168.3.35:2379 https://192.168.3.32:2379]) are available 1/10
I0630 09:48:25.829016    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:28.046227    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:30.328985    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:32.692569    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:35.122508    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:37.756825    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:40.621149    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:43.917112    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:47.847781    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:52.744053    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
I0630 09:48:58.801792    1504 etcd.go:464] Failed to get etcd status for https://192.168.3.32:2379: failed to dial endpoint https://192.168.3.32:2379 with maintenance client: context deadline exceeded
[upgrade/etcd] Etcd was rolled back and is now available
[...]

As far as I can tell, the pod is not encountering any errors, and there doesn't actually appear to be anything going wrong.

@brianmay
Author

It seems that changing /etc/kubernetes/manifests/etcd.yaml does result in etcd being restarted, but the pod still has the old config, which is a bit weird.

Problem with kubelet maybe? Not seeing any errors from journalctl.
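
A quick way to confirm whether the kubelet has actually picked up a manifest edit is to compare the file on disk with the mirror pod the kubelet reports to the API server (a minimal sketch, using the etcd-kube-master pod name from the logs above; the image field is just a convenient marker to compare):

# on the control-plane node: what the manifest on disk specifies
grep 'image:' /etc/kubernetes/manifests/etcd.yaml

# what the kubelet's mirror pod reports to the API server
kubectl -n kube-system get pod etcd-kube-master -o jsonpath='{.spec.containers[0].image}{"\n"}'

If the two disagree after an edit, the kubelet is either not re-reading the file or is reading a different manifest for the same pod.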

@neolit123
Member

Our upgrade CI is green / passing.
Are you seeing any errors in the etcd container?

@neolit123
Member

but the pod still has the old config

That's probably because it was rolled back.

@brianmay
Author

No, I looked before it was rolled back.

I also made a manual change to the file, and observed the results. The pod is restarted, but kubectl get pod -o yaml $podid still shows the old values.

There are no relevant errors in the etcd container.

Kind of seems weird.
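
For reference, the hash that kubeadm prints ("Static pod: etcd-kube-master hash: ...") appears to come from the kubelet-managed mirror pod, so one way to see whether the kubelet has noticed a manifest change is to watch that annotation (a sketch; the kubernetes.io/config.hash annotation is set by the kubelet on mirror pods):

kubectl -n kube-system get pod etcd-kube-master -o yaml | grep 'kubernetes.io/config.hash'

If the value never changes after editing /etc/kubernetes/manifests/etcd.yaml, the kubelet is not applying the edited manifest, which is what kubeadm then times out waiting for.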

@brianmay
Author

Here are the logs from journalctl after making a single change to the manifest. There appear to be a lot of logs for just a single change. My growing suspicion is that the error is hidden here somewhere, and that kubelet may be the culprit.

log.txt

The ""Nameserver limits exceeded"" messages are not an issue. Should try to work out how to avoid them though. I think this might be related to the fact I have listed both IPv4 and IPv6 nameservers.

I am wondering about these messages, though. Are they an indication of a problem, or is this just normal when restarting etcd? i.e. the error might be that it is trying to contact the etcd instance that was just shut down.

Jun 30 20:18:21 kube-master kubelet[11761]: E0630 20:18:21.416173   11761 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 40s restarting failed container=etcd pod=etcd-kube-master_kube-system(ef6e73fbb4a726fcc486cbe73a657ad1)\"" pod="kube-system/etcd-kube-master" podUID=ef6e73fbb4a726fcc486cbe73a657ad1

But later, after retrying several times, it thinks it is OK:

Jun 30 20:18:45 kube-master containerd[541]: time="2022-06-30T20:18:45.964746650+10:00" level=info msg="CreateContainer within sandbox \"e20926fa1bd071c9c7bee6529e420a0b98b55bbe408d25df8a16b19a82018891\" for container &ContainerMetadata{Name:etcd,Attempt:4,}"

So maybe it's just a red herring.
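
Since the runtime here is CRI-based, the restart attempts can also be inspected directly with crictl rather than going through the API server; that should show whether each attempt exits with a real error or is simply being replaced (a sketch; substitute the container ID reported by the first command):

# list all etcd containers the runtime knows about, including exited attempts
crictl ps -a --name etcd

# show the last lines of a given attempt
crictl logs --tail 50 <container-id>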

@brianmay
Author

brianmay commented Jun 30, 2022

If I make a change to another manifest, e.g. kube-apiserver.yaml, I have seen it work.

I'm wondering if the problem is that the manifest change is written via the etcd daemon that then gets shut down, and it is shut down before it can sync its changes with the rest of the cluster, so the kubelet ends up retrieving the old manifest data when trying to restart the pod. Or something crazy like that.

@neolit123
Member

neolit123 commented Jun 30, 2022

Sorry but I don't see a kubeadm bug here. We normally don't provide support in this issue tracker.

Try #kubeadm, etc., or other support channels like Stack Overflow.

/support

@github-actions

Hello, @brianmay 🤖 👋

You seem to have troubles using Kubernetes and kubeadm.
Note that our issue trackers should not be used for providing support to users.
There are special channels for that purpose.

Please see:

github-actions bot added the kind/support label on Jun 30, 2022
@neolit123
Member

Of course, if there is a confirmed reproducible bug let's reopen.

@github-actions

Hello, @brianmay 🤖 👋

You seem to have troubles using Kubernetes and kubeadm.
Note that our issue trackers should not be used for providing support to users.
There are special channels for that purpose.

Please see:

@brianmay
Author

Finally found the problem. I had a /etc/kubernetes/manifests/etcd.yaml.bak file. No idea what created it; it doesn't appear to be from vim, which I generally use for editing these files.

While this is not a bug, I would suggest that kubeadm check for the existence of typical editor backup files that could cause problems for the upgrade, and warn the operator if anything is found.

I am also a bit surprised that Kubernetes will read such files; I would normally have assumed it would limit the files it processes to those matching *.yaml.

This is the Stack Overflow answer that helped me: https://stackoverflow.com/a/56326068/5766144
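
Until something like that exists, a pre-upgrade sanity check is easy to script; anything in the static pod manifest directory that does not match *.yaml is suspect (a sketch, assuming the default /etc/kubernetes/manifests path):

ls -la /etc/kubernetes/manifests/
find /etc/kubernetes/manifests/ -maxdepth 1 -type f ! -name '*.yaml'

If the find command prints anything (such as the etcd.yaml.bak file here), move it out of the directory before running kubeadm upgrade.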

@neolit123
Member

neolit123 commented Jun 30, 2022

Also am a bit surprised that Kubernetes will read such files, I would normally have assumed it should limit files it processed to those matching *.yaml

I think this is related to this report / PR for the kubelet:
kubernetes/kubernetes#63910
