
kubeadm upgrade plan not working for v1.13.5 to v1.14.0 #1469

Closed
terrywang opened this issue Mar 27, 2019 · 14 comments
Labels
area/upgrades
help wanted (Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.)
priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.)

Comments

@terrywang

terrywang commented Mar 27, 2019

Is this a BUG REPORT or FEATURE REQUEST?

Bug Report

Versions

kubeadm version (use kubeadm version):
v1.14.0

Environment:

  • Kubernetes version (use kubectl version): v1.13.5

  • Cloud provider or hardware configuration: AWS EC2

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS

  • Kernel (e.g. uname -a): Linux k8s-node-1 4.4.0-143-generic #169-Ubuntu SMP Thu Feb 7 07:56:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Others: kubeadm-provisioned single-master k8s cluster (3 nodes); this cluster was originally created with kubeadm when k8s was at v1.9.0.

What happened?

Used kubeadm to upgrade the cluster: v1.13.4 to v1.13.5 was successful, but the upgrade to v1.14.0 failed because the kubeadm upgrade plan pre-flight checks try to connect to etcd using the node's private IP (assigned to NIC eth0) instead of the loopback address etcd is bound to.

Error

root@k8s-node-1:~# sudo kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error syncing endpoints with etc: dial tcp 192.168.100.21:2379: connect: connection refused

As per the kubeadm init workflow, in a single-master cluster the etcd pod is created via a static pod manifest. Looking at the manifest, etcd binds only to 127.0.0.1 and is not exposed to the outside world.

root@k8s-node-1:/etc/kubernetes/manifests# cat etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://127.0.0.1:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://127.0.0.1:2380
    - --initial-cluster=k8s-node-1=https://127.0.0.1:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379
    - --listen-peer-urls=https://127.0.0.1:2380
    - --name=k8s-node-1
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.2.24
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get foo
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
status: {}
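For reference, which addresses etcd is actually listening on can also be confirmed directly on the node (a quick check, assuming ss is available; netstat -tlnp works too):

root@k8s-node-1:~# ss -tlnp | grep 2379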

What you expected to happen?

kubeadm upgrade plan should work as expected and output the upgrade details, just like the v1.13.4 to v1.13.5 upgrade did.

ubuntu@k8s-node-1:~$ sudo kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.13.4
[upgrade/versions] kubeadm version: v1.13.5
I0327 09:52:14.655319   12224 version.go:237] remote version is much newer: v1.14.0; falling back to: stable-1.13
[upgrade/versions] Latest stable version: v1.13.5
[upgrade/versions] Latest version in the v1.13 series: v1.13.5

Components that must be upgraded manually after you have upgraded the control plane with 'kubeadm upgrade apply':
COMPONENT   CURRENT       AVAILABLE
Kubelet     2 x v1.13.4   v1.13.5
            1 x v1.13.5   v1.13.5

Upgrade to the latest version in the v1.13 series:

COMPONENT            CURRENT   AVAILABLE
API Server           v1.13.4   v1.13.5
Controller Manager   v1.13.4   v1.13.5
Scheduler            v1.13.4   v1.13.5
Kube Proxy           v1.13.4   v1.13.5
CoreDNS              1.2.6     1.2.6
Etcd                 3.2.24    3.2.24

You can now apply the upgrade by executing the following command:

        kubeadm upgrade apply v1.13.5

_____________________________________________________________________

ubuntu@k8s-node-1:~$ sudo kubeadm upgrade apply v1.13.5

How to reproduce it (as minimally and precisely as possible)?

Follow the upgrade guide and upgrade any v1.13.x cluster (created using kubeadm) to v1.14.0.

I've tried to change the bind address, but it has so many dependencies that it breaks more than it fixes. I also tried exposing the pod as a NodePort service and using iptables rules to forward traffic destined for the node's IP address (192.168.100.21 in this case) on port 2379 to loopback, with no luck.

Is there a way to override the etcd endpoint when running kubeadm upgrade plan? That would be the easiest solution.
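For reference, the iptables redirect I attempted was roughly of this shape (a sketch only, using my node IP; it did not help):

iptables -t nat -A OUTPUT -p tcp -d 192.168.100.21 --dport 2379 -j DNAT --to-destination 127.0.0.1:2379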

Anything else we need to know?

Hmm...

@neolit123 added the priority/important-soon, area/upgrades, and help wanted labels on Mar 27, 2019
@neolit123
Member

thanks for the report.
i will try to reproduce your problem.

@neolit123
Member

$ kubectl get nodes
NAME         STATUS   ROLES    AGE    VERSION
luboitvbox   Ready    master   6m9s   v1.13.5

$ kubeadm version --output=short
v1.14.0

$ sudo kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.13.5
[upgrade/versions] kubeadm version: v1.14.0

Awesome, you're up-to-date! Enjoy!

here is what i get. this is a bit of a bug on its own, because it's telling me that i'm up to date while it should be telling me to update to v1.14.0. i will log a bug about this "Awesome, you're up-to-date! Enjoy!" case.

but my etcd manifest looks like this:

$ sudo cat /etc/kubernetes/manifests/etcd.yaml
...
containers:
  - command:
    - etcd
    - --advertise-client-urls=https://192.168.0.102:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://192.168.0.102:2380
    - --initial-cluster=luboitvbox=https://192.168.0.102:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.0.102:2379
    - --listen-peer-urls=https://192.168.0.102:2380
    - --name=luboitvbox
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.2.24
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
...

did you happen to create this cluster using 1.12 before upgrading to 1.13?

i remember that we made some changes to the etcd addresses related to HA setups.
try making your manifest like the above.
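e.g. for your node (192.168.100.21 / k8s-node-1, adjust if different) the client URL flags would look roughly like this:

    - --advertise-client-urls=https://192.168.100.21:2379
    - --listen-client-urls=https://127.0.0.1:2379,https://192.168.100.21:2379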

$ sudo kubeadm upgrade apply v1.14.0
...
[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.14.0". Enjoy!

[upgrade/kubelet] Now that your control plane is upgraded, please proceed with upgrading your kubelets if you haven't already done so.

...
# (upgrade kubelet)
$ sudo systemctl restart kubelet
$ kubectl get nodes
NAME         STATUS   ROLES    AGE   VERSION
luboitvbox   Ready    master   23m   v1.14.0

the upgrade worked for me.

@neolit123
Member

here is what i get. this is a bit of a bug on its own, because it's telling me that i'm up to date while it should be telling me to update to v1.14.0. i will log a bug about this "Awesome, you're up-to-date! Enjoy!" case.

logged:
#1470

@terrywang
Author

@neolit123 Thanks for the comments. The cluster was initially created using kubeadm v1.9.x. It was later rebuilt, definitely with a version before v1.12.0. No wonder the etcd static pod manifest is different.

What exactly has changed in the etcd manifest since v1.12.0? I tried to search for that but had no luck.

I'll try to generate new static pod manifests using the latest version on a different machine and see if I can figure it out (also the dependencies).
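A rough sketch of what I plan to run on the scratch VM (assuming the v1.14 kubeadm phase names; the certs phases write to /etc/kubernetes/pki/etcd/ and the etcd phase writes /etc/kubernetes/manifests/etcd.yaml):

# on a throwaway VM with kubeadm v1.14.x installed
kubeadm init phase certs etcd-ca
kubeadm init phase certs etcd-server
kubeadm init phase etcd local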

@neolit123 added the priority/awaiting-more-evidence label and removed the priority/important-soon label on Mar 28, 2019
@neolit123
Member

@terrywang
it was done here so that we can properly support stacked etcd members in an HA setup:
kubernetes/kubernetes#69486

more details here:
#1123

that said i think we had a way to handle this type of upgrade transparently between 1.12 and 1.13, so your 1.13 etcd manifest should have been auto-converted to use the network interface address. possibly something went wrong in the process, but also this is the first report we are seeing related to this.

please let me know if you remember anything like modifying the etcd manifests manually, which could have broken our 1.12->1.13 logic.

@neolit123
Member

closing in favor of: #1471

@terrywang
Author

terrywang commented Apr 1, 2019

@neolit123 Thanks again for the info. Good to know.

I've regenerated the static pod manifests by running the kubeadm phases with the latest version of kubeadm on a different VM, compared the differences, and made the necessary changes to the manifest in my cluster.

NOTE: in my case - --listen-client-urls=https://127.0.0.1:2379,https://192.168.100.21:2379.

However, when running kubeadm upgrade plan after the etcd pod had started, I got the following error:

root@k8s-node-1:~# kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error syncing endpoints with etc: context deadline exceeded

Will follow #1471 to regenerate the certificates for etcd when I have time.

@vdboor

vdboor commented Apr 2, 2019

Thanks Terry for the info, I had the same problem. Also following #1471.

My original cluster originated from Kubernetes 1.8 and was rebuilt during the 1.11 upgrade because it broke everything. My etcd also listens on localhost only:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: etcd
    tier: control-plane
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://127.0.0.1:2379
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --initial-advertise-peer-urls=https://127.0.0.1:2380
    - --initial-cluster=phenomenal.edoburu.nl=https://127.0.0.1:2380
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --listen-client-urls=https://127.0.0.1:2379
    - --listen-peer-urls=https://127.0.0.1:2380
    - --name=phenomenal.edoburu.nl
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: k8s.gcr.io/etcd:3.2.24
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get foo
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15
    name: etcd
    resources: {}
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
  hostNetwork: true
  priorityClassName: system-cluster-critical
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
status: {}

@jeanfabrice

+1
Same problem here, while trying to upgrade from 1.13.4 to 1.14.0:

# kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error syncing endpoints with etc: dial tcp 192.168.10.2:2379: connect: connection refused
#

@neolit123
Member

fix should be up in 1.14.1
(to be released soon)

@terrywang
Author

terrywang commented Apr 23, 2019

Update: I waited for kubeadm 1.14.1, it didn't actually fix the issue...

Luckily, simply by following the steps in #1471 mentioned by @mauilion, I was able to use the kubeadm phases (etcd-server certs and etcd) to regenerate the etcd TLS certificate so that it covers the k8s node IP, reconfigure etcd with the new listen-client-urls, start etcd, and subsequently run kubeadm upgrade plan successfully.
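In rough terms the steps looked something like this (adapted from #1471; paths assume the default kubeadm layout, so treat this as a sketch rather than exact commands):

# back up the old etcd server cert/key so kubeadm regenerates them
mv /etc/kubernetes/pki/etcd/server.crt /etc/kubernetes/pki/etcd/server.crt.bak
mv /etc/kubernetes/pki/etcd/server.key /etc/kubernetes/pki/etcd/server.key.bak

# regenerate the etcd server certificate (new SAN includes the node IP)
kubeadm init phase certs etcd-server

# regenerate the etcd static pod manifest (listen-client-urls now includes the node IP)
kubeadm init phase etcd local

# once kubelet restarts etcd from the new manifest:
kubeadm upgrade plan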

The reason kubeadm upgrade plan failed with the following error was that the etcd server TLS certificate's SAN did not cover the IP of the k8s node on which etcd was running, so the connection simply failed.

[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error syncing endpoints with etc: context deadline exceeded

The certificate SAN should look like the one below:

        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Subject Alternative Name:
                DNS:k8s-node-1, DNS:localhost, IP Address:192.168.100.21, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1, IP Address:10.192.0.2
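For anyone checking their own cluster, the SAN on the existing certificate can be inspected with something like this (assuming the default kubeadm path):

openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -text | grep -A1 "Subject Alternative Name"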

@neolit123
Member

Update: I waited for kubeadm 1.14.1, it didn't actually fix the issue...

hm, it should have. the PR that @fabriziopandini created was merged and tested by at least a couple of people.

The certificate SAN should look like below

and your existing cert was missing 192.168.100.21?

@terrywang
Author

Yes, the existing etcd server certificate SAN was missing 192.168.100.21.

I may have forgotten to restore the static pod manifest for etcd (I had added the node's private IP inside a VPC subnet to --listen-client-urls as a workaround while troubleshooting); this may be the reason why kubeadm upgrade plan still failed with v1.14.1.

Anyway, the problem is well solved.

Really appreciate your input and assistance, enjoyed the learning experience ;-)

@neolit123
Member

I may have forgotten to restore the static pod manifest for etcd (I had added the node's private IP inside a VPC subnet to --listen-client-urls as a workaround while troubleshooting); this may be the reason why kubeadm upgrade plan still failed with v1.14.1.

yes, that may be the cause.
