
Fix safe upgrade #2256

Merged (3 commits, Feb 9, 2018)

Conversation

mlushpenko
Contributor

Problem

A kubespray cluster has been running for some time and you want to safely upgrade it to a newer version using upgrade-cluster.yml.

It will fail during the [kubernetes/kubeadm : Join to cluster if needed] task with this error:

"stdout": "[preflight] Running pre-flight checks.\n[discovery] Trying to connect to API Server \"10.94.45.185:6443\"\n[discovery] Created cluster-info discovery client, requesting info from \"https://10.94.45.185:6443\"\n[discovery] Failed to connect to API Server \"10.94.45.185:6443\": there is no JWS signed token in the cluster-info ConfigMap. This token id \"abcdef\" is invalid for this cluster, can't connect\n

Expected result

kubeadm join should succeed, as kubeadm_token_ttl is set to 0, which means the token should never expire. However, the token is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time).
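
For reference, the symptom can be checked with a couple of tasks like these (a sketch only; it assumes kubespray's bin_dir variable and kube-master group):

# Sketch: list bootstrap tokens on the first master to see whether any join token still exists.
- name: List kubeadm bootstrap tokens
  command: "{{ bin_dir }}/kubeadm token list"
  register: token_list
  changed_when: false
  delegate_to: "{{ groups['kube-master'][0] }}"
  run_once: true

- name: Show the token list
  debug:
    var: token_list.stdout_lines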

Related issues

kubernetes/kubeadm#335

Solution

Create a new temporary token before the kubeadm join command
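
A minimal sketch of the approach, assuming a task placed just before the join step; the task names and the set_fact wiring are illustrative, not the exact code in this PR:

# Generate a fresh bootstrap token on the first master right before nodes join,
# so an expired or missing token cannot break the upgrade.
- name: Create temporary kubeadm token for joining nodes
  command: "{{ bin_dir }}/kubeadm token create"
  register: temp_token
  delegate_to: "{{ groups['kube-master'][0] }}"
  run_once: true

# Make the generated token available to the kubeadm-client.conf template.
- name: Override kubeadm_token with the freshly created token
  set_fact:
    kubeadm_token: "{{ temp_token.stdout }}"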

Refactoring issues

I am not sure what to do with kubeadm_token and kubeadm_token_ttl, which are defined in roles/kubespray-defaults/defaults/main.yml. The code I added doesn't break anything as far as I tested, but it looks like kubeadm_token_ttl is not respected, so perhaps it can be removed. kubeadm_token is also used for the master config, so it can stay untouched, but it is a bit odd that this token is then not used during kubeadm join, because I override it with a newly generated one. Please suggest ideas if you see how to optimize this.
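
For context, the two variables under discussion are defined in roles/kubespray-defaults/defaults/main.yml and look roughly like this (the values shown are illustrative placeholders, not the exact defaults):

# roles/kubespray-defaults/defaults/main.yml (illustrative excerpt)
kubeadm_token: "abcdef.0123456789abcdef"  # static join token in kubeadm's id.secret format
kubeadm_token_ttl: 0                      # 0 is intended to mean "never expire"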

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2018
@mlushpenko
Contributor Author

I wrote an email about issues with the CLA verification, but perhaps you can review the request while that is being sorted out.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 5, 2018
@ant31
Contributor

ant31 commented Feb 6, 2018

How is the initial installation working without a token?

@mlushpenko
Contributor Author

@ant31 so before this commit the process is as follows:

kubeadm_token and kubeadm_token_ttl are set in roles/kubespray-defaults/defaults/main.yml and then used in templates during kubeadm init --config kubeadm-config.yaml on the master nodes and kubeadm join --config kubeadm-client.conf on the worker nodes. This token is saved as a secret in the kube-system namespace in the form bootstrap-token-198eaa, but it expires, I think after 24h, so it can't be used later during the upgrade process. We will redeploy the whole cluster from scratch just to confirm the idea.

With this commit, we can actually remove kubeadm_token_ttl completely and leave kubeadm_token just as a placeholder in kubeadm-client.conf. During the initial setup, the master node will just execute kubeadm init without any tokens, as they are generated by default (I think the token was hardcoded so that the master and node configs would share the same value). Then, during kubeadm join --config kubeadm-client.conf, a new temporary token is generated and used, whether it is the initial setup or the upgrade process.

I will update that piece of code, test it on the cluster again and come back.
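
For illustration, a rough sketch of how the node-side config can consume such a token. The field names follow the kubeadm v1alpha1 NodeConfiguration API that was current at the time, and kube_apiserver_endpoint is assumed to be the usual kubespray variable; this is not the exact template from the repo:

# kubeadm-client.conf.j2 (sketch, not the exact kubespray template)
apiVersion: kubeadm.k8s.io/v1alpha1
kind: NodeConfiguration
# Either the static kubeadm_token or the temporary one generated at join time.
token: "{{ kubeadm_token }}"
discoveryTokenAPIServers:
  - "{{ kube_apiserver_endpoint | replace('https://', '') }}"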

@k8s-ci-robot
Contributor

(Same CLA signing reminder as above.)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2018
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from 8bfd885 to ea5d8de on February 6, 2018 14:50
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 6, 2018
@mlushpenko
Contributor Author

@ant31 I updated the code and here is the full testing process, run against clean VMs:

1. Deploy version 1.8.1 first (no initial kubeadm_token hardcoded):

ansible-playbook cluster.yml -i inventory/inventory-test -e ansible_ssh_user=ansible -e kube_version=v1.9.1 -b -c paramiko -e kubeadm_checksum='312aeca9f56605e5d117ef901a2d8bceb701cca9662017ceb362c0d1aa91e13a'

2. Check that kubeadm tokens were generated (as the output showed, they expire in 23h).

3. Delete the tokens to mimic a cluster that has been running for some time.

4. Upgrade the cluster to the latest version, 1.9.2; the playbook updates the nodes one by one:

ansible-playbook upgrade-cluster.yml -i inventory/inventory-test -e ansible_ssh_user=ansible -b -c paramiko

5. The cluster is deployed, but the master nodes are left cordoned.

I checked the code, and the uncordon task is executed only for worker nodes, so I fixed that as well, since it is part of the safe upgrade process. To double-check that the cordoned masters weren't just some unlucky error, I ran only the pre-upgrade and post-upgrade tags (which perform the cordon and uncordon actions) on the master nodes:

ansible-playbook upgrade-cluster.yml -i inventory/inventory-test -e ansible_ssh_user=deploy -b -c paramiko -l kube-master -t pre-upgrade,post-upgrade

The master nodes were still reporting SchedulingDisabled. After updating the when condition in roles/upgrade/post-upgrade/tasks/main.yml and rerunning the previous command, everything was fixed.
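
For reference, a sketch of the kind of change described above; the needs_cordoning flag is an illustrative name rather than necessarily the exact variable used in roles/upgrade:

# roles/upgrade/post-upgrade/tasks/main.yml (sketch): uncordon every node that was
# cordoned during pre-upgrade, masters included, instead of only kube-node hosts.
- name: Uncordon node after upgrade
  command: "{{ bin_dir }}/kubectl uncordon {{ inventory_hostname }}"
  delegate_to: "{{ groups['kube-master'][0] }}"
  when: needs_cordoning | default(false)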

Please let me know if I can improve it further or if it can be merged.

@ant31
Contributor

ant31 commented Feb 6, 2018

@mlushpenko thanks for the detailed explanation and all the tests you made!

@ant31
Contributor

ant31 commented Feb 6, 2018

ci check this

# Excerpt from the diff under review: the token-creation task runs only once
# and is delegated to the first master.
run_once: true
register: temp_token
delegate_to: "{{ groups['kube-master'][0] }}"

Contributor

@ant31 ant31 Feb 6, 2018
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from ea5d8de to 56b311c on February 6, 2018 21:00
@mlushpenko
Contributor Author

@ant31 sorry, I wasn't using a proper editor :) please check now

@mlushpenko
Contributor Author

Hi @ant31, any update on this? I know you may be busy, but it could also be that you just missed my previous notification among all the others.

@chapsuk
Contributor

chapsuk commented Feb 8, 2018

@mlushpenko
Contributor Author

@chapsuk done, anything else?

@ant31
Contributor

ant31 commented Feb 8, 2018

Sorry @mlushpenko, it needs a rebase to resolve the conflict, and I'll merge right after.

Contributor

@chapsuk chapsuk left a comment

Thanks!

@mlushpenko
Contributor Author

thanks @ant31, looking forward :)

@ant31
Contributor

ant31 commented Feb 9, 2018

@mlushpenko merge is blocked because inventory/group_vars/all.yml has a conflict

@mlushpenko
Contributor Author

@ant31 yes, I get it, but I can't do it from my side and just need to wait until you do the rebase. No problem.

@ant31
Contributor

ant31 commented Feb 9, 2018

@mlushpenko the rebase has to be done on your branch, something like:

git checkout master 
git pull origin master
git checkout fix-kubeadm-safe-upgrade
git rebase master
### Fix the conflict in inventory/group_vars/all.yml  (probably because it moved to sample/group_vars)
git rebase --continue
git push -f

Even though kubeadm_token_ttl=0 is set, which means the kubeadm token should never expire, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue about this, kubernetes/kubeadm#335, so we need to create a new temporary token during the cluster upgrade.
Tokens are generated automatically during the init process and on demand when nodes join.
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from 08fedce to a37c642 on February 9, 2018 14:53
@mlushpenko
Contributor Author

@ant31 thanks, my first PR, as you may have guessed...

@ant31 ant31 merged commit 60460c0 into kubernetes-sigs:master Feb 9, 2018
@ant31
Contributor

ant31 commented Feb 9, 2018

thanks :)

@mlushpenko mlushpenko deleted the fix-kubeadm-safe-upgrade branch February 9, 2018 19:52
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)
4 participants