
Fix safe upgrade #2256

Merged (3 commits, Feb 9, 2018)

Conversation

mlushpenko
Contributor

Problem

A kubespray cluster has been running for some time and you want to safely upgrade it to a newer version using upgrade-cluster.yml.

It will fail during the [kubernetes/kubeadm : Join to cluster if needed] task with this error:

"stdout": "[preflight] Running pre-flight checks.\n[discovery] Trying to connect to API Server \"10.94.45.185:6443\"\n[discovery] Created cluster-info discovery client, requesting info from \"https://10.94.45.185:6443\"\n[discovery] Failed to connect to API Server \"10.94.45.185:6443\": there is no JWS signed token in the cluster-info ConfigMap. This token id \"abcdef\" is invalid for this cluster, can't connect\n

Expected result

kubeadm join should succeed, as kubeadm_token_ttl is set to 0, which means the token should never expire. However, the token is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time).
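
For reference, the symptom can be checked with a couple of tasks like these (a sketch only; it assumes kubespray's bin_dir variable and kube-master group):

# Sketch: list bootstrap tokens on the first master to see whether any join token still exists.
- name: List kubeadm bootstrap tokens
  command: "{{ bin_dir }}/kubeadm token list"
  register: token_list
  changed_when: false
  delegate_to: "{{ groups['kube-master'][0] }}"
  run_once: true

- name: Show the token list
  debug:
    var: token_list.stdout_lines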

Related issues

kubernetes/kubeadm#335

Solution

Create a new temporary token before the kubeadm join command
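
A minimal sketch of the approach, assuming a task placed just before the join step; the task names and the set_fact wiring are illustrative, not the exact code in this PR:

# Generate a fresh bootstrap token on the first master right before nodes join,
# so an expired or missing token cannot break the upgrade.
- name: Create temporary kubeadm token for joining nodes
  command: "{{ bin_dir }}/kubeadm token create"
  register: temp_token
  delegate_to: "{{ groups['kube-master'][0] }}"
  run_once: true

# Make the generated token available to the kubeadm-client.conf template.
- name: Override kubeadm_token with the freshly created token
  set_fact:
    kubeadm_token: "{{ temp_token.stdout }}"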

Refactoring issues

I am not sure what to do with kubeadm_token and kubeadm_token_ttl, which are defined in roles/kubespray-defaults/defaults/main.yml. The code I added doesn't break anything as far as I tested, but it looks like kubeadm_token_ttl is not respected, so perhaps it can be removed. kubeadm_token is also used for the master config, so it can stay untouched, but it is a bit odd that this token is then not used during kubeadm join, because I override it with a newly generated one. Please suggest ideas if you see how to optimize this.
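
For context, the two variables under discussion are defined in roles/kubespray-defaults/defaults/main.yml and look roughly like this (the values shown are illustrative placeholders, not the exact defaults):

# roles/kubespray-defaults/defaults/main.yml (illustrative excerpt)
kubeadm_token: "abcdef.0123456789abcdef"  # static join token in kubeadm's id.secret format
kubeadm_token_ttl: 0                      # 0 is intended to mean "never expire"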

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 5, 2018
@mlushpenko
Contributor Author

I wrote an email about issues with the CLA verification, but perhaps you can review the request while that is being sorted out.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 5, 2018
@ant31
Contributor

ant31 commented Feb 6, 2018

How is the initial installation working without a token?

@mlushpenko
Contributor Author

@ant31 so before this commit the process is as follows:

kubeadm_token and kubeadm_token_ttl are set in roles/kubespray-defaults/defaults/main.yml and then used in templates during kubeadm init --config kubeadm-config.yaml on the master nodes and kubeadm join --config kubeadm-client.conf on the worker nodes. This token is saved as a secret in the kube-system namespace in the form bootstrap-token-198eaa, but it expires, I think after 24h, so it can't be used later during the upgrade process. We will redeploy the whole cluster from scratch just to confirm the idea.

With this commit, we can actually remove kubeadm_token_ttl completely and leave kubeadm_token just as a placeholder in kubeadm-client.conf. During the initial setup, the master node will just execute kubeadm init without any tokens, as they are generated by default (I think the token was hardcoded so that the master and node configs would share the same value). Then, during kubeadm join --config kubeadm-client.conf, a new temporary token is generated and used, whether it is the initial setup or the upgrade process.

I will update that piece of code, test it on the cluster again and come back.
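
For illustration, a rough sketch of how the node-side config can consume such a token. The field names follow the kubeadm v1alpha1 NodeConfiguration API that was current at the time, and kube_apiserver_endpoint is assumed to be the usual kubespray variable; this is not the exact template from the repo:

# kubeadm-client.conf.j2 (sketch, not the exact kubespray template)
apiVersion: kubeadm.k8s.io/v1alpha1
kind: NodeConfiguration
# Either the static kubeadm_token or the temporary one generated at join time.
token: "{{ kubeadm_token }}"
discoveryTokenAPIServers:
  - "{{ kube_apiserver_endpoint | replace('https://', '') }}"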

@k8s-ci-robot
Contributor

(Same CLA signing reminder as above.)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 6, 2018
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from 8bfd885 to ea5d8de on February 6, 2018 14:50
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 6, 2018
@mlushpenko
Contributor Author

@ant31 I updated the code and here is the full testing process, run against clean VMs:

1. Deploy version 1.8.1 first (no initial kubeadm_token hardcoded):

ansible-playbook cluster.yml -i inventory/inventory-test -e ansible_ssh_user=ansible -e kube_version=v1.9.1 -b -c paramiko -e kubeadm_checksum='312aeca9f56605e5d117ef901a2d8bceb701cca9662017ceb362c0d1aa91e13a'

2. Check that kubeadm tokens were generated (as the output showed, they expire in 23h).

3. Delete the tokens to mimic a cluster that has been running for some time.

4. Upgrade the cluster to the latest version, 1.9.2; the playbook updates the nodes one by one:

ansible-playbook upgrade-cluster.yml -i inventory/inventory-test -e ansible_ssh_user=ansible -b -c paramiko

5. The cluster is deployed, but the master nodes are left cordoned.

I checked the code, and the uncordon task is executed only for worker nodes, so I fixed that as well, since it is part of the safe upgrade process. To double-check that the cordoned masters weren't just some unlucky error, I ran only the pre-upgrade and post-upgrade tags (which perform the cordon and uncordon actions) on the master nodes:

ansible-playbook upgrade-cluster.yml -i inventory/inventory-test -e ansible_ssh_user=deploy -b -c paramiko -l kube-master -t pre-upgrade,post-upgrade

The master nodes were still reporting SchedulingDisabled. After updating the when condition in roles/upgrade/post-upgrade/tasks/main.yml and rerunning the previous command, everything was fixed.
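
For reference, a sketch of the kind of change described above; the needs_cordoning flag is an illustrative name rather than necessarily the exact variable used in roles/upgrade:

# roles/upgrade/post-upgrade/tasks/main.yml (sketch): uncordon every node that was
# cordoned during pre-upgrade, masters included, instead of only kube-node hosts.
- name: Uncordon node after upgrade
  command: "{{ bin_dir }}/kubectl uncordon {{ inventory_hostname }}"
  delegate_to: "{{ groups['kube-master'][0] }}"
  when: needs_cordoning | default(false)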

Please let me know if I can improve it further or if it can be merged.

@ant31
Contributor

ant31 commented Feb 6, 2018

@mlushpenko thanks for the detailed explanation and all the tests you made!

@ant31
Contributor

ant31 commented Feb 6, 2018

ci check this

# Excerpt from the diff under review: the token-creation task runs only once
# and is delegated to the first master.
run_once: true
register: temp_token
delegate_to: "{{ groups['kube-master'][0] }}"

Contributor

@ant31 ant31 Feb 6, 2018
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from ea5d8de to 56b311c on February 6, 2018 21:00
@mlushpenko
Contributor Author

@ant31 sorry, I wasn't using a proper editor :) please check now

@mlushpenko
Contributor Author

Hi @ant31, any update on this? I know you may be busy, but it could also be that you just missed my previous notification among all the others.

@chapsuk
Contributor

chapsuk commented Feb 8, 2018

@mlushpenko
Contributor Author

@chapsuk done, anything else?

@ant31
Contributor

ant31 commented Feb 8, 2018

Sorry @mlushpenko, it needs a rebase to resolve the conflict, and I'll merge right after.

Contributor

@chapsuk chapsuk left a comment

Thanks!

@mlushpenko
Contributor Author

thanks @ant31, looking forward :)

@ant31
Contributor

ant31 commented Feb 9, 2018

@mlushpenko merge is blocked because inventory/group_vars/all.yml has a conflict

@mlushpenko
Contributor Author

@ant31 yes, I get it, but I can't do it from my side and just need to wait until you do the rebase. No problem.

@ant31
Contributor

ant31 commented Feb 9, 2018

@mlushpenko the rebase has to be done on your branch, something like:

git checkout master 
git pull origin master
git checkout fix-kubeadm-safe-upgrade
git rebase master
### Fix the conflict in inventory/group_vars/all.yml  (probably because it moved to sample/group_vars)
git rebase --continue
git push -f

Even though kubeadm_token_ttl=0 is set, which means the kubeadm token should never expire, it is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an issue about this, kubernetes/kubeadm#335, so we need to create a new temporary token during the cluster upgrade.
Tokens are generated automatically during the init process and on demand when nodes join.
@mlushpenko mlushpenko force-pushed the fix-kubeadm-safe-upgrade branch from 08fedce to a37c642 on February 9, 2018 14:53
@mlushpenko
Contributor Author

@ant31 thanks, my first PR, as you may have guessed...

@ant31 ant31 merged commit 60460c0 into kubernetes-sigs:master Feb 9, 2018
@ant31
Contributor

ant31 commented Feb 9, 2018

thanks :)

@mlushpenko mlushpenko deleted the fix-kubeadm-safe-upgrade branch February 9, 2018 19:52
Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)
4 participants