Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docs for the new kops reconcile cluster command #17191

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

rifelpet
Copy link
Member

@rifelpet rifelpet commented Jan 9, 2025

/hold for feedback

A few open questions:

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?
  2. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)
  3. Do we add a new permalink that the error message links to?
  4. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 9, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from hakman January 9, 2025 02:39
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 9, 2025
@stl-victor-sudakov
Copy link

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

@rifelpet
Copy link
Member Author

rifelpet commented Jan 9, 2025

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

@rifelpet
Copy link
Member Author

rifelpet commented Jan 9, 2025

We could allow the user to bypass the error if they know what they're doing. for example, on clusters that dont use Cluster Autoscaler or Karpenter.

@stl-victor-sudakov
Copy link

Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

Hold on, hasn't this sequence:

kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes

always been the standard upgrade sequence? These steps are even documented in https://kops.sigs.k8s.io/operations/updates_and_upgrades/#automated-update And now it is an error?

@rifelpet
Copy link
Member Author

rifelpet commented Jan 9, 2025

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

@stl-victor-sudakov
Copy link

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

@rifelpet
Copy link
Member Author

rifelpet commented Jan 9, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

@stl-victor-sudakov
Copy link

stl-victor-sudakov commented Jan 10, 2025

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

Oh, what a longread! Maybe #16907 would be shorter and more to the point, it is also mentioned within the longer post.

However I think I understand the innovation now. The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

@danports
Copy link
Contributor

The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

Yes, that's correct.

@@ -1,15 +1,59 @@
# Upgrading kubernetes

## **NOTE for Kubernetes >1.31**

Kops' upgrade procedure has hostorically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). Between `kops update cluster --yes` and every kube-apiserver being rotated with `kops rolling-update cluster --yes`, newly launched nodes running new kubelet versions could be connecting to older `kube-apiserver` nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Kops' upgrade procedure has hostorically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). Between `kops update cluster --yes` and every kube-apiserver being rotated with `kops rolling-update cluster --yes`, newly launched nodes running new kubelet versions could be connecting to older `kube-apiserver` nodes.
Kops' upgrade procedure has historically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). After `kops update cluster --yes` completes and before every kube-apiserver is replaced with `kops rolling-update cluster --yes`, newly launched nodes running newer kubelet versions could be connecting to older `kube-apiserver` nodes.

2. `kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes`
3. `kops update cluster --yes`
4. `kops rolling-update cluster --yes`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
5. `kops update cluster --prune --yes`


Upgrading kubernetes is similar to changing the image on an InstanceGroup, except that the kubernetes version is
Upgrading kubernetes is similar to changing the image on an InstanceGroup, the kubernetes version is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the language here was clearer with "except that".

@k8s-ci-robot
Copy link
Contributor

@danports: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danports
Copy link
Contributor

  1. Do we update all the rolling-update cluster docs references to reconcile cluster?

Yes, eventually, but I don't think it needs to happen right away, since reconcile cluster probably needs more time to mature and should support additional options that update cluster/rolling-update cluster already support. See #17146 for instance.

  1. Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

That would be a smart idea, though perhaps it should only be a warning if the cluster doesn't have CAS/Karpenter enabled.

  1. Do we add a new permalink that the error message links to?

👍 More context for error messages is always good.

  1. Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster ?

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?

@k8s-ci-robot
Copy link
Contributor

@rifelpet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kops-e2e-gce-cni-calico ae625a6 link true /test pull-kops-e2e-gce-cni-calico
pull-kops-e2e-gce-cni-kindnet ae625a6 link true /test pull-kops-e2e-gce-cni-kindnet

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants