Add docs for the new kops reconcile cluster command #17191

rifelpet · 2025-01-09T02:39:23Z

/hold for feedback

A few open questions:

- Do we update all the rolling-update cluster docs references to reconcile cluster?
- Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)
- Do we add a new permalink that the error message links to?
- Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster?

k8s-ci-robot · 2025-01-09T02:39:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stl-victor-sudakov · 2025-01-09T09:05:09Z

> Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?

Why would this be an error?

rifelpet · 2025-01-09T12:59:19Z

> > Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?
>
> Why would this be an error?

Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.
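
For readers following along, the failure described above comes down to the kubelet version skew policy: a kubelet must not be newer than the kube-apiserver it talks to. Below is a minimal shell sketch of that check. The function names `minor_of` and `kubelet_skew_ok` are made up for illustration and are not part of kOps or kubectl, and the three-minor-version lower bound is an assumption that matches the upstream policy for recent releases.

```shell
# Hypothetical helpers for illustration only (not part of kOps or kubectl).

# Extract the minor version: "v1.31.4" or "1.31.4" -> 31
minor_of() {
  local v="${1#v}"   # drop an optional leading "v"
  v="${v#*.}"        # drop the major component
  echo "${v%%.*}"    # keep everything before the next dot
}

# Succeeds when a kubelet at version $1 may talk to a kube-apiserver at $2:
# the kubelet must not be newer, and (assumption: recent releases) may lag
# by at most three minor versions.
kubelet_skew_ok() {
  local kubelet api skew
  kubelet="$(minor_of "$1")"
  api="$(minor_of "$2")"
  skew=$((api - kubelet))
  [ "$skew" -ge 0 ] && [ "$skew" -le 3 ]
}
```

With these helpers, `kubelet_skew_ok v1.31.4 v1.30.8` fails, which is the situation of a freshly launched 1.31 node while the control plane is still on 1.30.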

rifelpet · 2025-01-09T13:01:56Z

We could allow the user to bypass the error if they know what they're doing, for example on clusters that don't use Cluster Autoscaler or Karpenter.

stl-victor-sudakov · 2025-01-09T14:07:06Z

> > > Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes?
>
> > Why would this be an error?
>
> Because updating both the cluster's control plane launch templates (or other cloud provider equivalents) and node launch templates at the same time will cause new nodes to fail to join the cluster until all control plane instances have been upgraded. So if Cluster Autoscaler or Karpenter scale up nodes before or during the control plane rolling-update, they will fail to join and workloads will be stuck in Pending. This is almost certainly not what the user wants and is why we're introducing the new command.

Hold on, hasn't this sequence:

kops upgrade cluster $NAME --yes
kops update cluster $NAME --yes
kops rolling-update cluster $NAME --yes

always been the standard upgrade sequence? These steps are even documented in https://kops.sigs.k8s.io/operations/updates_and_upgrades/#automated-update. And now it is an error?

rifelpet · 2025-01-09T14:53:31Z

Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.

I'll update that docs page to note this change as well.

stl-victor-sudakov · 2025-01-09T15:45:21Z

> Now it may cause node failures during the k8s 1.31 upgrade, yes. Hence the bold release note being added in this PR and my proposal to prevent users from making this mistake by returning a (skippable) error.
>
> I'll update that docs page to note this change as well.

Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

rifelpet · 2025-01-09T16:03:13Z

> Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?

I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

stl-victor-sudakov · 2025-01-10T05:51:37Z

> > Sorry for my persistence, what has changed in k8s 1.31 that the regular kOps upgrade procedure has become dangerous?
>
> I updated this PR to link to the k/k issue that goes into more detail: kubernetes/kubernetes#127316

Oh, what a long read! Maybe #16907 would be shorter and more to the point; it is also mentioned in the longer post.

However, I think I understand the innovation now. The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

danports · 2025-01-10T15:08:07Z

> The new reconcile command does "update --yes && rolling-update --yes" on CP nodes first, and then does the same on worker nodes, thus enforcing that the CP is fully updated first. Is this correct?

Yes, that's correct.

danports · 2025-01-10T15:12:54Z

docs/tutorial/upgrading-kubernetes.md

@@ -1,15 +1,59 @@
 # Upgrading kubernetes

+## **NOTE for Kubernetes >1.31**
+
+Kops' upgrade procedure has hostorically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). Between `kops update cluster --yes` and every kube-apiserver being rotated with `kops rolling-update cluster --yes`, newly launched nodes running new kubelet versions could be connecting to older `kube-apiserver` nodes.


Suggested change

-Kops' upgrade procedure has hostorically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). Between `kops update cluster --yes` and every kube-apiserver being rotated with `kops rolling-update cluster --yes`, newly launched nodes running new kubelet versions could be connecting to older `kube-apiserver` nodes.
+Kops' upgrade procedure has historically risked violating the [Kubelet version skew policy](https://kubernetes.io/releases/version-skew-policy/#kubelet). After `kops update cluster --yes` completes and before every kube-apiserver is replaced with `kops rolling-update cluster --yes`, newly launched nodes running newer kubelet versions could be connecting to older `kube-apiserver` nodes.

danports · 2025-01-10T15:15:10Z

docs/tutorial/upgrading-kubernetes.md

+2. `kops rolling-update cluster --instance-group-roles=control-plane,apiserver --yes`
+3. `kops update cluster --yes`
+4. `kops rolling-update cluster --yes`
+


Suggested change

+5. `kops update cluster --prune --yes`
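
Putting the steps under review together, the control-plane-first sequence can be sketched as a small dry-run helper. It only prints the commands and does not run them. `print_upgrade_plan` is a hypothetical name, and step 1 is not visible in the diff excerpt above, so the form shown here (the same role filter applied to `kops update cluster`) is an assumption.

```shell
# Hypothetical dry-run helper: prints the manual control-plane-first
# upgrade sequence for a cluster name, one command per line.
# Nothing is executed; pipe the output to a shell only after reviewing it.
print_upgrade_plan() {
  local name="$1"
  local cp_roles="control-plane,apiserver"

  # Step 1 (assumption, not shown in the diff excerpt): update only the
  # control plane and apiserver instance groups.
  echo "kops update cluster ${name} --instance-group-roles=${cp_roles} --yes"
  # Step 2: roll those instances so every kube-apiserver is on the new version.
  echo "kops rolling-update cluster ${name} --instance-group-roles=${cp_roles} --yes"
  # Steps 3 and 4: update and roll the remaining instance groups.
  echo "kops update cluster ${name} --yes"
  echo "kops rolling-update cluster ${name} --yes"
  # Step 5 (from the suggestion above): prune leftover resources.
  echo "kops update cluster ${name} --prune --yes"
}
```

For example, `print_upgrade_plan "$NAME"` lists the five commands in order, which makes it easy to compare against what `kops reconcile cluster` is expected to do in a single invocation.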

danports · 2025-01-10T15:17:34Z

docs/tutorial/upgrading-kubernetes.md


-Upgrading kubernetes is similar to changing the image on an InstanceGroup, except that the kubernetes version is
+Upgrading kubernetes is similar to changing the image on an InstanceGroup, the kubernetes version is


I think the language here was clearer with "except that".

k8s-ci-robot · 2025-01-10T15:18:09Z

@danports: changing LGTM is restricted to collaborators

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

danports · 2025-01-10T15:27:24Z

> Do we update all the rolling-update cluster docs references to reconcile cluster?

Yes, eventually, but I don't think it needs to happen right away, since reconcile cluster probably needs more time to mature and should support additional options that update cluster/rolling-update cluster already support. See #17146 for instance.

> Do we return an error when a user tries to upgrade from k8s 1.30 to 1.31 using update cluster --yes? (with no --instance-group* filtering)

That would be a smart idea, though perhaps it should only be a warning if the cluster doesn't have CAS/Karpenter enabled.

> Do we add a new permalink that the error message links to?

👍 More context for error messages is always good.

> Do we mention the new update cluster --reconcile flag? When would a user use it instead of kops reconcile cluster?

I am confused about this myself. Based on the commit history I think maybe @justinsb just added the --reconcile flag as an initial step before adding the full reconcile cluster command, so maybe we don't need --reconcile anymore?
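
The error-versus-warning idea discussed in this thread can be sketched as a small decision function. This is only an illustration of the proposal, not kOps behavior: `upgrade_guard` and its arguments are hypothetical, and treating 1.31 as the first affected minor version is an assumption drawn from the 1.30 to 1.31 example.

```shell
# Hypothetical decision helper illustrating the proposal; not kOps code.
# Args: $1 current minor version (e.g. 30)
#       $2 target minor version (e.g. 31)
#       $3 "yes" if --instance-group* filtering is in use
#       $4 "yes" if Cluster Autoscaler or Karpenter is enabled
# Prints one of: ok, warning, error.
upgrade_guard() {
  local current="$1" target="$2" filtered="$3" autoscaler="$4"
  if [ "$target" -le "$current" ] || [ "$target" -lt 31 ] || [ "$filtered" = "yes" ]; then
    # Not a minor upgrade into the affected range (assumed: 1.31 and later),
    # or the user is already staging the rollout by instance group.
    echo ok
  elif [ "$autoscaler" = "yes" ]; then
    # New nodes may be launched before the control plane is upgraded.
    echo error
  else
    echo warning
  fi
}
```

A skippable error, as proposed earlier, would then only need a bypass option checked before acting on the `error` result.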

k8s-ci-robot · 2025-01-13T21:37:57Z

@rifelpet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-kops-e2e-gce-cni-calico | `ae625a6` | link | true | `/test pull-kops-e2e-gce-cni-calico` |
| pull-kops-e2e-gce-cni-kindnet | `ae625a6` | link | true | `/test pull-kops-e2e-gce-cni-kindnet` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

k8s-ci-robot added the do-not-merge/hold label Jan 9, 2025

k8s-ci-robot requested a review from hakman January 9, 2025 02:39

k8s-ci-robot added the area/documentation label Jan 9, 2025

k8s-ci-robot requested a review from johngmyers January 9, 2025 02:39

k8s-ci-robot added the cncf-cla: yes and size/M labels Jan 9, 2025

Add docs for the new kops reconcile cluster command

46ea30d

rifelpet force-pushed the reconcile-docs branch from e1ff234 to 46ea30d Compare January 9, 2025 03:00

Add more details and update updates_and_upgrades.md

ae625a6

danports suggested changes Jan 10, 2025
