
node-controller: Support an annotation to hold updates #2163

Closed

Conversation

cgwalters
Member

Today the MCO arbitrarily chooses a node to update from the candidates.
We want to allow admins to avoid specific nodes entirely.

(Aside: This replaces the defunct etcd-specific ordering code that
we didn't end up using)

Add an annotation `machineconfiguration.openshift.io/hold` that allows
an external controller (and/or human) to avoid specific nodes.

Related: #2059
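
For illustration, a minimal client-go sketch of how an external controller might set such an annotation; the `true`/`false` value semantics, the patch shape, and the node name are assumptions for this sketch, not part of the proposal:

```go
// Hypothetical sketch: an external controller setting the proposed hold
// annotation via client-go. The annotation key comes from this PR; the value
// semantics ("true"/"false"), patch shape, and node name are assumptions.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const holdAnnotation = "machineconfiguration.openshift.io/hold"

// setHold marks (or unmarks) a node so the node controller would skip it when
// picking update candidates.
func setHold(ctx context.Context, client kubernetes.Interface, nodeName string, held bool) error {
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":"%t"}}}`, holdAnnotation, held))
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// "worker-1" is a hypothetical node name.
	if err := setHold(context.Background(), client, "worker-1", true); err != nil {
		panic(err)
	}
}
```

A human could do the equivalent with `oc annotate node/worker-1 machineconfiguration.openshift.io/hold=true` (again assuming `true` is the value the controller checks for).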

@cgwalters
Member Author

Only compile-tested locally.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2020
@crawford
Contributor

High-level direction looks good to me. Can you split this into multiple commits since this is also pulling out unrelated functionality? This will need an update to the docs.

How does an admin know that machines in the cluster are held? What does the MCO's ClusterOperator object look like when a machine is in the held state? We'll need to make it easy for an admin to know that machines have been held, otherwise people are going to forget and it's going to cause problems with upgrades.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Oct 15, 2020

> How does an admin know that machines in the cluster are held? What does the MCO's ClusterOperator object look like when a machine is in the held state? We'll need to make it easy for an admin to know that machines have been held, otherwise people are going to forget and it's going to cause problems with upgrades.

I actually think pausing nodes indefinitely (and allowing them to skip an update) undermines the point of pools entirely (whereas specific ordering does not). If the point of a pool is knowing that all members of the pool have a certain rendered-config, skipping an update makes the pool status and information meaningless. If you can do that, what does it mean to be a member of a pool? If a node doesn't want the updates given to a specific pool, why is it a member?

As a UI, the MCO thinks in pools for its status; how can we reflect a pool as upgraded if all members don't have the same config? In the long term, how do we troubleshoot in a general sense when the state of a node is no longer easily known? If there's an MCO issue, do I have to dig through annotations and rendered configs? What if a node is paused on a required update path?

If the hold is indefinite (and not just a "hold this node for last" or something), I think we really do start to hit some fundamental questions/concerns and some long-term confusion/debugging risks.

Note: If I'm misunderstanding what the intention of "hold" is please let me know =)

@ashcrow ashcrow requested review from runcom and removed request for ashcrow October 15, 2020 19:22
@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Oct 15, 2020

Also adding a hold (:drum: ) since the MCO team has some questions/concerns that need to be addressed and would also like more background/use case info.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 15, 2020
@cgwalters
Member Author

cgwalters commented Oct 15, 2020

> If the point of a pool is knowing that all members of the pool have a certain rendered-config, skipping an update makes the pool status and information meaningless.

This doesn't skip an update - the node isn't updated and the pool wouldn't "skip" the node. The node would still count as not updated.

The pool would progress until the only nodes remaining are "held" and at that point I think the status should be the same as if the pool was paused.

@kikisdeliveryservice
Contributor

> The pool would progress until the only nodes remaining are "held" and at that point I think the status should be the same as if the pool was paused.

I guess this brings me to the question/possible solution of: If we're holding multiple nodes of a pool, why not designate them into a custom pool to preserve the UI/general sense of the MCO dealing in pools? Is there something preventing that or some problem with custom pools that we could address?

I'm starting to compile a list of our questions because I feel like there's some color we're missing here.

@cgwalters
Member Author

> If we're holding multiple nodes of a pool, why not designate them into a custom pool to preserve the UI/general sense of the MCO dealing in pools? Is there something preventing that or some problem with custom pools that we could address?

  • In some cases one might want this for the control plane nodes, and we can't change the pool for those
  • Switching the pool a node is in currently causes a drain/reboot (because the rendered MC name changes; we might be able to fix that)
  • A number of cases of this would involve avoiding just one or two nodes; making a pool for a single node is awkward at best, and it's kind of a "permanent" construct where this is more of a transient thing IMO

@cgwalters
Member Author

(Will circle back to this on Monday but) To expand on the logic here, we're just changing the node controller's `getCandidateMachines()`.

Nothing else calls that function - in particular when we go to calculate the pool status, notice:

func calculateStatus(pool *mcfgv1.MachineConfigPool, nodes []*corev1.Node) mcfgv1.MachineConfigPoolStatus {
	machineCount := int32(len(nodes))

	updatedMachines := getUpdatedMachines(pool.Spec.Configuration.Name, nodes)
	updatedMachineCount := int32(len(updatedMachines))

	readyMachines := getReadyMachines(pool.Spec.Configuration.Name, nodes)
	readyMachineCount := int32(len(readyMachines))

	unavailableMachines := getUnavailableMachines(nodes)
	unavailableMachineCount := int32(len(unavailableMachines))

...

	allUpdated := updatedMachineCount == machineCount &&
		readyMachineCount == machineCount &&
		unavailableMachineCount == 0

Now we do need to patch this logic, probably to add a "held node" count to the pool status, but the important thing to notice here is that held machines don't count as updated, so the pool won't count as updated until they're un-held. Another way to look at this is that holding a machine is much like marking it unschedulable - the node is still there and counts.
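
For concreteness, here is a minimal sketch (not the actual diff in this PR) of what that candidate filtering could look like, assuming the annotation value `true` means held:

```go
// Hypothetical sketch of the candidate filtering described above: held nodes
// are dropped from the update candidates only, so calculateStatus() still
// counts them and the pool never reports fully updated while they lag behind.
package node

import corev1 "k8s.io/api/core/v1"

const holdAnnotation = "machineconfiguration.openshift.io/hold"

func filterHeldNodes(candidates []*corev1.Node) []*corev1.Node {
	filtered := make([]*corev1.Node, 0, len(candidates))
	for _, node := range candidates {
		if node.Annotations[holdAnnotation] == "true" {
			// Held: never selected for an update, but still included in the
			// machineCount/updatedMachineCount math above, much like an
			// unschedulable node.
			continue
		}
		filtered = append(filtered, node)
	}
	return filtered
}
```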

@yuqi-zhang
Contributor

yuqi-zhang commented Oct 20, 2020

So here's a general concern I have with this: we really should only allow updating to one version at a time, and since we don't have the concept of queued updates, paused nodes should block the NEXT update because we can't guarantee edges in the future, which somewhat invalidates this path.

Let me give an example:

Today let's say your worker nodes (named A, B, C, D) are on rendered-1, and you add an MC to create rendered-2. Let's say A starts the update first. While A is updating you add a new MC that causes rendered-3. The worker pool will target that immediately, so A, after it finishes its update, will go to rendered-3, BUT because B, C, and D have not yet started, they will skip the in-between step.

Basically:
A: rendered-1 -> rendered-2 -> rendered-3
BCD: rendered-1 -> rendered-3

At a glance that might seem like behaviour we want, and it's fine-ish for our regular operations. But now let's add pausing to the picture, and use a more concrete example with actual release versions.

Same nodes, installed on 4.4, and now you're upgrading to 4.5. Let's say you pause D, and A, B, and C finish the upgrade without a problem. Now you have:
ABC: 4.5
D: 4.4

Now you want to start another upgrade to 4.6. There is no edge from 4.4 -> 4.6. A, B, and C, although on 4.5 and able to upgrade, should not. Otherwise we're going to leave D in a limbo where it can never be un-held without risking a failed update. We also don't have a queue to tell D to eventually go 4.4 -> 4.5 -> 4.6, so held nodes would forever block updates until there are no held nodes from the previous version, which somewhat defeats the point of node holds because you can't drift more than one version.

Now of course we can add a lot of complex logic here (edge checking, queued updates, or something), but if we do go down that path I wouldn't recommend a quick-and-dirty annotation like this; it would need a more fleshed-out design.

Maybe my assessment is not entirely correct but just wanted to add this for discussion.

@yuqi-zhang
Contributor

I guess my comment doesn't really apply to this PR specifically. It's more about the use case where some paused nodes block all updates forever; I feel we may want a better solution for that case.

@cgwalters
Member Author

This discussion is strongly related to the "worker pool blocking cluster updates across major versions" thread that came up (not sure if it's on GH or just in the internal jira dark matter).

Remember that today the worker pool does not block upgrades - we have had in the past (and will continue to have) potential failure scenarios where a worker is stuck on e.g. 4.4 because of some failure to drain, the control plane is all the way on 4.6, and we end up trying to jump the worker (if the drain gets unstuck) straight from 4.4 to 4.6.

Or the admin could have just straight up paused the worker pool and forgotten about it.

IOW this capability is not introducing a new possible failure mode. (I would probably agree it makes it more likely though)

And yes I think the MCO should learn about major version updates vs minor.

@yuqi-zhang
Contributor

I agree, I'm not against this PR, just that "pausing" seems like a paradigm we should not be supporting for extended lengths of time. You can pause a few nodes for a short time while they're running critical workloads, but you should not be holding them for weeks/months to drift the cluster.

@cgwalters
Member Author

> but you should not be holding them for weeks/months to drift the cluster.

I think it's a bit more nuanced than that. Remember it's only drift if the admin starts an `oc adm upgrade`. Or to say this another way, a worker node sitting there for months in hold isn't a problem if the control plane isn't upgrading either and the admin is OK with the tradeoff of not applying node security updates. (In some cases admins may wave their hands and say that there are no untrusted workloads, or in other cases those workloads are all isolated as VMs, etc.)

@crawford
Contributor

Catching up a bit...

@kikisdeliveryservice I completely agree with your argument and I think @yuqi-zhang does a good job of spelling out just how nasty things can get. The tentative agreement is that the customers who make use of this feature will enjoy the responsibility of ensuring that the myriad versions of RHCOS that make up their cluster and the updates they apply remain a subgraph of the published Cincinnati graph. The overwhelming majority of cluster administrators will not be able to effectively follow these constraints, so in practice, this holding mechanism will only be useful for temporary actions on a machine (e.g. debugging) and really should not be used across updates. Come to think of it, we should include telemetry and an alerting rule so that administrators are reminded when a machine remains paused for too long.

@openshift-merge-robot
Contributor

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/okd-e2e-aws | 9bd2397 | link | /test okd-e2e-aws |
| ci/prow/e2e-aws-serial | 9bd2397 | link | /test e2e-aws-serial |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

> and the updates they apply remain a subgraph of the published Cincinnati graph.

Note that today the worker pool isn't gating, so arbitrary worker node skew can and definitely does happen. IOW there's no relationship today between Cincinnati and the worker pool. There were discussions about fixing that so that the MCO wouldn't allow upgrading the control plane across majors while workers were at N-1 but it doesn't exist today.

I'd agree explicitly supporting it this way makes it more likely for a few nodes to skew, but failure to drain can also cause skew (and that's come up in a few bug reports).

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 16, 2021
@geoberle

> I agree, I'm not against this PR, just that "pausing" seems like a paradigm we should not be supporting for extended lengths of time. You can pause a few nodes for a short time while they're running critical workloads, but you should not be holding them for weeks/months to drift the cluster.

I know this is mostly a technical discussion between peers, but I hope you'll allow me to bring in a cluster admin perspective.

We have exactly the situation where our users start critical jobs that run for several hours. We would like not to interrupt them. On OpenShift 4 it is not apparent how to achieve that. Having multiple machine config pools and controlling upgrades via pause was an initial idea of ours, but it felt a bit like misuse, so we discarded it.

Holding those few nodes and tainting them against new pods at the beginning of a cluster upgrade sounds manageable in our case. Since critical workloads are properly labeled, we would automate the hold/unhold actions during cluster upgrades or MachineConfig changes, roughly as sketched below.
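
A rough sketch of what that automation could look like with client-go; the taint key here is hypothetical and the hold annotation value semantics are assumed, since this PR doesn't define them:

```go
// Hypothetical sketch: before a cluster upgrade, hold a node that is running a
// critical job and taint it so new pods avoid it. The taint key/effect and the
// "true" value are illustrative assumptions, not defined by this PR.
package holdautomation

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const holdAnnotation = "machineconfiguration.openshift.io/hold"

func holdAndTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[holdAnnotation] = "true"
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/critical-job-running", // hypothetical taint key
		Effect: corev1.TaintEffectNoSchedule,
	})
	// A production version would use a patch or retry on update conflicts.
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

Unholding after the upgrade would be the reverse: remove the annotation (or set it to `false`) and drop the taint.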

Since the risk of skew was discussed here as well: node drains are not always reliable, so watching for blocking pods and incomplete updates on pools is required as part of day-2 operations anyway.

@cgwalters
Member Author

> I know this is mostly a technical discussion between peers, but I hope you'll allow me to bring in a cluster admin perspective.

That's fine, our discussions are public partially for this reason!

> We have exactly the situation where our users start critical jobs that run for several hours. We would like not to interrupt them. On OpenShift 4 it is not apparent how to achieve that.

Broadly speaking, https://kubernetes.io/docs/tasks/run-application/configure-pdb/

@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 25, 2021
@cgwalters
Member Author

/remove-lifecycle rotten
came up again today

@openshift-ci-robot openshift-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 9, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 21, 2021

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-serial | 9bd2397 | link | /test e2e-aws-serial |
| ci/prow/e2e-agnostic-upgrade | 9bd2397 | link | /test e2e-agnostic-upgrade |
| ci/prow/images | 9bd2397 | link | /test images |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

We're having a big debate on a related issue in coreos/zincati#245
and one thing I recently realized is that systemd has a nice API to block reboots.

I think the MCO should honor `systemd-inhibit --mode=block` - basically, if an admin used SSH to add a node reboot blocker via systemd, then the MCD should reflect that state into the node via an annotation, and then the node controller knows to automatically avoid it.
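
A rough sketch of what that MCD-side detection might look like; `systemd-inhibit --list` is a real command (reboot blocking is done via the `shutdown` lock type), but the parsing and the annotation plumbing are hand-waved assumptions rather than a worked-out design:

```go
// Hypothetical sketch: detect a block-mode shutdown inhibitor via
// "systemd-inhibit --list"; the MCD could then reflect that into a node
// annotation (e.g. the hold annotation proposed here) for the node controller
// to honor. The line parsing is deliberately crude.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func rebootInhibited() (bool, error) {
	out, err := exec.Command("systemd-inhibit", "--list").CombinedOutput()
	if err != nil {
		return false, err
	}
	for _, line := range strings.Split(string(out), "\n") {
		// A block-mode inhibitor whose "What" includes shutdown blocks reboots.
		if strings.Contains(line, "shutdown") && strings.Contains(line, "block") {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	inhibited, err := rebootInhibited()
	if err != nil {
		panic(err)
	}
	fmt.Println("reboot inhibited:", inhibited)
}
```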

Doing this via systemd has a lot of nice properties:

@openshift-ci
Contributor

openshift-ci bot commented May 25, 2021

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 25, 2021
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2021
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 23, 2021
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci
Contributor

openshift-ci bot commented Oct 23, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
