
Bug 1873288: server: Target the spec configuration if we have at least one node #2035

Merged: 1 commit merged into openshift:master on Nov 6, 2020

Conversation

@cgwalters (Member)

The CI cluster hit an issue where a pull secret was broken, and we then hit a deadlock: the MCO failed to drain nodes on the old config because the other nodes, also on the old config, couldn't schedule the pod.

It just generally makes sense for new nodes to use the new config; do so as long as at least one node has successfully joined the cluster at that config. This way we still avoid breaking the cluster (and scale-up) with a bad config.

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1873288
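
In other words, the machine-config-server's choice of which rendered config to hand a newly booting node becomes conditional on whether any node is already running the spec config. A minimal sketch of that rule, assuming hypothetical names (Pool, pickServedConfig, UpdatedMachineCount) rather than the actual MCO types:

```go
// Minimal sketch of the serving rule described above, not the actual MCO
// code; Pool, pickServedConfig, and the field names are invented for
// illustration.
package main

import "fmt"

// Pool carries only the fields the rule cares about.
type Pool struct {
	SpecConfig          string // newest rendered config the pool targets
	StatusConfig        string // last config known to have rolled out
	UpdatedMachineCount int    // nodes already running SpecConfig successfully
}

// pickServedConfig decides which config a newly booting node is served.
// Before this change the server always returned StatusConfig; with it, the
// server returns SpecConfig once at least one node runs that config.
func pickServedConfig(p Pool) string {
	if p.UpdatedMachineCount > 0 {
		return p.SpecConfig
	}
	return p.StatusConfig
}

func main() {
	p := Pool{
		SpecConfig:          "rendered-worker-new",
		StatusConfig:        "rendered-worker-old",
		UpdatedMachineCount: 1,
	}
	fmt.Println(pickServedConfig(p)) // rendered-worker-new
}
```

The guard is deliberate: serving the spec config unconditionally would let a bad config break every new node, while requiring at least one successful node first keeps the blast radius bounded.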

@cgwalters (Member Author)

(only compile-tested)

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2020
@cgwalters (Member Author)

Also see #1619, which touched on this a bit.

@cgwalters (Member Author)

It looks like in https://bugzilla.redhat.com/show_bug.cgi?id=1873288 we actually didn't have any nodes successfully on the new config, so this wouldn't have helped. But I think that's mostly bad luck: the new config could have rolled out, and we just happened to pick a node that couldn't drain.

Perhaps the alternative of a newNodesUseSpec bool value that an admin can flip on in these situations is simplest.
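
For illustration only, that alternative could be an admin-settable field gating the same check; NewNodesUseSpec below is an invented name, not a real MachineConfigPool field:

```go
// Hypothetical variant of the earlier sketch: NewNodesUseSpec is an invented
// admin-settable escape hatch, not part of the real MachineConfigPool API.
package main

import "fmt"

type Pool struct {
	SpecConfig          string
	StatusConfig        string
	UpdatedMachineCount int
	NewNodesUseSpec     bool // admin opt-in: always serve the spec config to new nodes
}

func pickServedConfig(p Pool) string {
	if p.NewNodesUseSpec || p.UpdatedMachineCount > 0 {
		return p.SpecConfig
	}
	return p.StatusConfig
}

func main() {
	p := Pool{
		SpecConfig:      "rendered-worker-new",
		StatusConfig:    "rendered-worker-old",
		NewNodesUseSpec: true,
	}
	// Serves the spec config even though no node has updated yet.
	fmt.Println(pickServedConfig(p))
}
```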

@ashcrow ashcrow requested review from runcom and removed request for ashcrow August 31, 2020 13:38
@openshift-ci-robot (Contributor)

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-ovn-step-registry | 4bd204d | link | /test e2e-ovn-step-registry |
| ci/prow/e2e-gcp-upgrade | 4bd204d | link | /test e2e-gcp-upgrade |
| ci/prow/e2e-aws-workers-rhel7 | 4bd204d | link | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws | 4bd204d | link | /test e2e-aws |
| ci/prow/okd-e2e-aws | 4bd204d | link | /test okd-e2e-aws |
| ci/prow/e2e-gcp-op | 4bd204d | link | /test e2e-gcp-op |
| ci/prow/e2e-upgrade | 4bd204d | link | /test e2e-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters cgwalters added the 4.7 Work deferred for 4.7 label Sep 30, 2020
@cgwalters (Member Author) commented on Oct 22, 2020

Now this PR also would have greatly mitigated the issue we hit in #2167.
I'd like to consider this for 4.7 (it probably needs an e2e, which is a bit annoying to write but doable, and we need the excuse to start testing our interaction with machineAPI).

As it stands right now, scaling up during an upgrade basically just makes things slower and worse, because:

  • node boots
  • does the firstboot pivot
  • joins the cluster
  • node now needs to be targeted by the MCO for upgrade to the new config
  • node will be drained, have the new OS update applied, and be rebooted again

With this change, the flow would instead be what you'd expect:

  • node boots
  • node does firstboot pivot to the new upgrade target
  • joins the cluster

And the node is now ready to take on workloads that need to be migrated off previously existing nodes.

@runcom runcom changed the title server: Target the spec configuration if we have at least one node Bug 1873288: server: Target the spec configuration if we have at least one node Oct 22, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 22, 2020
@openshift-ci-robot (Contributor)

@cgwalters: This pull request references Bugzilla bug 1873288, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1873288: server: Target the spec configuration if we have at least one node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom (Member) commented on Oct 22, 2020

This, related to the BZ just attached, is very much welcome in 4.7, if we can fix the BZ and have something that works in those scenarios.

@cgwalters (Member Author)

/retest

@yuqi-zhang (Contributor) left a comment

I am in favour of this approach. A use case is e.g. CI, where we dynamically spin up and down new nodes: upgrades could be stalled much longer, since all new nodes would have to update (and drain workloads, which will take hours). Will do some manual testing.

@sinnykumari (Contributor)

lgtm. Will wait for Jerry's test result.

@yuqi-zhang (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@alvaroaleman

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

23 similar comments

@openshift-merge-robot openshift-merge-robot merged commit ba0a056 into openshift:master Nov 6, 2020
@openshift-ci-robot (Contributor)

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 1873288 has been moved to the MODIFIED state.

In response to this:

Bug 1873288: server: Target the spec configuration if we have at least one node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alvaroaleman

@cgwalters @yuqi-zhang any chance this could get backported to 4.6?

@cgwalters (Member Author)

/cherrypick release-4.6

@openshift-cherrypick-robot

@cgwalters: new pull request created: #2225

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
