
Bug 1873288: server: Target the spec configuration if we have at least one node #2035

Merged: 1 commit merged into openshift:master on Nov 6, 2020

Conversation

@cgwalters (Member)

The CI cluster hit an issue where a pull secret was broken, and we then hit a deadlock: the MCO failed to drain nodes on the old config because the other nodes, also on the old config, couldn't schedule the pod.

It just generally makes sense for new nodes to use the new config; do so as long as at least one node has successfully joined the cluster at that config. This way we still avoid breaking the cluster (and scale-up) with a bad config.

xref: https://bugzilla.redhat.com/show_bug.cgi?id=1873288
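
In other words, the machine-config-server's choice of which rendered config to hand a newly booting node becomes conditional on whether any node is already running the spec config. A minimal sketch of that rule, assuming hypothetical names (Pool, pickServedConfig, UpdatedMachineCount) rather than the actual MCO types:

```go
// Minimal sketch of the serving rule described above, not the actual MCO
// code; Pool, pickServedConfig, and the field names are invented for
// illustration.
package main

import "fmt"

// Pool carries only the fields the rule cares about.
type Pool struct {
	SpecConfig          string // newest rendered config the pool targets
	StatusConfig        string // last config known to have rolled out
	UpdatedMachineCount int    // nodes already running SpecConfig successfully
}

// pickServedConfig decides which config a newly booting node is served.
// Before this change the server always returned StatusConfig; with it, the
// server returns SpecConfig once at least one node runs that config.
func pickServedConfig(p Pool) string {
	if p.UpdatedMachineCount > 0 {
		return p.SpecConfig
	}
	return p.StatusConfig
}

func main() {
	p := Pool{
		SpecConfig:          "rendered-worker-new",
		StatusConfig:        "rendered-worker-old",
		UpdatedMachineCount: 1,
	}
	fmt.Println(pickServedConfig(p)) // rendered-worker-new
}
```

The guard is deliberate: serving the spec config unconditionally would let a bad config break every new node, while requiring at least one successful node first keeps the blast radius bounded.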

@cgwalters (Member Author)

(only compile-tested)

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2020
@cgwalters (Member Author)

Also see #1619, which touched on this a bit.

@cgwalters (Member Author)

It looks like in https://bugzilla.redhat.com/show_bug.cgi?id=1873288 we actually didn't have any nodes successfully on the new config, so this wouldn't have helped. But I think that's mostly bad luck: the new config could have rolled out, and we just happened to pick a node that couldn't drain.

Perhaps the alternative of a newNodesUseSpec bool value that an admin can flip on in these situations is simplest.
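
For illustration only, that alternative could be an admin-settable field gating the same check; NewNodesUseSpec below is an invented name, not a real MachineConfigPool field:

```go
// Hypothetical variant of the earlier sketch: NewNodesUseSpec is an invented
// admin-settable escape hatch, not part of the real MachineConfigPool API.
package main

import "fmt"

type Pool struct {
	SpecConfig          string
	StatusConfig        string
	UpdatedMachineCount int
	NewNodesUseSpec     bool // admin opt-in: always serve the spec config to new nodes
}

func pickServedConfig(p Pool) string {
	if p.NewNodesUseSpec || p.UpdatedMachineCount > 0 {
		return p.SpecConfig
	}
	return p.StatusConfig
}

func main() {
	p := Pool{
		SpecConfig:      "rendered-worker-new",
		StatusConfig:    "rendered-worker-old",
		NewNodesUseSpec: true,
	}
	// Serves the spec config even though no node has updated yet.
	fmt.Println(pickServedConfig(p))
}
```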

@ashcrow ashcrow requested review from runcom and removed request for ashcrow August 31, 2020 13:38
@openshift-ci-robot (Contributor)

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-ovn-step-registry | 4bd204d | link | /test e2e-ovn-step-registry |
| ci/prow/e2e-gcp-upgrade | 4bd204d | link | /test e2e-gcp-upgrade |
| ci/prow/e2e-aws-workers-rhel7 | 4bd204d | link | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws | 4bd204d | link | /test e2e-aws |
| ci/prow/okd-e2e-aws | 4bd204d | link | /test okd-e2e-aws |
| ci/prow/e2e-gcp-op | 4bd204d | link | /test e2e-gcp-op |
| ci/prow/e2e-upgrade | 4bd204d | link | /test e2e-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters cgwalters added the 4.7 Work deferred for 4.7 label Sep 30, 2020
@cgwalters (Member Author) commented on Oct 22, 2020

Now this PR also would have greatly mitigated the issue we hit in #2167.
I'd like to consider this for 4.7 (it probably needs an e2e, which is a bit annoying to write but doable, and we need the excuse to start testing our interaction with machineAPI).

As it stands right now, scaling up during an upgrade basically just makes things slower and worse, because:

  • node boots
  • does the firstboot pivot
  • joins the cluster
  • node now needs to be targeted by the MCO for upgrade to the new config
  • node will be drained, have the new OS update applied, and be rebooted again

With this change, the flow would instead be what you'd expect:

  • node boots
  • node does firstboot pivot to the new upgrade target
  • joins the cluster

And the node is now ready to take on workloads that need to be migrated off previously existing nodes.

@runcom runcom changed the title server: Target the spec configuration if we have at least one node Bug 1873288: server: Target the spec configuration if we have at least one node Oct 22, 2020
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 22, 2020
@openshift-ci-robot (Contributor)

@cgwalters: This pull request references Bugzilla bug 1873288, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1873288: server: Target the spec configuration if we have at least one node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom (Member) commented on Oct 22, 2020

This, related to the BZ just attached, is very much welcome in 4.7, if we can fix the BZ and have something that works in those scenarios.

@cgwalters (Member Author)

/retest

@yuqi-zhang (Contributor) left a comment

I am in favour of this approach. A use case is e.g. CI, where we dynamically spin up and down new nodes: upgrades could be stalled much longer, since all new nodes would have to update (and drain workloads, which will take hours). Will do some manual testing.

@sinnykumari (Contributor)

lgtm. Will wait for Jerry's test result.

@yuqi-zhang (Contributor)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 4, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@alvaroaleman

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

23 similar comments

@openshift-merge-robot openshift-merge-robot merged commit ba0a056 into openshift:master Nov 6, 2020
@openshift-ci-robot (Contributor)

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 1873288 has been moved to the MODIFIED state.

In response to this:

Bug 1873288: server: Target the spec configuration if we have at least one node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alvaroaleman

@cgwalters @yuqi-zhang any chance this could get backported to 4.6?

@cgwalters (Member Author)

/cherrypick release-4.6

@openshift-cherrypick-robot

@cgwalters: new pull request created: #2225

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
