Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add support for KCP remediation during cluster provisioning #7963

Merged

Conversation

fabriziopandini
Copy link
Member

What this PR does / why we need it:
This PR proposes an amendment to KCP remediation so it will be possible to remediate failure happening while provisioning the CP (both first CP and CP machines while current replica < desired replica); It also introduces opt-in control of remediation retry behavior.

#7855 is the amendment to the KCP proposal describing those changes

Which issue(s) this PR fixes:
Fixes #7496

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 20, 2023
Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First set of comments

test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Show resolved Hide resolved
test/e2e/kcp_remediations.go Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the review spam :)

Now reviewed everything except the remediation_test.go (anticipating some changes there based on prod-code changes)

Great work on the e2e test!!

Wondering if it makes sense to move the e2e test into a separate PR that we can merge through independently.
I guess most of the test should work against main as well (of course not remediating the first machine).

Might just be nice to keep both PRs and reviews more focused and to have some sort of "remediation worked before the change and after" confirmation.

test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 31, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2023
@fabriziopandini fabriziopandini changed the title [WIP] ✨ Add support for KCP remediation during cluster provisioning ✨ Add support for KCP remediation during cluster provisioning Feb 3, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 3, 2023
@fabriziopandini
Copy link
Member Author

@vincepri @sbueringer
Thanks for the great feedback, this PR should be ready for a final pass now!

Please note that:

  • Now last remediation data is stored on machines, and this is the source of truth that KCP uses to take decisions and to surface the last remediation on kcp.status (kcp.status is not used anymore for taking decisions).
  • The change made it possible to have a remediation history for each machine, thus handling also the use case of failures flickering from one machine to another.
  • WRT to defaults for the new API fields, I'm proposing to use current behavior as a default since we don't have demand or a set of trusted data to use for picking up other values.

@fabriziopandini
Copy link
Member Author

/retest

@fabriziopandini
Copy link
Member Author

/test pull-cluster-api-e2e-full-main

@sbueringer
Copy link
Member

/test pull-cluster-api-e2e-full-main

@sbueringer
Copy link
Member

@fabriziopandini I'm pretty sure the e2e test doesn't work on Linux. I'm working on a fix which works at least on my Machine. I'll open a PR to verify if it works in Prow as well.

@sbueringer
Copy link
Member

PR is open: #8075

Let's see if that works. I also improved the token creation.

Copy link
Member

@sbueringer sbueringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first part of next round of review
(everything except internal/controllers)

docs/book/src/reference/labels_and_annotations.md Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
test/e2e/kcp_remediations.go Outdated Show resolved Hide resolved
@sbueringer
Copy link
Member

sbueringer commented Feb 7, 2023

@fabriziopandini Opened a PR in your repo (fabriziopandini#55) which merges #8075 into the current PR

@sbueringer
Copy link
Member

sbueringer commented Feb 8, 2023

just fyi, I'll address my own and other findings and then merge the fixes into this PR via #8075

Co-authored-by: sbueringer <buringerst@vmware.com>
@fabriziopandini
Copy link
Member Author

@sbueringer @vincepri @jackfrancis this is ready for a final pass
kudos to @sbueringer for nailing down the work on fabriziopandini#55 and #8075, it really helped to push this to the last mile!

@sbueringer
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 18d61cd0fc14f15b5e8f38bfd4e1c26871e678d8

@sbueringer
Copy link
Member

starting lazy consensus from community meeting, 1 week from today (target Feb 22)

@sbueringer
Copy link
Member

/test pull-cluster-api-e2e-full-main

@sbueringer
Copy link
Member

Unrelated flake
/retest

@sbueringer
Copy link
Member

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2023
@sbueringer
Copy link
Member

/retest
the usual unit test flake

@k8s-ci-robot k8s-ci-robot merged commit 11e8e87 into kubernetes-sigs:main Feb 22, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.4 milestone Feb 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KCP Doesn't Remediate Faulty Machines During Cluster Formation
5 participants