Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 MHC marks machines for remediation with conditions #3108

Merged

Conversation

benmoss
Copy link

@benmoss benmoss commented May 28, 2020

What this PR does / why we need it:
Moves MHC controller and MachineSet to use conditions instead of MHC deleting them directly. This is the first part of #2976

@vincepri @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 28, 2020
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 28, 2020
@benmoss benmoss force-pushed the machineset-remediations branch 5 times, most recently from 4f64419 to c45798b Compare May 28, 2020 21:52
Copy link
Member

@fabriziopandini fabriziopandini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benmoss for this PR! Great to see conditions in action!

api/v1alpha3/machine_types.go Outdated Show resolved Hide resolved
controllers/machinehealthcheck_controller.go Outdated Show resolved Hide resolved
controllers/machinehealthcheck_targets.go Outdated Show resolved Hide resolved
controllers/machinehealthcheck_targets.go Outdated Show resolved Hide resolved
controllers/machinehealthcheck_targets.go Outdated Show resolved Hide resolved
controllers/machineset_controller.go Outdated Show resolved Hide resolved
controllers/machineset_controller.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2020
@vincepri
Copy link
Member

vincepri commented Jun 2, 2020

/retitle 🌱 MHC marks machines for remediation with conditions

@k8s-ci-robot k8s-ci-robot changed the title 🏃 MHC marks machines for remediation with conditions 🌱 MHC marks machines for remediation with conditions Jun 2, 2020
@vincepri
Copy link
Member

vincepri commented Jun 3, 2020

@benmoss What's the status of this PR?

@benmoss
Copy link
Author

benmoss commented Jun 3, 2020

It's waiting for approval and lgtm

@vincepri
Copy link
Member

vincepri commented Jun 3, 2020

Let's sync up with @fabriziopandini on where to move and how to name conditions constants, there is also the other comment I made about moving the tests in the other package without the need to create a new envtest.

@benmoss
Copy link
Author

benmoss commented Jun 3, 2020

Yeah, I still disagree with that comment. The controllers tests are a mess in part because of the mix of envtest and unit tests going on there. I think we should keep full integration tests like these separate.

@vincepri
Copy link
Member

vincepri commented Jun 3, 2020

I'd like to better understand the benefit of having multiple separate suites. There is some downsides I can see with this approach if it grows past a certain point, for example getting a new envtest up takes quite a bit of time and having multiple of them can greatly increase the job's time.

Envtest doesn't come for free either, on my machine which isn't affected by other running jobs (like in prow) spinning up etcd and api-server takes 3-400% CPU on start and it settles betwen 2-300% during tests; we can potentially cause flakes or other resource-starvation problems in prow by creating more than necessary.

@benmoss
Copy link
Author

benmoss commented Jun 3, 2020

So for one by marking these as a separate suite we can address the resource consumption issues. I've been running this test using the USE_EXISTING_CLUSTER environment variable with this test and it works and cleans up all the resources that it creates. However other controller tests don't seem to behave nicely and fail when run with this flag, I haven't looked into why, but probably there just isn't any cleanup logic in them. But on CI we could use an external cluster for these, rather than something that has to share resources with prow.

But even still the broader point is that these are slow tests and are fully integrated against the apiserver and etcd. They are a different kind of test, so they should be represented that way. I'm fine to move this into a separate issue. I tried moving this test into the MachineSet test suite but I think I have to convert it into a Ginkgo test for it to work.

@vincepri
Copy link
Member

vincepri commented Jun 3, 2020

I tried moving this test into the MachineSet test suite but I think I have to convert it into a Ginkgo test for it to work.

In machinehealthcheck_controller_test.go, we can add these tests under var _ = Describe("MachineHealthCheck Reconciler", func() { or, probably even better, create a new top level Describe.

@benmoss benmoss force-pushed the machineset-remediations branch 4 times, most recently from ab5449e to 53aed25 Compare June 3, 2020 18:19
@benmoss
Copy link
Author

benmoss commented Jun 3, 2020

This looks like a failure due to the MHC controller and the MachineSet controller clobbering each other's condition patches and the MHC controller ultimately winning. Not sure why we weren't seeing it before.

/hold

@benmoss
Copy link
Author

benmoss commented Jun 3, 2020

Just out of curiosity if it passes, it seems to not fail that often on my machine but maybe it's related to the number of cores or something.

/test pull-cluster-api-test

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2020
@vincepri
Copy link
Member

vincepri commented Jun 8, 2020

@benmoss Let's try to get this merged today if possible

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 8, 2020
api/v1alpha3/condition_consts.go Outdated Show resolved Hide resolved
api/v1alpha3/condition_consts.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 9, 2020
@vincepri
Copy link
Member

vincepri commented Jun 9, 2020

LGTM pending squash

@benmoss
Copy link
Author

benmoss commented Jun 10, 2020

@vincepri it has been squished

@vincepri
Copy link
Member

@benmoss mind rebasing on master if you haven't already? Seems some of the tests are failing.

- Implements MachineSet as the only controller that observes this
condition
@benmoss
Copy link
Author

benmoss commented Jun 10, 2020

Seeing this failure even after rebase

• Failure [10.128 seconds]
Patch Helper
/home/mossity/workspace/cluster-api/util/patch/patch_test.go:38
  Should patch conditions
  /home/mossity/workspace/cluster-api/util/patch/patch_test.go:101
    on a clusterv1.Cluster object
    /home/mossity/workspace/cluster-api/util/patch/patch_test.go:149
      should recover if there is a resolvable conflict, incl. patch spec and status [It]
      /home/mossity/workspace/cluster-api/util/patch/patch_test.go:226

      Timed out after 10.000s.
      Expected
          <bool>: false
      to be true

      /home/mossity/workspace/cluster-api/util/patch/patch_test.go:270

opened #3178

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jun 10, 2020

@benmoss: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-cluster-api-apidiff 2c77beb link /test pull-cluster-api-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@@ -29,7 +29,7 @@ const (
// Conditions and condition Reasons for the Machine object

// MachineSummaryConditionsCount defines the total number of conditions on the Machine object.
const MachineSummaryConditionsCount = 2
const MachineSummaryConditionsCount = 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this can probably be removed now, but we can do in a follow-up

cc @fabriziopandini

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 10, 2020
@k8s-ci-robot k8s-ci-robot merged commit 650a357 into kubernetes-sigs:master Jun 10, 2020
@benmoss benmoss deleted the machineset-remediations branch June 11, 2020 13:41
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants