Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KCP triggers rollout after upgrade #8124

Closed
sbueringer opened this issue Feb 16, 2023 · 14 comments · Fixed by #8125
Closed

KCP triggers rollout after upgrade #8124

sbueringer opened this issue Feb 16, 2023 · 14 comments · Fixed by #8125
Assignees
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@sbueringer
Copy link
Member

sbueringer commented Feb 16, 2023

As described in #8101 our e2e-main test is currently flaky. The test case which is responsible for that is the clusterctl upgrade test (v0.3 => main). The problem is that once we upgrade Cluster API to the version from main, Machines are rolled out. After a bit of investigation we found out that the KCP controller triggers the rollout.

[Spoiler alert] The test is failing roughly since 2 weeks ago when we merged #7772

Some Context:

  • We introduced a new ImagePullPolicy field in KubeadmConfigSpec.InitConfiguration.NodeRegistration and KubeadmConfigSpec.JoinConfiguration.NodeRegistration
  • The ImagePullPolicy field has a default value (IfNotPresent) which is applied via OpenAPI
  • The defaulting happens automatically as soon as there is any kind of update on KCP and KubeadmConfig objects
  • When KCP checks if it needs to rollout a machine it also compares the KubeadmConfigSpec in KCP with the KubeadmConfig of existing Machines

Example error case:

  • We install Cluster API v0.3
  • We create the cluster including the KCP object
  • Cluster API is upgraded (to the "main" version)
  • ImagePullPolicy is defaulted in the KubeadmConfig of the Machine. It is not defaulted on the KCP object.
    • Because there was an update on the Machine which triggered defaulting, but not on the KCP object.
  • KCP controller reconcile detects a diff and triggers a rollout
    • Diff is: InitConfiguration.NodeRegistration.ImagePullPolicy: IfNotPresent != ""

The solution is to implement defaulting in a way that it doesn't make a difference when we calculate if we need a rollout or not:

  • Move the defaulting from OpenAPI to the default func: 998f5f3
    • Context: the default func is called on both KCP and KubeadmConfig before diffing

P.S. This bug doesn't affect any released version of Cluster API as #7772 was just merged on main.

/kind bug
/area control-plane
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/control-plane Issues or PRs related to control-plane lifecycle management needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2023
@sbueringer
Copy link
Member Author

@sbueringer
Copy link
Member Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 16, 2023
@ykakarap
Copy link
Contributor

Thanks for the analysis 🙏🏼 .
This is similar to when we had unexpected rollouts in KCP after adding defaulting to format. Old issue.

We should consider adding a linter check to enforce that we don't accidentally add defaulting using OpenAPI in KCP to avoid have the same problem again in the future.

@sbueringer
Copy link
Member Author

@ykakarap Great idea. I'll open a follow-up issue for the linter.

@vincepri
Copy link
Member

If I understood this correctly, the issue happens because the diffing logic in KCP does not take into account the OpenAI defaulting logic?

@ykakarap
Copy link
Contributor

If I understood this correctly, the issue happens because the diffing logic in KCP does not take into account the OpenAI defaulting logic?

I believe you mean OpenAPI 😅. Yes. KCP logic runs the defaulting webhook code but that does not take care of applying any of the OpenAPI defaulting.

@sbueringer
Copy link
Member Author

Yup and then it's just racy when/if defaulting is done on the KCP and KubeadmConfig objects in the apiserver

@vincepri
Copy link
Member

Yeah OpenAPI 😄 , I'm wondering if we can have a way to disable the use for any tree of fields that we compare in code

@sbueringer
Copy link
Member Author

sbueringer commented Feb 17, 2023

Hm I think on our side in CAPI we could have either a linter (based on controller-tools libraries or just parsing the final CRDs) or maybe a unit test.

I guess the most hacky / trivial thing is just to write a small go binary which parses the relevant CRDs and then fails if it finds a default value in a specific sub-tree.

On controller-tools side I could imagine some sort of marker which disallows the default marker for a struct and everything below.

@sbueringer
Copy link
Member Author

/reopen

for follow-ups

@k8s-ci-robot
Copy link
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen

for follow-ups

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Feb 20, 2023
@sbueringer sbueringer self-assigned this Feb 20, 2023
@furkatgofurov7
Copy link
Member

@sbueringer @killianmuldoon thanks for the great effort on fixing the tests, perhaps we can close this as well (assuming follow-up #8139 is also merged)

@sbueringer
Copy link
Member Author

I was referring to additional measures to ensure this doesn't happen again with follow-ups.

But I've created a separate issue for that now: #8147

Makes sense to close this issue to signal that the issue is resolved

/close

@k8s-ci-robot
Copy link
Contributor

@sbueringer: Closing this issue.

In response to this:

I was referring to additional measures to ensure this doesn't happen again with follow-ups.

But I've created a separate issue for that now: #8147

Makes sense to close this issue to signal that the issue is resolved

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants