
🐛 test/e2e check for machines being ready after provisioning on Runtime SDK test #8625

Conversation

@chrischdi (Member) commented on May 9, 2023:

What this PR does / why we need it:

Sometimes, by the time the test reaches the part where we wait for the hooks, not all Machines are ready yet. As a result, the hook does not even get executed during the 30s period.

Logs from failed tests show that often only about a minute more would have been needed for everything to succeed.

This PR adds a check to ensure all Machines are healthy before continuing.
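For illustration, such a health check could look roughly like the sketch below. The helper name, parameters, intervals, and the choice of the NodeHealthy condition are assumptions made for this example; they are not the exact code added by this PR (the real change is in the diff excerpt discussed further down).

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	"github.com/pkg/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForMachinesHealthy is a hypothetical helper (name, parameters and
// intervals are illustrative, not taken from this PR). It blocks until every
// Machine belonging to the given Cluster reports a healthy Node, so that the
// later wait for the BeforeClusterUpgrade hook is not eaten up by the tail
// end of the initial provisioning.
func waitForMachinesHealthy(ctx context.Context, c client.Client, namespace, clusterName string) {
	Eventually(func() error {
		machineList := &clusterv1.MachineList{}
		if err := c.List(ctx, machineList,
			client.InNamespace(namespace),
			// Cluster API labels every Machine with the name of its Cluster.
			client.MatchingLabels{"cluster.x-k8s.io/cluster-name": clusterName},
		); err != nil {
			return errors.Wrap(err, "failed to list Machines")
		}
		for i := range machineList.Items {
			m := &machineList.Items[i]
			if !conditions.IsTrue(m, clusterv1.MachineNodeHealthyCondition) {
				return errors.Errorf("Machine %s does not have a healthy Node yet", m.Name)
			}
		}
		return nil
	}, 10*time.Minute, 10*time.Second).Should(Succeed(), "all Machines should be healthy before continuing")
}
```

In the e2e spec this kind of wait naturally slots into the PostMachinesProvisioned callback, so it runs after provisioning and before the upgrade steps that exercise the hooks.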

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Workaround for #8486

We still want to track it down to see whether there is an underlying issue.

cc @killianmuldoon


@k8s-ci-robot added the cncf-cla: yes and size/S labels on May 9, 2023.
@chrischdi changed the title from "test/e2e check for machines being ready after provisioning on Runtime SDK test" to "🐛 test/e2e check for machines being ready after provisioning on Runtime SDK test" on May 9, 2023.
@elmiko (Contributor) left a comment:

code logic looks ok to me, but the linter is unhappy ;)

@chrischdi force-pushed the pr/e2e-runtimesdk-increase-timeout branch from e4144f8 to 4adc1ec on May 9, 2023 13:49.
@chrischdi (Member, Author) commented:

> code logic looks ok to me, but the linter is unhappy ;)

yep, fixed :-)

@chrischdi (Member, Author) commented:

/cherry-pick release-1.4

@k8s-infra-cherrypick-robot

@chrischdi: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrischdi (Member, Author) commented:

/cherry-pick release-1.3

@k8s-infra-cherrypick-robot

@chrischdi: once the present PR merges, I will cherry-pick it on top of release-1.3 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Review threads on test/e2e/cluster_upgrade_runtimesdk.go (resolved; one marked outdated). The inline comments below refer to this diff excerpt:
@@ -169,6 +169,23 @@ func clusterUpgradeWithRuntimeSDKSpec(ctx context.Context, inputGetter func() cl
clusterRef,
input.E2EConfig.GetIntervals(specName, "wait-cluster"))
},
PostMachinesProvisioned: func() {
Eventually(func() error {
machineList := &clusterv1.MachineList{}
A reviewer (Member) commented:

Is the goal here to check that Nodes of all Machines are ready (CP + MD)?

If it's just about MDs, wouldn't it be better to check the MD status replica fields?

@chrischdi (Member, Author) replied on May 9, 2023:

It's about all nodes of the cluster.

Edit, to be more precise:

  • KCP should not be IsScaling:
    cpScaling, err := contract.ControlPlane().IsScaling(s.Current.ControlPlane.Object)

  • No MD should be RollingOut:
    if s.Current.MachineDeployments.IsAnyRollingOut() {

The reviewer (Member) replied:

So basically we use node readiness as a proxy for the things the topology controller is actually checking.

Let's please add a godoc comment saying that we check for node healthiness to ensure the cluster is roughly stable, so that the upgrade we run below is not delayed too much by the initial cluster creation.

I think the check is not exactly the same as MD IsAnyRollingOut, as that one also checks for available replicas, which can lag depending on MinReadySeconds.

But overall I guess it's good enough; let's just make sure we document clearly why we do it.
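For illustration, the requested documenting comment could read roughly like this (an assumed wording, not necessarily what was merged):

```go
// Wait until all Machines have healthy Nodes before starting the upgrade.
// Node healthiness is only a proxy for what the topology controller actually
// gates on (KCP not scaling, no MachineDeployment rolling out), but it keeps
// the tail end of the initial cluster creation from eating into the
// BeforeClusterUpgrade hook timeout used below.
```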

@chrischdi (Member, Author) replied:

Added a comment and squashed. PTAL

The reviewer (Member) replied:

Sounds good, thank you!

@ykakarap (Contributor) left a comment:

Curious: the topology controller (the controller that calls the hook) does not depend on the Machines being ready to call the hook, and the hook response is also not dependent on the Machines being ready. I am curious how Machines not being ready block the hook from getting called. 🤔

@chrischdi (Member, Author) commented:

> Curious: the topology controller (the controller that calls the hook) does not depend on the Machines being ready to call the hook, and the hook response is also not dependent on the Machines being ready. I am curious how Machines not being ready block the hook from getting called. 🤔

We actually check the MDs and postpone calling the hook if an MD is rolling out:

if s.Current.MachineDeployments.IsAnyRollingOut() {

Just above that, we also check whether the KCP IsScaling.
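Putting the two quoted snippets together, the gating looks roughly like this (a paraphrased sketch of the topology controller's flow, with error handling and surrounding code trimmed; variable names follow the snippets above, not a verbatim copy of the upstream code):

```go
// Paraphrased sketch; not a verbatim copy of the topology controller.
cpScaling, err := contract.ControlPlane().IsScaling(s.Current.ControlPlane.Object)
if err != nil {
	return errors.Wrap(err, "failed to check if the ControlPlane is scaling")
}
// While the control plane is scaling or any MachineDeployment is rolling out,
// the BeforeClusterUpgrade hook is not called in this reconcile.
if cpScaling || s.Current.MachineDeployments.IsAnyRollingOut() {
	return nil // defer the upgrade; a later reconcile picks it up once the cluster is stable
}
// Otherwise the BeforeClusterUpgrade hook is invoked and, if it does not ask
// to block, the version upgrade proceeds.
```

This is why unready Machines at the start of the test can delay the hook: the MachineDeployments still count as rolling out, so the controller defers the BeforeClusterUpgrade call past the test's 30s wait window.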

@ykakarap (Contributor) commented on May 9, 2023:

Ah! Good point.

Squashed commit message (… SDK test):

The topology controller checks whether the ControlPlane IsScaling() or any MachineDeployment IsRollingOut() before running the BeforeClusterUpgrade hook. In some e2e test runs the nodes were not yet ready, and the timeout for the BeforeClusterUpgrade hook was not sufficient for the nodes to become ready and the hook to run.

@chrischdi force-pushed the pr/e2e-runtimesdk-increase-timeout branch from c9998b8 to 642e16c on May 10, 2023 13:04.
@killianmuldoon (Contributor) left a comment:

/lgtm

Thanks again for taking this up!

@k8s-ci-robot added the lgtm label on May 11, 2023.
@k8s-ci-robot (Contributor) commented:

LGTM label has been added.

Git tree hash: 82259d45f76afe4390de9bb90f4d2095f5a363e7

@sbueringer (Member) commented:

Thank you very much!

/approve

@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of the relevant OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on May 12, 2023.
@k8s-ci-robot merged commit 9278ddd into kubernetes-sigs:main on May 12, 2023.
@k8s-ci-robot added this to the v1.5 milestone on May 12, 2023.
@k8s-infra-cherrypick-robot

@chrischdi: new pull request created: #8646

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-infra-cherrypick-robot

@chrischdi: new pull request created: #8647

In response to this:

/cherry-pick release-1.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johannesfrey (Contributor) commented:

/area testing

@k8s-ci-robot added the area/testing label on Jun 5, 2023.
@killianmuldoon (Contributor) commented:

/area e2e-testing

@k8s-ci-robot added the area/e2e-testing label on Jun 5, 2023.
@killianmuldoon removed the area/testing label on Jun 5, 2023.
@chrischdi deleted the pr/e2e-runtimesdk-increase-timeout branch on June 19, 2023 14:26.
Labels:

  • approved – Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/e2e-testing – Issues or PRs related to e2e testing.
  • cncf-cla: yes – Indicates the PR's author has signed the CNCF CLA.
  • lgtm – "Looks good to me", indicates that a PR is ready to be merged.
  • size/S – Denotes a PR that changes 10-29 lines, ignoring generated files.

8 participants