Cluster API quick-start with dualstack is flaky #8816

Closed
killianmuldoon opened this issue Jun 7, 2023 · 21 comments
Labels

  • area/networking: Issues or PRs related to networking
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon (Contributor)

Which jobs are flaking?

capi-e2e-dualstack-and-ipv6-main

Which tests are flaking?

Failing test cases:

Since when has it been flaking?

Since the dualstack tests were merged.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-dualstack-and-ipv6-main

Reason for failure (if possible)

Not clear at this point but there are some facts we can use to begin debugging this:

  • It occurs for both the IPv4-primary and IPv6-primary variants of the test.
  • It fails during the `should create a single stack service with cluster ip from primary service range` test. The error message is `service dualstack-6332/defaultclusterip expected family IPv4 at index[0] got IPv6` or `service dualstack-5879/defaultclusterip expected family IPv6 at index[0] got IPv4`, depending on the test variant. A minimal sketch of the check this test performs follows below.
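
To make the failure mode concrete, here is a minimal sketch (not the upstream test itself) of the check the conformance case performs, written against client-go. The package and function names are illustrative; the assumption being tested is that a Service created with no explicit IP-family settings must come up single-stack in the cluster's primary family:

    package dualstackcheck

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // checkDefaultClusterIPFamily creates a Service without ipFamilyPolicy or
    // ipFamilies set and verifies the apiserver defaulted it to the primary family.
    func checkDefaultClusterIPFamily(ctx context.Context, cs kubernetes.Interface, ns string, primary corev1.IPFamily) error {
        svc := &corev1.Service{
            ObjectMeta: metav1.ObjectMeta{Name: "defaultclusterip"},
            Spec: corev1.ServiceSpec{
                // No ipFamilyPolicy / ipFamilies: the apiserver should default
                // to a single-stack Service in the primary family.
                Type:     corev1.ServiceTypeClusterIP,
                Ports:    []corev1.ServicePort{{Port: 80}},
                Selector: map[string]string{"app": "dualstack-test"},
            },
        }
        created, err := cs.CoreV1().Services(ns).Create(ctx, svc, metav1.CreateOptions{})
        if err != nil {
            return err
        }
        if len(created.Spec.IPFamilies) == 0 {
            return fmt.Errorf("service %s/%s has no IP families assigned", ns, svc.Name)
        }
        // This is the assertion that flakes: index 0 must be the primary family.
        if got := created.Spec.IPFamilies[0]; got != primary {
            return fmt.Errorf("service %s/%s expected family %s at index[0] got %s", ns, svc.Name, primary, got)
        }
        return nil
    }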

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot added the kind/flake and needs-triage labels on Jun 7, 2023
@killianmuldoon (Contributor, Author)

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 7, 2023
@killianmuldoon (Contributor, Author)

/help

@k8s-ci-robot (Contributor)

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

@k8s-ci-robot added the help wanted label on Jul 12, 2023
@nawazkh (Member) commented Aug 14, 2023

/assign @nawazkh

@nawazkh (Member) commented Aug 23, 2023

This has been fixed with #9252 and the CI signal for capi-e2e-dualstack-and-ipv6-main looks green.
Thank you @chrischdi for opening the PR and fixing the issue so quickly!

Shall we close out this issue?

@killianmuldoon (Contributor, Author)

#9252 fixed #9240, which was about the failing tests. This is a pre-existing issue about the flakiness of the dualstack tests.

They're still flaky in the same way as far as I can tell, though it's been masked a bit by the failure connected with the v1.28.0 conformance upgrade.

@adilGhaffarDev (Contributor) commented Nov 15, 2023

Currently, in dual-stack I only see this flake:
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-e2e-dualstack-.*&xjob=.*-provider-.*#3ad7d5d4d855de76fe3b

with this error:

Expected success, but got an error:
    <*errors.withStack | 0xc002954720>: 
    Unable to run conformance tests: error container run failed with exit code 1
    {
        error: <*errors.withMessage | 0xc0012ce180>{
            cause: <*errors.errorString | 0xc0012f23c0>{
                s: "error container run failed with exit code 1",
            },
            msg: "Unable to run conformance tests",
        },
        stack: [0x1f72fbc, 0x202fea5, 0x201fba6, 0x1f68eed, 0x1f67874, 0x201fb13, 0x84e3fb, 0x8629b8, 0x4725a1],
    }

This is a kubetest conformance run that is failing here:
https://github.com/kubernetes-sigs/cluster-api/blob/8003f3ff6179ca1f26009e2d2b0754bcf14cb044/test/e2e/quick_start_test.go#L169C5-L169C5

Not really sure what the root cause could be; a few things I think we can check:

  • We have a pinned conformance image, and the pinned image might contain a flaky test. We can inspect the image and try changing it to see if that resolves the issue.
  • We can check how we are configuring the test; we might need to change something in the configuration (a sketch of the invocation follows below).
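
For reference, a hedged sketch of how the quick-start spec hands off to kubetest. `kubetest.Run` and the `RunInput` fields shown here are based on cluster-api's test/framework/kubetest package, but treat the exact field names and the config path as assumptions and check quick_start_test.go for the authoritative call:

    package e2e

    import (
        "context"

        . "github.com/onsi/gomega"

        "sigs.k8s.io/cluster-api/test/framework"
        "sigs.k8s.io/cluster-api/test/framework/kubetest"
    )

    // runDualstackConformance runs the pinned conformance suite against the
    // workload cluster. The config file is where the dualstack focus/skip
    // expressions live, so it is one place to look when triaging this flake.
    func runDualstackConformance(ctx context.Context, proxy framework.ClusterProxy) {
        Expect(kubetest.Run(ctx, kubetest.RunInput{
            ClusterProxy:   proxy,
            ConfigFilePath: "./data/kubetest/dualstack.yaml", // assumed path
            NumberOfNodes:  2,                                // assumed count
        })).To(Succeed())
    }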

We need to either update this issue or create a new one for the conformance-test flake, because I believe this issue is tracking a different flake that is no longer occurring. @killianmuldoon please confirm.

@chrischdi (Member)

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.
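
If it helps, a hypothetical helper along these lines could grab the container output on failure. Everything here is illustrative, and it assumes the conformance container still exists on the host and docker is on PATH:

    package diagnostics

    import (
        "context"
        "os"
        "os/exec"
    )

    // dumpConformanceLogs copies `docker logs <container>` into a file so the
    // output can be attached to the job artifacts.
    func dumpConformanceLogs(ctx context.Context, containerName, outPath string) error {
        out, err := exec.CommandContext(ctx, "docker", "logs", containerName).CombinedOutput()
        if err != nil {
            return err
        }
        return os.WriteFile(outPath, out, 0o644)
    }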

More persistent link: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-13&job=.*-cluster-api-e2e-dualstack-.*&xjob=.*-provider-.*#3ad7d5d4d855de76fe3b

@adilGhaffarDev (Contributor)

I would also like to add that there are two other flakes in dualstack; they happen very rarely:

The conformance flake is the one that happens most of the time.

@nawazkh (Member) commented Jan 24, 2024

Not working on this actively, so unassigning myself for now. But please feel free to pull me in when debugging, @adilGhaffarDev.
/unassign

@hackeramitkumar (Member)

/assign

@adilGhaffarDev (Contributor)

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.

We are already collecting them. This error is for ipv4 primary ("When following the Cluster API quick-start with dualstack and ipv4 primary [IPv6] Should create a workload cluster"):

   • [FAILED] [15.394 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-390/defaultclusterip expected family IPv4 at index[0] got IPv6
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/23/24 08:12:31.746

In the case of ipv6 primary ("When following the Cluster API quick-start with dualstack and ipv6 primary [IPv6] Should create a workload cluster"), we see this:

   • [FAILED] [18.334 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-5886/defaultclusterip expected family IPv6 at index[0] got IPv4
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/16/24 00:27:29.771

@jackfrancis (Contributor)

@killianmuldoon it seems that the flaky dualstack tests are general and not related to MachinePools? Ref:

@willie-yao is tracking restoring those tests as part of graduating MachinePool from experimental.

What should the path forward be for MachinePools + dualstack tests given all of this context?

@killianmuldoon (Contributor, Author)

What should the path forward be for MachinePools + dualstack tests given all of this context?

I think the MachinePool versions of these tests are much flakier than the current ones. We should figure out the issues in the MachinePool PR while continuing to try to triage and fix this separate underlying flake.

@fabriziopandini (Member)

/priority important-soon

@k8s-ci-robot added the priority/important-soon label on Apr 11, 2024
@chrischdi (Member)

This seems to be good since #10424 merged 🎉

https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.*-cluster-api-.*&test=dualstack&xjob=.*-provider-.*

/close

@k8s-ci-robot (Contributor)

@chrischdi: Closing this issue.

In response to this:

This seems to be good since #10424 merged 🎉

https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.*-cluster-api-.*&test=dualstack&xjob=.*-provider-.*

/close

@killianmuldoon (Contributor, Author)

Great work on fixing this!

@jackfrancis (Contributor)

Yikes, fun one :)

@willie-yao you should be able to re-introduce MachinePools into tests and we'll be able to confirm right away that there's no MachinePool regression here.

@sbueringer (Member)

Good catch!

@chrischdi (Member)

Note: This got cherry-picked back to v1.5, v1.6 and v1.7.
