Cluster API quick-start with dualstack is flaky #8816

Closed
killianmuldoon opened this issue Jun 7, 2023 · 21 comments
Labels

  • area/networking: Issues or PRs related to networking
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon (Contributor)

Which jobs are flaking?

capi-e2e-dualstack-and-ipv6-main

Which tests are flaking?

Failing test cases:

Since when has it been flaking?

Since the dualstack tests were merged.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-dualstack-and-ipv6-main

Reason for failure (if possible)

Not clear at this point but there are some facts we can use to begin debugging this:

  • It occurs for both the IPv4-primary and IPv6-primary variants of the test.
  • It fails during the `should create a single stack service with cluster ip from primary service range` test. The error message is `service dualstack-6332/defaultclusterip expected family IPv4 at index[0] got IPv6` or `service dualstack-5879/defaultclusterip expected family IPv6 at index[0] got IPv4`, depending on the test variant. A minimal sketch of the check this test performs follows below.
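
To make the failure mode concrete, here is a minimal sketch (not the upstream test itself) of the check the conformance case performs, written against client-go. The package and function names are illustrative; the assumption being tested is that a Service created with no explicit IP-family settings must come up single-stack in the cluster's primary family:

    package dualstackcheck

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // checkDefaultClusterIPFamily creates a Service without ipFamilyPolicy or
    // ipFamilies set and verifies the apiserver defaulted it to the primary family.
    func checkDefaultClusterIPFamily(ctx context.Context, cs kubernetes.Interface, ns string, primary corev1.IPFamily) error {
        svc := &corev1.Service{
            ObjectMeta: metav1.ObjectMeta{Name: "defaultclusterip"},
            Spec: corev1.ServiceSpec{
                // No ipFamilyPolicy / ipFamilies: the apiserver should default
                // to a single-stack Service in the primary family.
                Type:     corev1.ServiceTypeClusterIP,
                Ports:    []corev1.ServicePort{{Port: 80}},
                Selector: map[string]string{"app": "dualstack-test"},
            },
        }
        created, err := cs.CoreV1().Services(ns).Create(ctx, svc, metav1.CreateOptions{})
        if err != nil {
            return err
        }
        if len(created.Spec.IPFamilies) == 0 {
            return fmt.Errorf("service %s/%s has no IP families assigned", ns, svc.Name)
        }
        // This is the assertion that flakes: index 0 must be the primary family.
        if got := created.Spec.IPFamilies[0]; got != primary {
            return fmt.Errorf("service %s/%s expected family %s at index[0] got %s", ns, svc.Name, primary, got)
        }
        return nil
    }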

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot added the kind/flake and needs-triage labels on Jun 7, 2023
@killianmuldoon (Contributor, Author)

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Jun 7, 2023
@killianmuldoon (Contributor, Author)

/help

@k8s-ci-robot (Contributor)

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

@k8s-ci-robot added the help wanted label on Jul 12, 2023
@nawazkh (Member) commented Aug 14, 2023

/assign @nawazkh

@nawazkh (Member) commented Aug 23, 2023

This has been fixed with #9252 and the CI signal for capi-e2e-dualstack-and-ipv6-main looks green.
Thank you @chrischdi for opening the PR and fixing the issue so quickly!

Shall we close out this issue?

@killianmuldoon (Contributor, Author)

#9252 fixed #9240, which was about the failing tests. This is a pre-existing issue about the flakiness of the dualstack tests.

They're still flaky in the same way as far as I can tell, though it's been masked a bit by the failure connected with the v1.28.0 conformance upgrade.

@adilGhaffarDev (Contributor) commented Nov 15, 2023

Currently, in dual-stack I only see this flake:
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-e2e-dualstack-.*&xjob=.*-provider-.*#3ad7d5d4d855de76fe3b

with this error:

Expected success, but got an error:
    <*errors.withStack | 0xc002954720>: 
    Unable to run conformance tests: error container run failed with exit code 1
    {
        error: <*errors.withMessage | 0xc0012ce180>{
            cause: <*errors.errorString | 0xc0012f23c0>{
                s: "error container run failed with exit code 1",
            },
            msg: "Unable to run conformance tests",
        },
        stack: [0x1f72fbc, 0x202fea5, 0x201fba6, 0x1f68eed, 0x1f67874, 0x201fb13, 0x84e3fb, 0x8629b8, 0x4725a1],
    }

This is a kubetest conformance run that is failing here:
https://github.com/kubernetes-sigs/cluster-api/blob/8003f3ff6179ca1f26009e2d2b0754bcf14cb044/test/e2e/quick_start_test.go#L169C5-L169C5

Not really sure what the root cause could be; a few things I think we can check:

  • We have a pinned conformance image, and the pinned image might contain a flaky test. We can inspect the image and try changing it to see if that resolves the issue.
  • We can check how we are configuring the test; we might need to change something in the configuration (a sketch of the invocation follows below).
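
For reference, a hedged sketch of how the quick-start spec hands off to kubetest. `kubetest.Run` and the `RunInput` fields shown here are based on cluster-api's test/framework/kubetest package, but treat the exact field names and the config path as assumptions and check quick_start_test.go for the authoritative call:

    package e2e

    import (
        "context"

        . "github.com/onsi/gomega"

        "sigs.k8s.io/cluster-api/test/framework"
        "sigs.k8s.io/cluster-api/test/framework/kubetest"
    )

    // runDualstackConformance runs the pinned conformance suite against the
    // workload cluster. The config file is where the dualstack focus/skip
    // expressions live, so it is one place to look when triaging this flake.
    func runDualstackConformance(ctx context.Context, proxy framework.ClusterProxy) {
        Expect(kubetest.Run(ctx, kubetest.RunInput{
            ClusterProxy:   proxy,
            ConfigFilePath: "./data/kubetest/dualstack.yaml", // assumed path
            NumberOfNodes:  2,                                // assumed count
        })).To(Succeed())
    }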

We need to either update this issue or create a new one for the conformance-test flake, because I believe this issue is tracking a different flake that is no longer occurring. @killianmuldoon please confirm.

@chrischdi (Member)

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.
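
If it helps, a hypothetical helper along these lines could grab the container output on failure. Everything here is illustrative, and it assumes the conformance container still exists on the host and docker is on PATH:

    package diagnostics

    import (
        "context"
        "os"
        "os/exec"
    )

    // dumpConformanceLogs copies `docker logs <container>` into a file so the
    // output can be attached to the job artifacts.
    func dumpConformanceLogs(ctx context.Context, containerName, outPath string) error {
        out, err := exec.CommandContext(ctx, "docker", "logs", containerName).CombinedOutput()
        if err != nil {
            return err
        }
        return os.WriteFile(outPath, out, 0o644)
    }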

More persistent link: https://storage.googleapis.com/k8s-triage/index.html?date=2023-11-13&job=.*-cluster-api-e2e-dualstack-.*&xjob=.*-provider-.*#3ad7d5d4d855de76fe3b

@adilGhaffarDev (Contributor)

I would also like to add that there are two other flakes in dualstack; they happen very rarely:

The conformance flake is the one that happens most of the time.

@nawazkh (Member) commented Jan 24, 2024

Not working on this actively, so unassigning myself for now. But please feel free to pull me in when debugging, @adilGhaffarDev.
/unassign

@hackeramitkumar (Member)

/assign

@adilGhaffarDev (Contributor)

Maybe we could collect the logs from the conformance container when hitting the issue to further diagnose this.

We are already collecting them. This error is for ipv4 primary ("When following the Cluster API quick-start with dualstack and ipv4 primary [IPv6] Should create a workload cluster"):

   • [FAILED] [15.394 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-390/defaultclusterip expected family IPv4 at index[0] got IPv6
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/23/24 08:12:31.746

In the case of ipv6 primary ("When following the Cluster API quick-start with dualstack and ipv6 primary [IPv6] Should create a workload cluster"), we see this:

   • [FAILED] [18.334 seconds]
  [sig-network] [Feature:IPv6DualStack] [It] should create a single stack service with cluster ip from primary service range
  test/e2e/network/dual_stack.go:204
  
    [FAILED] service dualstack-5886/defaultclusterip expected family IPv6 at index[0] got IPv4
    In [It] at: test/e2e/network/dual_stack.go:704 @ 02/16/24 00:27:29.771

@jackfrancis (Contributor)

@killianmuldoon it seems that the flaky dualstack tests are general and not related to MachinePools? Ref:

@willie-yao is tracking restoring those tests as part of graduating MachinePool from experimental.

What should the path forward be for MachinePools + dualstack tests given all of this context?

@killianmuldoon (Contributor, Author)

What should the path forward be for MachinePools + dualstack tests given all of this context?

I think the MachinePool versions of these tests are much flakier than the current ones. We should figure out the issues in the MachinePool PR while continuing to try to triage and fix this separate underlying flake.

@fabriziopandini (Member)

/priority important-soon

@k8s-ci-robot added the priority/important-soon label on Apr 11, 2024
@chrischdi (Member)

This seems to be good since #10424 merged 🎉

https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.*-cluster-api-.*&test=dualstack&xjob=.*-provider-.*

/close

@k8s-ci-robot (Contributor)

@chrischdi: Closing this issue.

In response to this:

This seems to be good since #10424 merged 🎉

https://storage.googleapis.com/k8s-triage/index.html?date=2024-04-15&job=.*-cluster-api-.*&test=dualstack&xjob=.*-provider-.*

/close

@killianmuldoon (Contributor, Author)

Great work on fixing this!

@jackfrancis (Contributor)

Yikes, fun one :)

@willie-yao you should be able to re-introduce MachinePools into tests and we'll be able to confirm right away that there's no MachinePool regression here.

@sbueringer (Member)

Good catch!

@chrischdi (Member)

Note: This got cherry-picked back to v1.5, v1.6 and v1.7.
