OCPBUGS-13810: Update TestAWSELBConnectionIdleTimeout to not use wildcard DNS record #944

Conversation

@gcs278 (Contributor) commented Jun 7, 2023

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner cluster fails to resolve the newly created wildcard DNS record in a reasonable time. To work around this, we switch to using the ELB's hostname, which resolves consistently, and add the "Host" header to the HTTP request.

test/e2e/operator_test.go: Modify TestAWSELBConnectionIdleTimeout to use the ELB hostname and a Host header set to the route hostname

WIP PR with debugging output (which helped me arrive at this solution): #940

Why it's failing:

  • It's important to note that TestAWSELBConnectionIdleTimeout is the only test where we create a new ingress controller and wait for the new wildcard DNS record to propagate to the DNS servers of our CI test runner cluster, so it's unique.
  • TestAWSELBConnectionIdleTimeout tries to resolve the new wildcard DNS record from the CI test runner cluster (usually build01, which is also on AWS). Note that this happens from another cluster: there are two clusters here, the CI test runner and the cluster under test, also known as the ephemeral cluster.
  • The wildcard DNS record resolves within a 1-2 minute window from inside the AWS cluster under test (the ephemeral cluster).
  • However, it now takes up to 15 minutes for the wildcard DNS record to resolve consistently from the CI test runner cluster.
    • I'm not completely sure why this started happening.
    • The theory is that there are ebbs and flows in DNS record propagation across clusters and the internet, especially once caching is involved.
  • Serializing the test doesn't help, as Andy and I both found.
  • Testing shows that the ELB's hostname resolves consistently (there seems to be a difference between Route 53 wildcard and ELB hostname propagation).
  • Querying 8.8.8.8 (Google DNS) also shows failures to resolve and inconsistent results for at least 10 minutes, i.e. it doesn't seem to be just our CI test runner cluster's DNS that is at fault; it's a global issue (see the sketch below this list).
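For illustration only (not part of this PR): a minimal Go sketch of the kind of external check described in the last bullet, pinning the lookup to 8.8.8.8 so it bypasses the CI test runner cluster's own resolvers and caches. The hostname and the timings are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Resolver pinned to Google DNS so the result reflects global propagation
	// rather than the test runner cluster's local DNS servers and caches.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "8.8.8.8:53")
		},
	}
	// Hypothetical name under the newly created wildcard record.
	host := "foo.test-idle-timeout.apps.example.devcluster.openshift.com"
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if ips, err := resolver.LookupIP(context.TODO(), "ip4", host); err == nil && len(ips) > 0 {
			fmt.Printf("%s resolved to %v\n", host, ips)
			return
		}
		time.Sleep(15 * time.Second)
	}
	fmt.Printf("%s did not resolve within 10 minutes\n", host)
}
```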

Resolution:

  • This PR stops using the wildcard DNS record and instead uses the ELB hostname with the request's Host header set to the route hostname (see the sketch after this list)
    • We already do this in unmanaged_dns_test.go; this just reuses the same pattern
    • It doesn't affect the goal of testing the ELB connection idle timeout (wildcard DNS propagation is auxiliary)
  • I suggest adding a backlog item for an E2E test that exercises the propagation of wildcard DNS records to the test runner cluster, possibly keeping historical results so we can see trends
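A minimal sketch of the approach, for illustration only (the actual change lives in test/e2e/operator_test.go; the function and variable names here are made up): the TCP connection is dialed to the ELB's hostname, and the Host header tells the router (HAProxy) which route to serve, so the wildcard record never has to resolve on the client side.

```go
package main

import (
	"net/http"
	"time"
)

// requestViaELB sends a GET to the ELB hostname while presenting the route's
// hostname in the Host header, mirroring the pattern this PR adopts from
// unmanaged_dns_test.go. Both arguments are illustrative.
func requestViaELB(elbHostname, routeHost string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, "http://"+elbHostname, nil)
	if err != nil {
		return nil, err
	}
	// The connection goes to the ELB; the router matches the route by the
	// Host header, so the test client never needs the wildcard DNS lookup.
	req.Host = routeHost
	client := &http.Client{Timeout: 30 * time.Second}
	return client.Do(req)
}
```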

Alternative Solutions:

  • Resolve the DNS name inside the cluster under test and pass it back to the E2E test
    • I started this, but the moving parts of running dig in a pod and capturing its output felt somewhat fragile to me. It seemed less appealing, since we already use the approach described above elsewhere.
  • Run the test serially and use the default ingress controller
  • Increase the timeout to 20 minutes or so to safely let DNS propagate

@openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Jun 7, 2023
@openshift-ci-robot (Contributor)

@gcs278: This pull request references Jira Issue OCPBUGS-13810, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner cluster fails to resolve the newly created wildcard DNS record in a reasonable time. To work around this, we switch to using the ELB's hostname, which resolves consistently, and add the "Host" header to the HTTP request.

test/e2e/operator_test.go: Modify TestAWSELBConnectionIdleTimeout to use ELB hostname and Host header with route hostname

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested review from alebedev87 and miheer June 7, 2023 17:04

OCPBUGS-13810: Update TestAWSELBConnectionIdleTimeout to not use wildcard DNS record

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner
cluster fails to resolve the newly created wildcard DNS record in a reasonable
time. To work around this, we switch to using the ELB's hostname, which resolves
consistently, and add the "Host" header to the HTTP request.

`test/e2e/operator_test.go`: Modify TestAWSELBConnectionIdleTimeout to use
ELB hostname and Host header with route hostname


@gcs278 force-pushed the OCPBUGS-13810-timeout-fix-hostheader branch from 68133e7 to 960a8d6 June 7, 2023 17:59
@gcs278 (Contributor Author) commented Jun 7, 2023

/jira refresh

@openshift-ci-robot added the jira/valid-bug label Jun 7, 2023
@openshift-ci-robot (Contributor)

@gcs278: This pull request references Jira Issue OCPBUGS-13810, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh


@openshift-ci-robot removed the jira/invalid-bug label Jun 7, 2023
@gcs278 (Contributor Author) commented Jun 7, 2023

Round 1: Passed (must-gather failures)
/test e2e-aws-operator

unrelated e2e-hypershift failures
/test e2e-hypershift

@gcs278 (Contributor Author) commented Jun 7, 2023

Round 2: Passed (must-gather failures again)
/test e2e-aws-operator
TestUnmanagedDNSToManagedDNSInternalIngressController and TestUserDefinedIngressController failures for e2e-gcp-operator, but they look very similar to errors in https://issues.redhat.com/browse/OCPBUGS-13106
/test e2e-gcp-operator
same test-node-pool errors with hypershift
/test e2e-hypershift

@gcs278 (Contributor Author) commented Jun 7, 2023

Disruption issues with the OVN tests:
/test e2e-azure-ovn
/test e2e-gcp-ovn

@gcs278 (Contributor Author) commented Jun 9, 2023

Round 3: Passed e2e-aws-operator:
/test e2e-aws-operator

@gcs278 (Contributor Author) commented Jun 12, 2023

Round 4: Passed e2e-aws-operator. I feel like that's enough to prove it works.
/retest

@candita (Contributor) commented Jun 14, 2023

/assign

if err := kclient.Get(context.TODO(), wildcardRecordName, wildcardRecord); err != nil {
	t.Fatalf("failed to get wildcard dnsrecord %s: %v", wildcardRecordName, err)
}
elbHostname := wildcardRecord.Spec.Targets[0]
Contributor

In a wildcard record, could there be more than one target? How do you know which target matches the ELB?

Contributor Author

The DNSRecord CRD allows there to be more than one target, since a real-life DNS record can support multiple targets; however, in practice the Ingress Operator will never make a DNS record with more than 1 target based on the code here:

var target string
var recordType iov1.DNSRecordType
if len(ingress.Hostname) > 0 {
	recordType = iov1.CNAMERecordType
	target = ingress.Hostname
} else {
	recordType = iov1.ARecordType
	target = ingress.IP
}
return true, &iov1.DNSRecord{
	ObjectMeta: metav1.ObjectMeta{
		Namespace:       name.Namespace,
		Name:            name.Name,
		Labels:          dnsRecordLabels,
		OwnerReferences: []metav1.OwnerReference{ownerRef},
		Finalizers:      []string{manifests.DNSRecordFinalizer},
	},
	Spec: iov1.DNSRecordSpec{
		DNSName:             domain,
		DNSManagementPolicy: dnsPolicy,
		Targets:             []string{target},
		RecordType:          recordType,
		RecordTTL:           defaultRecordTTL,
	},
}

And this assumption of a single target is made in all of our DNS providers, e.g.:

domain, target := record.Spec.DNSName, record.Spec.Targets[0]

Address: record.Spec.Targets[0],

err = service.Add(zoneInfo.ID, rr, string(record.Spec.RecordType), record.Spec.Targets[0], record.Spec.RecordTTL)
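Not part of this PR, but to illustrate the point: if the operator ever did produce more than one target, the test's indexing could be guarded with a helper along these lines (iov1 is the same DNSRecord package used in the snippets above; the helper name is made up):

```go
// elbHostnameFromRecord returns the wildcard DNSRecord's single expected
// target, failing the test if the one-target assumption is ever violated.
func elbHostnameFromRecord(t *testing.T, record *iov1.DNSRecord) string {
	t.Helper()
	if len(record.Spec.Targets) != 1 {
		t.Fatalf("expected dnsrecord %s/%s to have exactly 1 target, got %d",
			record.Namespace, record.Name, len(record.Spec.Targets))
	}
	return record.Spec.Targets[0]
}
```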

if err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
_, err := net.LookupIP(route.Spec.Host)
_, err := net.LookupIP(elbHostname)
Contributor

We actually set the route name and create it in https://github.com/openshift/cluster-ingress-operator/pull/944/files#diff-cf4b4e5424070a666a364d1d1b04011478d888d6d3d65c943ca7783333b7a6e4R2616.

I know this issue is seen on platforms going back to 4.11, but did you investigate whether or not we need to wait for the route to be ready after we create it?

Contributor Author

Speaking to the lookup prior to this fix, `_, err := net.LookupIP(route.Spec.Host)`: the readiness of the route (or even the existence of the route itself) should have no impact on the wildcard DNS record that is created by the IngressController.

When we create an ingress controller, in this case called test-idle-timeout, it creates a wildcard DNS record for *.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com for subsequent admitted routes to use.

Before this route is even created, any domain that satisfies this wildcard, e.g. foo.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com, should resolve, regardless of whether a route exists. So we don't actually need the route for this part of the test; we are waiting on the wildcard DNS record for the IngressController. The route in our test has a hostname of idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com, but here the leading label idle-timeout-httpd-openshift-ingress plays the same role as foo: both are matched by the same wildcard record.

Furthermore, Miciah suggested testing another random domain name, with no associated route, inside the same wildcard, so I tested that with https://github.com/openshift/cluster-ingress-operator/pull/940/files#diff-cf4b4e5424070a666a364d1d1b04011478d888d6d3d65c943ca7783333b7a6e4R3053-R3059. The results show it doesn't resolve any quicker than the route hostname and has the same problems. Long story short, we don't even need a route created to demonstrate this DNS propagation issue, just the IngressController and the DNS record (see the sketch below).
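To make that concrete, here is a small illustrative sketch (not from the PR; the domain and labels are made up): once the wildcard record has propagated to a resolver, every label under the IngressController's domain resolves, whether or not a Route with that hostname exists.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical wildcard domain owned by the test IngressController.
	wildcardDomain := "test-idle-timeout.apps.example.devcluster.openshift.com"
	// None of these labels needs a Route; all are matched by the same
	// *.<domain> wildcard record once it has propagated.
	for _, label := range []string{"foo", "idle-timeout-httpd-openshift-ingress", "no-route-here"} {
		host := label + "." + wildcardDomain
		if ips, err := net.LookupIP(host); err != nil {
			fmt.Printf("%s has not propagated to this resolver yet: %v\n", host, err)
		} else {
			fmt.Printf("%s resolves to %v\n", host, ips)
		}
	}
}
```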

// Add the "Host" header to direct request to ELB to the route we are testing which bypasses the need
// for the wildcard DNS record to propagate to the CI test runner cluster's DNS servers which has gotten very slow.
// See https://issues.redhat.com/browse/OCPBUGS-13810
request.Host = route.Spec.Host
Contributor

Do we do this trick anywhere else?

Contributor Author

Yes - in verifyExternalIngressController

which is called by all of the tests in unmanaged_dns_test.go.

Contributor

Interesting that those all call it with a hostname of "apps."+ic.Spec.Domain, and here we are calling it with a created route hostname.

Contributor Author

It's not obvious the way it's written, but "apps."+ic.Spec.Domain is a created route hostname. It's created here as part of echoRoute:

echoRoute := buildRouteWithHost(echoPod.Name, echoPod.Namespace, echoService.Name, hostname)

So verifyExternalIngressController is using the same pattern as this PR's update.

@candita (Contributor) commented Jun 14, 2023

Overall, looks fine. Even knowing the background, I just had some questions.

/lgtm

openshift-ci bot added the lgtm label Jun 14, 2023
@Miciah (Contributor) commented Jun 15, 2023

/approve

openshift-ci bot commented Jun 15, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Jun 15, 2023
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 0e500e6 and 2 for PR HEAD 960a8d6 in total

@Miciah (Contributor) commented Jun 15, 2023

e2e-hypershift failed, but this PR only changes test code that the e2e-hypershift job doesn't run.
/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed again with similar test failures. TestNodePool and other tests are failing and repeating util.go:307: Waiting for hostedcluster rollout. Image: registry.build05.ci.openshift.org/ci-op-fmg8lvsw/release@sha256:c917c02fbe36e9de19dfe489eaf9bbe9ef8e548ffa96d3d2f22dd681e53e126b: status.version.history[0].state is "Partial", but we want "Completed" thousands of times. Based on Slack discussions, I think openshift/image-registry#371 is supposed to fix the e2e-hypershift failures.
/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed because of timeouts pulling from the build05 image registry:

 error: error creating buildah builder: reading signatures: downloading signatures for sha256:5f6c9960be9fa4cd3aca8550d7808945277ac48f42a8f15ca3c5a8ad2efccb78 in registry.build05.ci.openshift.org/ci/managed-clonerefs: received unexpected HTTP status: 504 Gateway Time-out 

/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed again, this time on TestUpgradeControlPlane/EnsureNoCrashingPods and TestUpgradeControlPlane.

TestUpgradeControlPlane/EnsureNoCrashingPods failed with the following output:

{Failed  === RUN   TestUpgradeControlPlane/EnsureNoCrashingPods
    util.go:457: Container socks5-proxy in pod openshift-apiserver-68f7966598-bkkcq has a restartCount > 0 (1)
    --- FAIL: TestUpgradeControlPlane/EnsureNoCrashingPods (0.02s)
}

The container logs have the following:

{"level":"info","ts":"2023-06-18T03:19:12Z","logger":"konnectivity-socks5-proxy","msg":"Starting proxy","version":"openshift/hypershift: 63110f0d9631a8245ccdbb3e22dc1337a09a16f5. Latest supported OCP: 4.14.0"}
panic: Get "https://kube-apiserver:443/api?timeout=32s": dial tcp 172.29.248.126:443: connect: connection refused

goroutine 1 [running]:
github.com/openshift/hypershift/konnectivity-socks5-proxy.NewStartCommand.func2(0xc001124900, {0xc000ede8e0, 0x1, 0x1})
	/hypershift/konnectivity-socks5-proxy/main.go:73 +0x87a
github.com/spf13/cobra.(*Command).execute(0xc001124900, {0xc000ede8a0, 0x1, 0x1})
	/hypershift/vendor/github.com/spf13/cobra/command.go:920 +0xd34
github.com/spf13/cobra.(*Command).ExecuteC(0xc001124000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:1044 +0x8d1
github.com/spf13/cobra.(*Command).Execute(0xc001124000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:968 +0x2f
main.main()
	/hypershift/control-plane-operator/main.go:66 +0x1dc

@enxebre, do you know how to diagnose this further?
/test e2e-hypershift

@Miciah (Contributor) commented Jun 19, 2023

e2e-hypershift failed again with a similar failure:

{Failed  === RUN   TestUpgradeControlPlane/EnsureNoCrashingPods
    util.go:457: Container socks-proxy in pod oauth-openshift-69bdcd88fc-qs5bk has a restartCount > 0 (2)
    --- FAIL: TestUpgradeControlPlane/EnsureNoCrashingPods (0.02s)
}

The container logs have the same panic:

{"level":"info","ts":"2023-06-18T15:13:06Z","logger":"konnectivity-socks5-proxy","msg":"Starting proxy","version":"openshift/hypershift: 63110f0d9631a8245ccdbb3e22dc1337a09a16f5. Latest supported OCP: 4.14.0"}
panic: Get "https://kube-apiserver:443/api?timeout=32s": dial tcp 172.29.67.154:443: connect: connection refused

goroutine 1 [running]:
github.com/openshift/hypershift/konnectivity-socks5-proxy.NewStartCommand.func2(0xc000023200, {0xc001000db0, 0x1, 0x3})
	/hypershift/konnectivity-socks5-proxy/main.go:73 +0x87a
github.com/spf13/cobra.(*Command).execute(0xc000023200, {0xc001000d20, 0x3, 0x3})
	/hypershift/vendor/github.com/spf13/cobra/command.go:920 +0xd34
github.com/spf13/cobra.(*Command).ExecuteC(0xc000022000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:1044 +0x8d1
github.com/spf13/cobra.(*Command).Execute(0xc000022000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:968 +0x2f
main.main()
	/hypershift/control-plane-operator/main.go:66 +0x1dc

/test e2e-hypershift

openshift-ci bot commented Jun 19, 2023

@gcs278: all tests passed!


@openshift-merge-robot merged commit b1a6bb5 into openshift:master Jun 19, 2023
@openshift-ci-robot (Contributor)

@gcs278: Jira Issue OCPBUGS-13810: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13810 has been moved to the MODIFIED state.


@gcs278 (Contributor Author) commented Jun 28, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@gcs278: new pull request created: #955

In response to this:

/cherry-pick release-4.13

