OCPBUGS-13810: Update TestAWSELBConnectionIdleTimeout to not use wildcard DNS record #944

Conversation

@gcs278 (Contributor) commented Jun 7, 2023

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner cluster fails to resolve the newly created wildcard DNS record in a reasonable time. To work around this, we switch to using the ELB's hostname, which resolves consistently, and add the "Host" header to the HTTP request.

test/e2e/operator_test.go: Modify TestAWSELBConnectionIdleTimeout to use the ELB hostname and a Host header set to the route hostname

WIP PR with debugging output (which helped me arrive at this solution): #940

Why it's failing:

  • It's important to note that TestAWSELBConnectionIdleTimeout is the only test where we create a new ingress controller and wait for the new wildcard DNS record to propagate to the DNS servers of our CI test runner cluster, so it's unique.
  • TestAWSELBConnectionIdleTimeout tries to resolve the new wildcard DNS record from the CI test runner cluster (usually build01, which is also on AWS). Note that this happens from another cluster: there are two clusters here, the CI test runner and the cluster under test, also known as the ephemeral cluster.
  • The wildcard DNS record resolves within a 1-2 minute window from inside the AWS cluster under test (the ephemeral cluster).
  • However, it now takes up to 15 minutes for the wildcard DNS record to resolve consistently from the CI test runner cluster.
    • I'm not completely sure why this started happening.
    • The theory is that there are ebbs and flows in DNS record propagation across clusters and the internet, especially once caching is involved.
  • Serializing the test doesn't help, as Andy and I both found.
  • Testing shows that the ELB's hostname resolves consistently (there seems to be a difference between Route 53 wildcard and ELB hostname propagation).
  • Querying 8.8.8.8 (Google DNS) also shows failures to resolve and inconsistent results for at least 10 minutes, i.e. it doesn't seem to be just our CI test runner cluster's DNS that is at fault; it's a global issue (see the sketch below this list).
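For illustration only (not part of this PR): a minimal Go sketch of the kind of external check described in the last bullet, pinning the lookup to 8.8.8.8 so it bypasses the CI test runner cluster's own resolvers and caches. The hostname and the timings are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Resolver pinned to Google DNS so the result reflects global propagation
	// rather than the test runner cluster's local DNS servers and caches.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, "8.8.8.8:53")
		},
	}
	// Hypothetical name under the newly created wildcard record.
	host := "foo.test-idle-timeout.apps.example.devcluster.openshift.com"
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		if ips, err := resolver.LookupIP(context.TODO(), "ip4", host); err == nil && len(ips) > 0 {
			fmt.Printf("%s resolved to %v\n", host, ips)
			return
		}
		time.Sleep(15 * time.Second)
	}
	fmt.Printf("%s did not resolve within 10 minutes\n", host)
}
```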

Resolution:

  • This PR stops using the wildcard DNS record and instead uses the ELB hostname with the request's Host header set to the route hostname (see the sketch after this list)
    • We already do this in unmanaged_dns_test.go; this just reuses the same pattern
    • It doesn't affect the goal of testing the ELB connection idle timeout (wildcard DNS propagation is auxiliary)
  • I suggest adding a backlog item for an E2E test that exercises the propagation of wildcard DNS records to the test runner cluster, possibly keeping historical results so we can see trends
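A minimal sketch of the approach, for illustration only (the actual change lives in test/e2e/operator_test.go; the function and variable names here are made up): the TCP connection is dialed to the ELB's hostname, and the Host header tells the router (HAProxy) which route to serve, so the wildcard record never has to resolve on the client side.

```go
package main

import (
	"net/http"
	"time"
)

// requestViaELB sends a GET to the ELB hostname while presenting the route's
// hostname in the Host header, mirroring the pattern this PR adopts from
// unmanaged_dns_test.go. Both arguments are illustrative.
func requestViaELB(elbHostname, routeHost string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, "http://"+elbHostname, nil)
	if err != nil {
		return nil, err
	}
	// The connection goes to the ELB; the router matches the route by the
	// Host header, so the test client never needs the wildcard DNS lookup.
	req.Host = routeHost
	client := &http.Client{Timeout: 30 * time.Second}
	return client.Do(req)
}
```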

Alternative Solutions:

  • Resolve the DNS name inside the cluster under test and pass it back to the E2E test
    • I started this, but the moving parts of running dig in a pod and capturing its output felt somewhat fragile to me. It seemed less appealing, since we already use the approach described above elsewhere.
  • Run the test serially and use the default ingress controller
  • Increase the timeout to 20 minutes or so to safely let DNS propagate

@openshift-ci-robot added the jira/severity-important, jira/valid-reference, and jira/invalid-bug labels Jun 7, 2023
@openshift-ci-robot (Contributor)

@gcs278: This pull request references Jira Issue OCPBUGS-13810, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner cluster fails to resolve the newly created wildcard DNS record in a reasonable time. To work around this, we switch to using the ELB's hostname, which resolves consistently, and add the "Host" header to the HTTP request.

test/e2e/operator_test.go: Modify TestAWSELBConnectionIdleTimeout to use ELB hostname and Host header with route hostname

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested review from alebedev87 and miheer June 7, 2023 17:04

OCPBUGS-13810: Update TestAWSELBConnectionIdleTimeout to not use wildcard DNS record

TestAWSELBConnectionIdleTimeout has been very flaky because the CI test runner
cluster fails to resolve the newly created wildcard DNS record in a reasonable
time. To work around this, we switch to using the ELB's hostname, which resolves
consistently, and add the "Host" header to the HTTP request.

`test/e2e/operator_test.go`: Modify TestAWSELBConnectionIdleTimeout to use
ELB hostname and Host header with route hostname


@gcs278 force-pushed the OCPBUGS-13810-timeout-fix-hostheader branch from 68133e7 to 960a8d6 June 7, 2023 17:59
@gcs278 (Contributor Author) commented Jun 7, 2023

/jira refresh

@openshift-ci-robot added the jira/valid-bug label Jun 7, 2023
@openshift-ci-robot (Contributor)

@gcs278: This pull request references Jira Issue OCPBUGS-13810, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh


@openshift-ci-robot removed the jira/invalid-bug label Jun 7, 2023
@gcs278 (Contributor Author) commented Jun 7, 2023

Round 1: Passed (must-gather failures)
/test e2e-aws-operator

unrelated e2e-hypershift failures
/test e2e-hypershift

@gcs278 (Contributor Author) commented Jun 7, 2023

Round 2: Passed (must-gather failures again)
/test e2e-aws-operator
TestUnmanagedDNSToManagedDNSInternalIngressController and TestUserDefinedIngressController failures for e2e-gcp-operator, but they look very similar to errors in https://issues.redhat.com/browse/OCPBUGS-13106
/test e2e-gcp-operator
same test-node-pool errors with hypershift
/test e2e-hypershift

@gcs278 (Contributor Author) commented Jun 7, 2023

Disruption issues with the OVN tests:
/test e2e-azure-ovn
/test e2e-gcp-ovn

@gcs278 (Contributor Author) commented Jun 9, 2023

Round 3: Passed e2e-aws-operator:
/test e2e-aws-operator

@gcs278 (Contributor Author) commented Jun 12, 2023

Round 4: Passed e2e-aws-operator. I feel like that's enough to prove it works.
/retest

@candita (Contributor) commented Jun 14, 2023

/assign

if err := kclient.Get(context.TODO(), wildcardRecordName, wildcardRecord); err != nil {
	t.Fatalf("failed to get wildcard dnsrecord %s: %v", wildcardRecordName, err)
}
elbHostname := wildcardRecord.Spec.Targets[0]
Contributor

In a wildcard record, could there be more than one target? How do you know which target matches the ELB?

Contributor Author

The DNSRecord CRD allows there to be more than one target, since a real-life DNS record can support multiple targets; however, in practice the Ingress Operator will never make a DNS record with more than 1 target based on the code here:

var target string
var recordType iov1.DNSRecordType
if len(ingress.Hostname) > 0 {
	recordType = iov1.CNAMERecordType
	target = ingress.Hostname
} else {
	recordType = iov1.ARecordType
	target = ingress.IP
}
return true, &iov1.DNSRecord{
	ObjectMeta: metav1.ObjectMeta{
		Namespace:       name.Namespace,
		Name:            name.Name,
		Labels:          dnsRecordLabels,
		OwnerReferences: []metav1.OwnerReference{ownerRef},
		Finalizers:      []string{manifests.DNSRecordFinalizer},
	},
	Spec: iov1.DNSRecordSpec{
		DNSName:             domain,
		DNSManagementPolicy: dnsPolicy,
		Targets:             []string{target},
		RecordType:          recordType,
		RecordTTL:           defaultRecordTTL,
	},
}

And this assumption of a single target is made in all of our DNS providers, e.g.:

domain, target := record.Spec.DNSName, record.Spec.Targets[0]

Address: record.Spec.Targets[0],

err = service.Add(zoneInfo.ID, rr, string(record.Spec.RecordType), record.Spec.Targets[0], record.Spec.RecordTTL)
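Not part of this PR, but to illustrate the point: if the operator ever did produce more than one target, the test's indexing could be guarded with a helper along these lines (iov1 is the same DNSRecord package used in the snippets above; the helper name is made up):

```go
// elbHostnameFromRecord returns the wildcard DNSRecord's single expected
// target, failing the test if the one-target assumption is ever violated.
func elbHostnameFromRecord(t *testing.T, record *iov1.DNSRecord) string {
	t.Helper()
	if len(record.Spec.Targets) != 1 {
		t.Fatalf("expected dnsrecord %s/%s to have exactly 1 target, got %d",
			record.Namespace, record.Name, len(record.Spec.Targets))
	}
	return record.Spec.Targets[0]
}
```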

if err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
_, err := net.LookupIP(route.Spec.Host)
_, err := net.LookupIP(elbHostname)
Contributor

We actually set the route name and create it in https://github.com/openshift/cluster-ingress-operator/pull/944/files#diff-cf4b4e5424070a666a364d1d1b04011478d888d6d3d65c943ca7783333b7a6e4R2616.

I know this issue is seen on platforms going back to 4.11, but did you investigate whether or not we need to wait for the route to be ready after we create it?

Contributor Author

Speaking to the lookup prior to this fix, `_, err := net.LookupIP(route.Spec.Host)`: the readiness of the route (or even the existence of the route itself) should have no impact on the wildcard DNS record that is created by the IngressController.

When we create an ingress controller, in this case called test-idle-timeout, it creates a wildcard DNS record for *.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com for subsequent admitted routes to use.

Before this route is even created, any domain that satisfies this wildcard, e.g. foo.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com, should resolve, regardless of whether a route exists. So we don't actually need the route for this part of the test; we are waiting on the wildcard DNS record for the IngressController. The route in our test has a hostname of idle-timeout-httpd-openshift-ingress.test-idle-timeout.ci-op-rh28w0d0-43abb.origin-ci-int-aws.dev.rhcloud.com, but here the leading label idle-timeout-httpd-openshift-ingress plays the same role as foo: both are matched by the same wildcard record.

Furthermore, Miciah suggested testing another random domain name, with no associated route, inside the same wildcard, so I tested that with https://github.com/openshift/cluster-ingress-operator/pull/940/files#diff-cf4b4e5424070a666a364d1d1b04011478d888d6d3d65c943ca7783333b7a6e4R3053-R3059. The results show it doesn't resolve any quicker than the route hostname and has the same problems. Long story short, we don't even need a route created to demonstrate this DNS propagation issue, just the IngressController and the DNS record (see the sketch below).
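To make that concrete, here is a small illustrative sketch (not from the PR; the domain and labels are made up): once the wildcard record has propagated to a resolver, every label under the IngressController's domain resolves, whether or not a Route with that hostname exists.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Hypothetical wildcard domain owned by the test IngressController.
	wildcardDomain := "test-idle-timeout.apps.example.devcluster.openshift.com"
	// None of these labels needs a Route; all are matched by the same
	// *.<domain> wildcard record once it has propagated.
	for _, label := range []string{"foo", "idle-timeout-httpd-openshift-ingress", "no-route-here"} {
		host := label + "." + wildcardDomain
		if ips, err := net.LookupIP(host); err != nil {
			fmt.Printf("%s has not propagated to this resolver yet: %v\n", host, err)
		} else {
			fmt.Printf("%s resolves to %v\n", host, ips)
		}
	}
}
```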

// Add the "Host" header to direct request to ELB to the route we are testing which bypasses the need
// for the wildcard DNS record to propagate to the CI test runner cluster's DNS servers which has gotten very slow.
// See https://issues.redhat.com/browse/OCPBUGS-13810
request.Host = route.Spec.Host
Contributor

Do we do this trick anywhere else?

Contributor Author

Yes - in verifyExternalIngressController

which is called by all of the tests in unmanaged_dns_test.go.

Contributor

Interesting that those all call it with a hostname of "apps."+ic.Spec.Domain, and here we are calling it with a created route hostname.

Contributor Author

It's not obvious the way it's written, but "apps."+ic.Spec.Domain is a created route hostname. It's created here as part of echoRoute:

echoRoute := buildRouteWithHost(echoPod.Name, echoPod.Namespace, echoService.Name, hostname)

So verifyExternalIngressController is using the same pattern as this PR's update.

@candita (Contributor) commented Jun 14, 2023

Overall, looks fine. Even knowing the background, I just had some questions.

/lgtm

openshift-ci bot added the lgtm label Jun 14, 2023
@Miciah (Contributor) commented Jun 15, 2023

/approve

openshift-ci bot commented Jun 15, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Jun 15, 2023
@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 0e500e6 and 2 for PR HEAD 960a8d6 in total

@Miciah (Contributor) commented Jun 15, 2023

e2e-hypershift failed, but this PR only changes test code that the e2e-hypershift job doesn't run.
/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed again with similar test failures. TestNodePool and other tests are failing and repeating util.go:307: Waiting for hostedcluster rollout. Image: registry.build05.ci.openshift.org/ci-op-fmg8lvsw/release@sha256:c917c02fbe36e9de19dfe489eaf9bbe9ef8e548ffa96d3d2f22dd681e53e126b: status.version.history[0].state is "Partial", but we want "Completed" thousands of times. Based on Slack discussions, I think openshift/image-registry#371 is supposed to fix the e2e-hypershift failures.
/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed because of timeouts pulling from the build05 image registry:

 error: error creating buildah builder: reading signatures: downloading signatures for sha256:5f6c9960be9fa4cd3aca8550d7808945277ac48f42a8f15ca3c5a8ad2efccb78 in registry.build05.ci.openshift.org/ci/managed-clonerefs: received unexpected HTTP status: 504 Gateway Time-out 

/test e2e-hypershift

@Miciah (Contributor) commented Jun 18, 2023

e2e-hypershift failed again, this time on TestUpgradeControlPlane/EnsureNoCrashingPods and TestUpgradeControlPlane.

TestUpgradeControlPlane/EnsureNoCrashingPods failed with the following output:

{Failed  === RUN   TestUpgradeControlPlane/EnsureNoCrashingPods
    util.go:457: Container socks5-proxy in pod openshift-apiserver-68f7966598-bkkcq has a restartCount > 0 (1)
    --- FAIL: TestUpgradeControlPlane/EnsureNoCrashingPods (0.02s)
}

The container logs have the following:

{"level":"info","ts":"2023-06-18T03:19:12Z","logger":"konnectivity-socks5-proxy","msg":"Starting proxy","version":"openshift/hypershift: 63110f0d9631a8245ccdbb3e22dc1337a09a16f5. Latest supported OCP: 4.14.0"}
panic: Get "https://kube-apiserver:443/api?timeout=32s": dial tcp 172.29.248.126:443: connect: connection refused

goroutine 1 [running]:
github.com/openshift/hypershift/konnectivity-socks5-proxy.NewStartCommand.func2(0xc001124900, {0xc000ede8e0, 0x1, 0x1})
	/hypershift/konnectivity-socks5-proxy/main.go:73 +0x87a
github.com/spf13/cobra.(*Command).execute(0xc001124900, {0xc000ede8a0, 0x1, 0x1})
	/hypershift/vendor/github.com/spf13/cobra/command.go:920 +0xd34
github.com/spf13/cobra.(*Command).ExecuteC(0xc001124000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:1044 +0x8d1
github.com/spf13/cobra.(*Command).Execute(0xc001124000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:968 +0x2f
main.main()
	/hypershift/control-plane-operator/main.go:66 +0x1dc

@enxebre, do you know how to diagnose this further?
/test e2e-hypershift

@Miciah (Contributor) commented Jun 19, 2023

e2e-hypershift failed again with a similar failure:

{Failed  === RUN   TestUpgradeControlPlane/EnsureNoCrashingPods
    util.go:457: Container socks-proxy in pod oauth-openshift-69bdcd88fc-qs5bk has a restartCount > 0 (2)
    --- FAIL: TestUpgradeControlPlane/EnsureNoCrashingPods (0.02s)
}

The container logs have the same panic:

{"level":"info","ts":"2023-06-18T15:13:06Z","logger":"konnectivity-socks5-proxy","msg":"Starting proxy","version":"openshift/hypershift: 63110f0d9631a8245ccdbb3e22dc1337a09a16f5. Latest supported OCP: 4.14.0"}
panic: Get "https://kube-apiserver:443/api?timeout=32s": dial tcp 172.29.67.154:443: connect: connection refused

goroutine 1 [running]:
github.com/openshift/hypershift/konnectivity-socks5-proxy.NewStartCommand.func2(0xc000023200, {0xc001000db0, 0x1, 0x3})
	/hypershift/konnectivity-socks5-proxy/main.go:73 +0x87a
github.com/spf13/cobra.(*Command).execute(0xc000023200, {0xc001000d20, 0x3, 0x3})
	/hypershift/vendor/github.com/spf13/cobra/command.go:920 +0xd34
github.com/spf13/cobra.(*Command).ExecuteC(0xc000022000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:1044 +0x8d1
github.com/spf13/cobra.(*Command).Execute(0xc000022000)
	/hypershift/vendor/github.com/spf13/cobra/command.go:968 +0x2f
main.main()
	/hypershift/control-plane-operator/main.go:66 +0x1dc

/test e2e-hypershift

openshift-ci bot commented Jun 19, 2023

@gcs278: all tests passed!


@openshift-merge-robot merged commit b1a6bb5 into openshift:master Jun 19, 2023
@openshift-ci-robot (Contributor)

@gcs278: Jira Issue OCPBUGS-13810: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-13810 has been moved to the MODIFIED state.


@gcs278 (Contributor Author) commented Jun 28, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@gcs278: new pull request created: #955

In response to this:

/cherry-pick release-4.13

