
OCPBUGS-15605: Update bufsize to 1232 bytes #370

Merged
merged 1 commit on Aug 17, 2023

Conversation

gcs278
Contributor

@gcs278 gcs278 commented Jun 30, 2023

As endorsed at DNS Flag Day, the DNS Community recommends a bufsize setting of 1232 as a safe default that supports larger payloads, while generally avoiding IP fragmentation on most networks. This is particularly relevant for payloads like those generated by DNSSEC, which tend to be larger.

Previously, CoreDNS always used the EDNS0 extension, which enables UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded DNS queries to an upstream name server, and so OpenShift specified a bufsize setting of 512 to maintain compatibility with applications and name servers that did not support the EDNS0 extension.

For clients and name servers that do support EDNS0, a bufsize setting of 512 can result in more DNS truncation and unnecessary TCP retransmissions, resulting in worse DNS performance for most users. This is because a response larger than the bufsize setting gets truncated, prompting the client to retry over TCP. In this situation, two DNS requests are made for a single DNS answer, leading to higher bandwidth usage and longer response times.
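To make that cost concrete, here is a small, self-contained Python sketch (not CoreDNS code; the function name and byte sizes are illustrative) of why a response larger than the advertised bufsize costs an extra exchange:

```python
# Sketch: why a larger EDNS0 bufsize reduces TCP fallback.
# If a UDP response exceeds the advertised bufsize, the server
# truncates it (sets the TC bit), and the client retries over TCP,
# doubling the number of exchanges for a single answer.

def exchanges_needed(response_size: int, bufsize: int) -> int:
    """Return how many DNS exchanges one answer costs."""
    if response_size <= bufsize:
        return 1  # fits in a single UDP response
    return 2      # truncated UDP reply + full TCP retry

# A 900-byte DNSSEC-style response under the old and new defaults:
print(exchanges_needed(900, 512))   # -> 2 (truncated, TCP retry)
print(exchanges_needed(900, 1232))  # -> 1 (fits in UDP)
```

With the 1232-byte default, responses between 513 and 1232 bytes stay on the single-exchange path.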

Currently, CoreDNS no longer uses EDNS0 when forwarding requests if the original client request did not use EDNS0 (ref: coredns/coredns@a5b9749), and so the reasoning for using a bufsize setting of 512 no longer applies. By increasing the bufsize setting to the recommended value of 1232 bytes, we can enhance DNS performance by decreasing the probability of DNS truncations.

Using a larger bufsize setting of 1232 bytes also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829 in which a non-compliant upstream DNS is not respecting a bufsize of 512 bytes and sending larger-than-512-bytes responses. A bufsize setting of 1232 bytes doesn't fix the root cause of this issue; rather, it decreases the likelihood of its occurrence by increasing the acceptable size range for UDP responses.

Note that clients that don’t support EDNS0 or TCP, such as applications built using older versions of Alpine Linux, are still subject to the aforementioned truncation issue. To avoid these issues, ensure that your application is built using a DNS resolver library that supports EDNS0 or TCP-based DNS queries.
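For reference, the setting lives in the CoreDNS Corefile rendered by the DNS operator. A minimal sketch of a server block with the new value (the port and the surrounding plugins here are illustrative, not the exact OpenShift Corefile):

```
.:5353 {
    bufsize 1232
    errors
    cache 900
    forward . /etc/resolv.conf
}
```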

Brief history of OpenShift's Bufsize changes:

  1. During the development of OpenShift 4.8.0, we updated to 1232 bytes due to Bug - 1949361 and backported to 4.7 and 4.6. However, later on, 4.8.0 (in development), 4.7, and 4.6 were reverted back to 512 bytes due to Bug - 1966116.
  2. Also in OpenShift 4.8.0, we bumped CoreDNS to v1.8.1, and picked up a commit that forced DNS queries that did not have the DO Bit (DNSSEC) set to set bufsize as 2048 bytes despite 512 bytes being set in the configuration.
  3. In OpenShift 4.12.0, we fixed OCPBUGS-240 to limit all DNS queries, specifically queries that had DO Bit off, to what is configured in the configuration file (512 bytes) and we backported the fix to 4.11, 4.10, and 4.9.
  4. Now, this PR is changing bufsize to 1232 bytes.

Slack Thread

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2023
@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

A bufsize of 1232 is recommended as a safe default which will support larger payloads generated by DNSSEC while generally avoiding IP Fragmentation on most networks. The reasoning for using bufsize 512 no longer applies as CoreDNS no longer forces EDNS0 for upstream DNS requests. Bufsize of 1232 will reduce DNS truncation and unnecessary TCP retransmits, resulting in better DNS performance for most users.

A setting of 512 effectively forgoes what EDNS0 was designed to allow: a larger-than-512-byte UDP payload. Smaller bufsizes cause DNS truncation for larger-than-512-byte responses, which triggers clients to retry over TCP, resulting in two DNS requests for one DNS answer. DNSSEC often produces >512-byte UDP responses.

Here is a brief history of OpenShift's Bufsize changes:

  1. In 4.8, we switched to 1232 due to 1949361 – CoreDNS resolution failure for external hostnames with "A: dns: overflow unpacking uint16". This was also backported to 4.7.
  2. But also in 4.8, before 4.8.0 shipped, it was quickly reverted back to 512 due to 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11, because the 4.7 backport was causing errors.
  3. Then, also in 4.8, a bump to CoreDNS inadvertently and silently raised the bufsize to 2048, while we were still specifying 512.
  4. Then a customer opened https://issues.redhat.com/browse/OCPBUGS-240, because 2048 is too big for their firewall.
  5. Then we fixed that in 4.12, since bufsize was unintentionally 2048 despite 512 being specified in the Corefile, and backported the fix to 4.9.
  6. Now, this bug is arguing that 512 is not a reasonable default.

The reason 512 was chosen was the bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11 and fix #276. However, this has been resolved in later versions of Go via golang/go@301fd8a and, more importantly, CoreDNS fixed forcing upstream EDNS0 on for clients not using EDNS0. We have this fix in 4.11+:

% git -C ~/src/github.com/openshift/coredns branch --contains=a5b9749462a9717c8920dba095f242611c61a989 --remote | grep -e origin/release-
 origin/release-4.11
 origin/release-4.12
 origin/release-4.13
 origin/release-4.14

So, a 512 bufsize is no longer needed to work around the bugs it was triggered by.

Updating the bufsize to 1232 also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829, in which a non-compliant upstream DNS does not respect a bufsize of 512 and sends larger-than-512-byte responses. A bufsize of 1232 doesn't fix this problem, but makes it less likely to occur by widening the allowable size of UDP responses so CoreDNS doesn't throw errors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 30, 2023
@melvinjoseph86

/retest-required

@melvinjoseph86

Verified using pre-image verification.
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.ci.test-2023-07-03-065202-ci-ln-02m3pxb-latest True False 30m Cluster version is 4.14.0-0.ci.test-2023-07-03-065202-ci-ln-02m3pxb-latest
melvinjoseph@mjoseph-mac Downloads %

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get configmaps/dns-default -o yaml | grep -i bufsize
bufsize 1232

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@Miciah
Contributor

Miciah commented Jul 12, 2023

/assign

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

A bufsize of 1232 is recommended as a safe default which will support larger payloads generated by DNSSEC while generally avoiding IP Fragmentation on most networks. The reasoning for using bufsize 512 no longer applies as CoreDNS no longer forces EDNS0 for upstream DNS requests. Bufsize of 1232 will reduce DNS truncation and unnecessary TCP retransmits, resulting in better DNS performance for most users.

A setting of 512 effectively forgoes what EDNS0 was designed to allow: a larger-than-512-byte UDP payload. Smaller bufsizes cause DNS truncation for larger-than-512-byte responses, which triggers clients to retry over TCP, resulting in two DNS requests for one DNS answer. DNSSEC often produces >512-byte UDP responses.

Here is a brief history of OpenShift's Bufsize changes:

  1. In OpenShift 4.8, we switched to 1232 bytes due to Bug - 1949361, but later it was reverted back to 512 bytes due to Bug - 1966116.
  2. Also in OpenShift 4.8, we bumped CoreDNS to v1.8.1, and picked up a commit that forced DNS queries that did not have the DO Bit (DNSSEC) set to set bufsize as 2048 bytes despite 512 bytes being set in the configuration.
  3. In OpenShift 4.12, we fixed OCPBUGS-240 to limit all DNS queries, specifically queries that had DO Bit off, to what is configured in the configuration file (512 bytes).
  4. We backported the fix for OCPBUGS-240 to 4.9.
  5. Now, this PR is changing bufsize to 1232 bytes, which is the recommended value.

The reason 512 was chosen was the bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11 and fix #276. However, this has been resolved in later versions of Go via golang/go@301fd8a and, more importantly, CoreDNS fixed forcing upstream EDNS0 on for clients not using EDNS0. We have this fix in 4.11+:

% git -C ~/src/github.com/openshift/coredns branch --contains=a5b9749462a9717c8920dba095f242611c61a989 --remote | grep -e origin/release-
 origin/release-4.11
 origin/release-4.12
 origin/release-4.13
 origin/release-4.14

So, the 512 bufsize configuration is no longer needed to work around Bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11.

Updating the bufsize to 1232 also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829, in which a non-compliant upstream DNS does not respect a bufsize of 512 and sends larger-than-512-byte responses. A bufsize of 1232 doesn't fix this problem, but makes it less likely to occur by widening the allowable size of UDP responses so CoreDNS doesn't throw errors.

Slack Thread



@Miciah
Contributor

Miciah commented Aug 16, 2023

/approve
Grant is going to update the commit message, at which time I'll add an /lgtm.

We'll also run some payload tests before merging.
/hold
We can release the hold if the payload tests pass.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 16, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2023
As endorsed at [DNS Flag Day](http://www.dnsflagday.net/2020/),
the DNS Community recommends a bufsize setting of 1232 as a
safe default that supports larger payloads, while generally
avoiding IP fragmentation on most networks. This is particularly
relevant for payloads like those generated by DNSSEC, which tend to
be larger.

Previously, CoreDNS always used the EDNS0 extension, which enables
UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded
DNS queries to an upstream name server, and so OpenShift specified
a bufsize setting of 512 to maintain compatibility with applications
and name servers that did not support the EDNS0 extension.

Currently, CoreDNS no longer uses EDNS0 when forwarding requests
if the original client request did not use EDNS0, and so the
reasoning for using a bufsize setting of 512 no longer applies.
By increasing the bufsize setting to the recommended value of 1232 bytes,
we can enhance DNS performance by decreasing the probability of DNS
truncations.

This PR updates the bufsize setting for CoreDNS from 512 to 1232 bytes.
@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

/payload 4.14 nightly blocking

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: trigger 8 job(s) of type blocking for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a1b9a5e0-3c51-11ee-9da9-eff9ee711bbc-0

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@Miciah
Contributor

Miciah commented Aug 16, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2023
@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

failures to create the sandbox
/test e2e-aws-ovn-single-node

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 944f87d link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] Seems unrelated
/test e2e-aws-ovn-single-node

@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

I've reviewed the payload results: https://pr-payload-tests.ci.openshift.org/runs/ci/a1b9a5e0-3c51-11ee-9da9-eff9ee711bbc-0. 4 passes and 4 failures, but nothing stands out as a DNS regression yet.

The 4 failures:

The other 4 payload jobs succeeded.

It's probably worth kicking off these two jobs for major unrelated failures:
/payload-job periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm
/payload-job periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance

But, generally speaking, I don't see any reason to be concerned yet.

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: trigger 2 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm
  • periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/77e68c70-3c8c-11ee-8d44-3ef99b9b171f-0

@gcs278
Contributor Author

gcs278 commented Aug 17, 2023

2nd try of payload testing:

No indication that payload testing is failing due to this DNS update. I feel we are ready to merge.
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2023
@openshift-merge-robot openshift-merge-robot merged commit 3f0692c into openshift:master Aug 17, 2023
9 checks passed
@openshift-ci-robot
Contributor

@gcs278: Jira Issue OCPBUGS-15605: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-15605 has been moved to the MODIFIED state.

