
OCPBUGS-15605: Update bufsize to 1232 bytes #370

Merged
merged 1 commit on Aug 17, 2023

Conversation

gcs278
Contributor

@gcs278 gcs278 commented Jun 30, 2023

As endorsed at DNS Flag Day, the DNS Community recommends a bufsize setting of 1232 as a safe default that supports larger payloads, while generally avoiding IP fragmentation on most networks. This is particularly relevant for payloads like those generated by DNSSEC, which tend to be larger.

Previously, CoreDNS always used the EDNS0 extension, which enables UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded DNS queries to an upstream name server, and so OpenShift specified a bufsize setting of 512 to maintain compatibility with applications and name servers that did not support the EDNS0 extension.

For clients and name servers that do support EDNS0, a bufsize setting of 512 can result in more DNS truncation and unnecessary TCP retransmissions, resulting in worse DNS performance for most users. This is because a response larger than the bufsize setting gets truncated, prompting the client to retry over TCP. In this situation, two DNS requests are made for a single DNS answer, leading to higher bandwidth usage and longer response times.
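To make that cost concrete, here is a small, self-contained Python sketch (not CoreDNS code; the function name and byte sizes are illustrative) of why a response larger than the advertised bufsize costs an extra exchange:

```python
# Sketch: why a larger EDNS0 bufsize reduces TCP fallback.
# If a UDP response exceeds the advertised bufsize, the server
# truncates it (sets the TC bit), and the client retries over TCP,
# doubling the number of exchanges for a single answer.

def exchanges_needed(response_size: int, bufsize: int) -> int:
    """Return how many DNS exchanges one answer costs."""
    if response_size <= bufsize:
        return 1  # fits in a single UDP response
    return 2      # truncated UDP reply + full TCP retry

# A 900-byte DNSSEC-style response under the old and new defaults:
print(exchanges_needed(900, 512))   # -> 2 (truncated, TCP retry)
print(exchanges_needed(900, 1232))  # -> 1 (fits in UDP)
```

With the 1232-byte default, responses between 513 and 1232 bytes stay on the single-exchange path.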

Currently, CoreDNS no longer uses EDNS0 when forwarding requests if the original client request did not use EDNS0 (ref: coredns/coredns@a5b9749), and so the reasoning for using a bufsize setting of 512 no longer applies. By increasing the bufsize setting to the recommended value of 1232 bytes, we can enhance DNS performance by decreasing the probability of DNS truncations.

Using a larger bufsize setting of 1232 bytes also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829 in which a non-compliant upstream DNS is not respecting a bufsize of 512 bytes and sending larger-than-512-bytes responses. A bufsize setting of 1232 bytes doesn't fix the root cause of this issue; rather, it decreases the likelihood of its occurrence by increasing the acceptable size range for UDP responses.

Note that clients that don’t support EDNS0 or TCP, such as applications built using older versions of Alpine Linux, are still subject to the aforementioned truncation issue. To avoid these issues, ensure that your application is built using a DNS resolver library that supports EDNS0 or TCP-based DNS queries.
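For reference, the setting lives in the CoreDNS Corefile rendered by the DNS operator. A minimal sketch of a server block with the new value (the port and the surrounding plugins here are illustrative, not the exact OpenShift Corefile):

```
.:5353 {
    bufsize 1232
    errors
    cache 900
    forward . /etc/resolv.conf
}
```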

Brief history of OpenShift's Bufsize changes:

  1. During the development of OpenShift 4.8.0, we updated to 1232 bytes due to Bug - 1949361 and backported to 4.7 and 4.6. However, later on, 4.8.0 (in development), 4.7, and 4.6 were reverted back to 512 bytes due to Bug - 1966116.
  2. Also in OpenShift 4.8.0, we bumped CoreDNS to v1.8.1, and picked up a commit that forced DNS queries that did not have the DO Bit (DNSSEC) set to set bufsize as 2048 bytes despite 512 bytes being set in the configuration.
  3. In OpenShift 4.12.0, we fixed OCPBUGS-240 to limit all DNS queries, specifically queries that had DO Bit off, to what is configured in the configuration file (512 bytes) and we backported the fix to 4.11, 4.10, and 4.9.
  4. Now, this PR is changing bufsize to 1232 bytes.

Slack Thread

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 30, 2023
@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

A bufsize of 1232 is recommended as a safe default which will support larger payloads generated by DNSSEC while generally avoiding IP Fragmentation on most networks. The reasoning for using bufsize 512 no longer applies as CoreDNS no longer forces EDNS0 for upstream DNS requests. Bufsize of 1232 will reduce DNS truncation and unnecessary TCP retransmits, resulting in better DNS performance for most users.

A setting of 512 effectively forgoes what EDNS0 was designed to allow: a larger-than-512-byte UDP payload. Smaller bufsizes cause DNS truncation for larger-than-512-byte responses, which triggers clients to retry over TCP, resulting in two DNS requests for one DNS answer. DNSSEC often produces >512-byte UDP responses.

Here is a brief history of OpenShift's Bufsize changes:

  1. In 4.8, we switched to 1232 due to 1949361 – CoreDNS resolution failure for external hostnames with "A: dns: overflow unpacking uint16". This was also backported to 4.7.
  2. But also in 4.8, before 4.8.0 shipped, it was quickly reverted back to 512 due to 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11, because the 4.7 backport was causing errors.
  3. Then, also in 4.8, a bump to CoreDNS inadvertently and silently raised the bufsize to 2048, while we were still specifying 512.
  4. Then a customer opened https://issues.redhat.com/browse/OCPBUGS-240, because 2048 is too big for their firewall.
  5. Then we fixed that in 4.12, since bufsize was unintentionally 2048 despite 512 being specified in the Corefile, and backported the fix to 4.9.
  6. Now, this bug is arguing that 512 is not a reasonable default.

The reason 512 was chosen was the bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11 and fix #276. However, this has been resolved in later versions of Go via golang/go@301fd8a and, more importantly, CoreDNS fixed forcing upstream EDNS0 on for clients not using EDNS0. We have this fix in 4.11+:

% git -C ~/src/github.com/openshift/coredns branch --contains=a5b9749462a9717c8920dba095f242611c61a989 --remote | grep -e origin/release-
 origin/release-4.11
 origin/release-4.12
 origin/release-4.13
 origin/release-4.14

So, a 512 bufsize is no longer needed to work around the bugs it was triggered by.

Updating the bufsize to 1232 also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829, in which a non-compliant upstream DNS does not respect a bufsize of 512 and sends larger-than-512-byte responses. A bufsize of 1232 doesn't fix this problem, but makes it less likely to occur by widening the allowable size of UDP responses so CoreDNS doesn't throw errors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jun 30, 2023
@melvinjoseph86

/retest-required

@melvinjoseph86

Verified using pre-image verification.
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.ci.test-2023-07-03-065202-ci-ln-02m3pxb-latest True False 30m Cluster version is 4.14.0-0.ci.test-2023-07-03-065202-ci-ln-02m3pxb-latest
melvinjoseph@mjoseph-mac Downloads %

melvinjoseph@mjoseph-mac Downloads % oc -n openshift-dns get configmaps/dns-default -o yaml | grep -i bufsize
bufsize 1232

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@Miciah
Contributor

Miciah commented Jul 12, 2023

/assign

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

A bufsize of 1232 is recommended as a safe default which will support larger payloads generated by DNSSEC while generally avoiding IP Fragmentation on most networks. The reasoning for using bufsize 512 no longer applies as CoreDNS no longer forces EDNS0 for upstream DNS requests. Bufsize of 1232 will reduce DNS truncation and unnecessary TCP retransmits, resulting in better DNS performance for most users.

A setting of 512 effectively forgoes what EDNS0 was designed to allow: a larger-than-512-byte UDP payload. Smaller bufsizes cause DNS truncation for larger-than-512-byte responses, which triggers clients to retry over TCP, resulting in two DNS requests for one DNS answer. DNSSEC often produces >512-byte UDP responses.

Here is a brief history of OpenShift's Bufsize changes:

  1. In OpenShift 4.8, we switched to 1232 bytes due to Bug - 1949361, but later it was reverted back to 512 bytes due to Bug - 1966116.
  2. Also in OpenShift 4.8, we bumped CoreDNS to v1.8.1, and picked up a commit that forced DNS queries that did not have the DO Bit (DNSSEC) set to set bufsize as 2048 bytes despite 512 bytes being set in the configuration.
  3. In OpenShift 4.12, we fixed OCPBUGS-240 to limit all DNS queries, specifically queries that had DO Bit off, to what is configured in the configuration file (512 bytes).
  4. We backported the fix for OCPBUGS-240 to 4.9.
  5. Now, this PR is changing bufsize to 1232 bytes, which is the recommended value.

The reason 512 was chosen was the bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11 and fix #276. However, this has been resolved in later versions of Go via golang/go@301fd8a and, more importantly, CoreDNS fixed forcing upstream EDNS0 on for clients not using EDNS0. We have this fix in 4.11+:

% git -C ~/src/github.com/openshift/coredns branch --contains=a5b9749462a9717c8920dba095f242611c61a989 --remote | grep -e origin/release-
 origin/release-4.11
 origin/release-4.12
 origin/release-4.13
 origin/release-4.14

So, the 512 bufsize configuration is no longer needed to work around Bug 1966116 – DNS SRV request which worked in 4.7.9 stopped working in 4.7.11.

Updating the bufsize to 1232 also would potentially help alleviate bugs like https://issues.redhat.com/browse/OCPBUGS-6829, in which a non-compliant upstream DNS does not respect a bufsize of 512 and sends larger-than-512-byte responses. A bufsize of 1232 doesn't fix this problem, but makes it less likely to occur by widening the allowable size of UDP responses so CoreDNS doesn't throw errors.

Slack Thread



@Miciah
Contributor

Miciah commented Aug 16, 2023

/approve
Grant is going to update the commit message, at which time I'll add an /lgtm.

We'll also run some payload tests before merging.
/hold
We can release the hold if the payload tests pass.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 16, 2023
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 16, 2023
As endorsed at [DNS Flag Day](http://www.dnsflagday.net/2020/),
the DNS Community recommends a bufsize setting of 1232 as a
safe default that supports larger payloads, while generally
avoiding IP fragmentation on most networks. This is particularly
relevant for payloads like those generated by DNSSEC, which tend to
be larger.

Previously, CoreDNS always used the EDNS0 extension, which enables
UDP-based DNS queries to exceed 512 bytes, when CoreDNS forwarded
DNS queries to an upstream name server, and so OpenShift specified
a bufsize setting of 512 to maintain compatibility with applications
and name servers that did not support the EDNS0 extension.

Currently, CoreDNS no longer uses EDNS0 when forwarding requests
if the original client request did not use EDNS0, and so the
reasoning for using a bufsize setting of 512 no longer applies.
By increasing the bufsize setting to the recommended value of 1232 bytes,
we can enhance DNS performance by decreasing the probability of DNS
truncations.

This PR updates the bufsize setting for CoreDNS from 512 to 1232 bytes.
@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

/payload 4.14 nightly blocking

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: trigger 8 job(s) of type blocking for the nightly release of OCP 4.14

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-sdn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-serial
  • periodic-ci-openshift-release-master-ci-4.14-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-ovn-ipv6
  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a1b9a5e0-3c51-11ee-9da9-eff9ee711bbc-0

@openshift-ci-robot
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-15605, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86


@Miciah
Contributor

Miciah commented Aug 16, 2023

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2023
@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

failures to create the sandbox
/test e2e-aws-ovn-single-node

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 944f87d link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel] Seems unrelated
/test e2e-aws-ovn-single-node

@gcs278
Contributor Author

gcs278 commented Aug 16, 2023

I've reviewed the payload results: https://pr-payload-tests.ci.openshift.org/runs/ci/a1b9a5e0-3c51-11ee-9da9-eff9ee711bbc-0. 4 passes and 4 failures, but nothing stands out as a DNS regression yet.

The 4 failures:

The other 4 payload jobs succeeded.

It's probably worth kicking off these two jobs for major unrelated failures:
/payload-job periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm
/payload-job periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance

But, generally speaking, I don't see any reason to be concerned yet.

@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2023

@gcs278: trigger 2 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-nightly-4.14-e2e-metal-ipi-sdn-bm
  • periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/77e68c70-3c8c-11ee-8d44-3ef99b9b171f-0

@gcs278
Contributor Author

gcs278 commented Aug 17, 2023

2nd try of payload testing:

No indication that payload testing is failing due to this DNS update. I feel we are ready to merge.
/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 17, 2023
@openshift-merge-robot openshift-merge-robot merged commit 3f0692c into openshift:master Aug 17, 2023
9 checks passed
@openshift-ci-robot
Contributor

@gcs278: Jira Issue OCPBUGS-15605: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-15605 has been moved to the MODIFIED state.

