OCPBUGS-22453: Fixed systemd-resolved's split dns config in OKD/FCOS #7634
/test okd-e2e-agent-compact-ipv4 |
@JM1: This pull request references Jira Issue OCPBUGS-22453, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test okd-e2e-aws |
OCP requires the DNS records api.<cluster_domain> and *.apps.<cluster_domain> to be externally resolvable (<cluster_domain> is <cluster_name>.<base_domain>). For SNO this list also includes the DNS record api-int.<cluster_domain>.

However, OCP does not enforce ownership of all subdomains of <cluster_domain>. For example, it is allowed to host a disconnected image registry at <registry_hostname>.<cluster_domain>, and OCP shall be able to resolve it using the user-supplied external DNS resolver.

PR openshift#7516 changed the systemd-resolved config of the bootstrap node / rendezvous host to associate the complete <cluster_domain> with the DNS server at 127.0.0.1, where CoreDNS is supposed to be listening.

When a disconnected image registry hosted at <registry_hostname>.<cluster_domain> is used for cluster installation and the bootstrap node / rendezvous host does not retrieve its domain from the DHCP server, the registry's DNS name cannot be resolved. That is because the disconnected registry must be reachable in order to pull the CoreDNS image, but the split DNS mechanism of systemd-resolved sends DNS requests for <registry_hostname>.<cluster_domain> to 127.0.0.1, where CoreDNS is expected to be running but is not.

When a bootstrap node / rendezvous host retrieves its domain <cluster_domain> from a DHCP server (e.g. dnsmasq's '--domain' option), systemd-resolved associates <cluster_domain> not only with 127.0.0.1 but also with the physical network interface, causing DNS requests for <registry_hostname>.<cluster_domain> to be sent to 127.0.0.1 as well as the external DNS resolver.

This patch mitigates the DNS issue for other network setups. It changes the systemd-resolved config to forward DNS requests to CoreDNS only for domains which are resolvable by CoreDNS:

* api.<cluster_domain>
* api-int.<cluster_domain>
* apps.<cluster_domain>

DNS requests for <registry_hostname>.<cluster_domain> and other subdomains of <cluster_domain> will be sent to the external DNS resolver.

Fixes openshift#7516
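The effect of narrowing the routing domains can be illustrated with a small Python sketch. It is not the code from this PR: the function names are hypothetical, the rendered `[Resolve]` fragment only assumes the standard `DNS=` and `Domains=` settings of systemd-resolved (with `~` marking routing-only domains), and the suffix match is a simplified model of how a name is matched against routing domains, not a reimplementation of systemd-resolved.

```python
from typing import List, Sequence


def coredns_routing_domains(cluster_domain: str) -> List[str]:
    """Domains that CoreDNS on 127.0.0.1 can answer, per the commit message."""
    return [f"{prefix}.{cluster_domain}" for prefix in ("api", "api-int", "apps")]


def render_resolved_dropin(domains: Sequence[str], dns_server: str = "127.0.0.1") -> str:
    """Render a hypothetical systemd-resolved drop-in with routing-only domains."""
    routing = " ".join(f"~{d}" for d in domains)
    return f"[Resolve]\nDNS={dns_server}\nDomains={routing}\n"


def routed_to_coredns(name: str, domains: Sequence[str]) -> bool:
    """Simplified model: a name is routed to 127.0.0.1 if it equals or is a
    subdomain of one of the routing domains."""
    name = name.rstrip(".").lower()
    return any(name == d or name.endswith("." + d) for d in domains)


# Example values are assumptions chosen for illustration only.
cluster_domain = "mycluster.example.com"
before = [cluster_domain]                         # whole cluster domain, as after openshift#7516
after = coredns_routing_domains(cluster_domain)   # narrowed list, as in this patch

print(render_resolved_dropin(after))
for host in ("api", "console.apps", "registry"):
    fqdn = f"{host}.{cluster_domain}"
    print(fqdn, "before:", routed_to_coredns(fqdn, before), "after:", routed_to_coredns(fqdn, after))
```

In this model the registry name is routed to 127.0.0.1 only under the old configuration, while the api and apps names are routed there in both cases.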
036f0bf to 5380ad9
@JM1: This pull request references Jira Issue OCPBUGS-22453, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
@andfasano @vrutkovs @LorbusChris Updated commit message and first PR message. Ready for review :) |
Thank you! |
/test okd-e2e-aws-ovn |
/test okd-e2e-vsphere |
/test okd-e2e-aws-ovn |
/test okd-e2e-vsphere |
/lgtm |
/retest-required |
/test okd-scos-e2e-aws-ovn |
/assign @honza |
Note: this patch will make the okd agent jobs in #7484 go green |
/retest |
/test e2e-metal-single-node-live-iso |
@elfosardo e2e-metal-single-node-live-iso successfully deploys OCP but fails in step baremetalds-sno-test. Those failures are unrelated to this change. |
/test e2e-metal-single-node-live-iso |
/test e2e-metal-single-node-live-iso Job e2e-metal-single-node-live-iso fails in step |
@JM1: The following tests failed, say
Note: that specific kind of error is due to the underlying CI infrastructure |
Job Again, this PR does NOT affect the OCP builds and tests. OCP uses RHCOS which does not even have systemd-resolved.service. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elfosardo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/label acknowledge-critical-fixes-only |
@JM1: Jira Issue OCPBUGS-22453: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-22453 has been moved to the MODIFIED state. In response to this:
[ART PR BUILD NOTIFIER] This PR has been included in build ose-installer-altinfra-container-v4.15.0-202311201833.p0.gb0f314b.assembly.stream for distgit ose-installer-altinfra. |