OCPBUGS-22453: Fixed systemd-resolved's split dns config in OKD/FCOS #7634

JM1 · 2023-10-26T14:45:02Z

OCP requires DNS records api.<cluster_domain> and *.apps.<cluster_domain> to be externally resolvable (<cluster_domain> is <cluster_name>.<base_domain>). For SNO this list also includes DNS record api-int.<cluster_domain>.

However, OCP does not enforce ownership of all subdomains of <cluster_domain>. For example, a disconnected image registry may be hosted at <registry_hostname>.<cluster_domain>, and OCP must be able to resolve it using the user-supplied external DNS resolver.

PR #7516 changed the systemd-resolved config of the bootstrap node / rendezvous host to associate the complete <cluster_domain> with the DNS server at 127.0.0.1 where CoreDNS is supposed to be listening.

If a disconnected image registry is used for cluster installation, the registry is hosted at <registry_hostname>.<cluster_domain>, and the bootstrap node / rendezvous host does not retrieve its domain from the DHCP server, then the registry's DNS name cannot be resolved. The cause is a circular dependency: the CoreDNS image has to be pulled from the disconnected registry, so the registry must be reachable before CoreDNS can start. However, the split DNS mechanism of systemd-resolved sends DNS requests for <registry_hostname>.<cluster_domain> to 127.0.0.1, where CoreDNS is expected to be listening but is not yet running.

When a bootstrap node / rendezvous host retrieves its domain <cluster_domain> from a DHCP server (e.g. via dnsmasq's --domain option), systemd-resolved associates <cluster_domain> not only with 127.0.0.1 but also with the physical network interface, causing DNS requests for <registry_hostname>.<cluster_domain> to be sent to 127.0.0.1 as well as to the external DNS resolver.
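The two cases above can be illustrated with a small, simplified model of suffix-based query routing. This is only a sketch and not systemd-resolved's actual implementation: it assumes that a query goes to the server(s) whose associated domain is the longest matching suffix, that ties go to all matching servers, and that unmatched names go to a default resolver. The names `route`, `matches`, the example domain `mycluster.example.com` and the external resolver address `192.0.2.53` are hypothetical.

```python
from typing import Dict, List

COREDNS = "127.0.0.1"    # local CoreDNS address named in the PR description
EXTERNAL = "192.0.2.53"  # hypothetical external resolver (documentation address)


def matches(name: str, domain: str) -> bool:
    """Return True if name equals domain or is a subdomain of it."""
    name = name.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return name == domain or name.endswith("." + domain)


def route(name: str, scopes: Dict[str, List[str]], default: str) -> List[str]:
    """Pick the servers that receive a query for name (simplified model).

    scopes maps a DNS server to the domains associated with it. The server
    with the longest matching domain wins; equally good matches all receive
    the query; names matching nothing go to the default server.
    """
    depth = {}
    for server, domains in scopes.items():
        hits = [d.count(".") + 1 for d in domains if matches(name, d)]
        if hits:
            depth[server] = max(hits)
    if not depth:
        return [default]
    top = max(depth.values())
    return sorted(s for s, n in depth.items() if n == top)


cluster = "mycluster.example.com"  # hypothetical <cluster_domain>
registry = "registry." + cluster   # hypothetical <registry_hostname>.<cluster_domain>

# Domain not provided by DHCP: only the CoreDNS address claims <cluster_domain>.
no_dhcp = {COREDNS: [cluster], EXTERNAL: []}
# Domain provided by DHCP: the physical interface claims <cluster_domain> too.
with_dhcp = {COREDNS: [cluster], EXTERNAL: [cluster]}

print(route(registry, no_dhcp, EXTERNAL))    # ['127.0.0.1']
print(route(registry, with_dhcp, EXTERNAL))  # ['127.0.0.1', '192.0.2.53']
```

In the first case the registry lookup reaches only the address where CoreDNS is not yet running; in the second it also reaches the external resolver, which is why that setup still works.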

This patch mitigates the DNS issue for other network setups. It changes the systemd-resolved config to forward DNS requests to CoreDNS only for domains which are resolvable by CoreDNS:

api.<cluster_domain>
api-int.<cluster_domain>
apps.<cluster_domain>

DNS requests for <registry_hostname>.<cluster_domain> and other subdomains of <cluster_domain> will be sent to the external DNS resolver.
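As a rough illustration of the narrowed configuration, the sketch below builds the three routing domains and renders them as a systemd-resolved drop-in using the documented `DNS=` and `Domains=` settings, where a `~` prefix marks a routing-only domain. The exact file contents and location used by the installer are not shown in this PR conversation, so the rendered text is an assumption, and the helper names and the example domain are hypothetical.

```python
from typing import List


def routing_domains(cluster_domain: str) -> List[str]:
    """Domains that CoreDNS can answer, as listed in the PR description."""
    return [f"{prefix}.{cluster_domain}" for prefix in ("api", "api-int", "apps")]


def render_dropin(cluster_domain: str, dns_server: str = "127.0.0.1") -> str:
    """Render a resolved.conf-style drop-in (assumed layout, not the real template)."""
    domains = " ".join("~" + d for d in routing_domains(cluster_domain))
    return f"[Resolve]\nDNS={dns_server}\nDomains={domains}\n"


def handled_by_coredns(name: str, cluster_domain: str) -> bool:
    """True if name equals or lies below one of the routing domains."""
    name = name.rstrip(".").lower()
    return any(
        name == d or name.endswith("." + d)
        for d in routing_domains(cluster_domain)
    )


cluster = "mycluster.example.com"  # hypothetical <cluster_domain>

print(render_dropin(cluster))
for host in ("api", "api-int", "console.apps", "registry"):
    fqdn = f"{host}.{cluster}"
    print(fqdn, "-> CoreDNS" if handled_by_coredns(fqdn, cluster) else "-> external resolver")
```

Under these assumptions only names at or below the three listed domains are directed to 127.0.0.1, while the registry name falls through to the external resolver.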

Fixes #7516

JM1 · 2023-10-26T19:30:13Z

/test okd-e2e-agent-compact-ipv4
/test okd-e2e-agent-sno-ipv6
/test okd-scos-e2e-aws-ovn

openshift-ci-robot · 2023-10-26T19:49:03Z

@JM1: This pull request references Jira Issue OCPBUGS-22453, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.15.0) matches configured target version for branch (4.15.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Draft. Commit message will be updated later.

Follow up to #7516

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vrutkovs · 2023-10-27T06:31:45Z

/test okd-e2e-aws
/test okd-e2e-vsphere


openshift-ci-robot · 2023-10-27T08:16:09Z

@JM1: This pull request references Jira Issue OCPBUGS-22453, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.15.0) matches configured target version for branch (4.15.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei


JM1 · 2023-10-27T08:19:26Z

@andfasano @vrutkovs @LorbusChris Updated commit message and first PR message. Ready for review :)

vrutkovs · 2023-10-27T08:23:42Z

Thank you!
/lgtm

vrutkovs · 2023-10-27T08:39:19Z

/test okd-e2e-aws-ovn
/test okd-e2e-vsphere
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-vsphere

vrutkovs · 2023-10-27T08:47:44Z

/test okd-e2e-vsphere
/test okd-scos-e2e-vsphere

vrutkovs · 2023-10-27T09:45:16Z

/test okd-e2e-aws-ovn
/test okd-e2e-vsphere
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-vsphere

vrutkovs · 2023-10-27T12:12:14Z

/test okd-e2e-vsphere
/test okd-scos-e2e-aws-ovn
/test okd-scos-e2e-vsphere

LorbusChris · 2023-10-27T13:01:53Z

/lgtm
although another PR is required before SCOS will go green: #7636

JM1 · 2023-11-11T13:27:07Z

/retest-required

LorbusChris · 2023-11-13T09:52:13Z

/test okd-scos-e2e-aws-ovn

LorbusChris · 2023-11-13T09:52:41Z

/assign @honza
for approval

andfasano · 2023-11-14T09:24:54Z

Note: this patch will allow the OKD agent jobs in #7484 to go green.

elfosardo · 2023-11-14T09:43:43Z

/retest

elfosardo · 2023-11-14T13:46:26Z

/test e2e-metal-single-node-live-iso

JM1 · 2023-11-15T09:11:45Z

@elfosardo e2e-metal-single-node-live-iso successfully deploys OCP but fails in step baremetalds-sno-test. Those failures are unrelated to this change.

andfasano · 2023-11-15T17:53:42Z

/test e2e-metal-single-node-live-iso

JM1 · 2023-11-16T09:39:27Z

/test e2e-metal-single-node-live-iso

Job e2e-metal-single-node-live-iso fails in step baremetalds-packet-setup with "baremetalds: Failed to create equinix device: ipi-ci-op-***". Why exactly is this job relevant for this PR? This OCP job does not touch the code of this PR.

openshift-ci · 2023-11-16T11:40:26Z

@JM1: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-e2e-agent-compact-ipv4 | 036f0bf47dc3954faf3cdca78674509476b55658 | link | false | `/test okd-e2e-agent-compact-ipv4` |
| ci/prow/okd-e2e-agent-sno-ipv6 | 036f0bf47dc3954faf3cdca78674509476b55658 | link | false | `/test okd-e2e-agent-sno-ipv6` |
| ci/prow/okd-e2e-vsphere | `5380ad9` | link | false | `/test okd-e2e-vsphere` |
| ci/prow/okd-scos-e2e-vsphere | `5380ad9` | link | false | `/test okd-scos-e2e-vsphere` |
| ci/prow/e2e-metal-single-node-live-iso | `5380ad9` | link | false | `/test e2e-metal-single-node-live-iso` |


andfasano · 2023-11-16T11:40:57Z

/test e2e-metal-single-node-live-iso

Job e2e-metal-single-node-live-iso fails in step baremetalds-packet-setup with "baremetalds: Failed to create equinix device: ipi-ci-op-***". Why exactly is this job relevant for this PR? This OCP job does not touch the code of this PR.

Note: that specific kind of error is due to the underlying CI infrastructure.

JM1 · 2023-11-17T08:05:03Z

Job e2e-metal-single-node-live-iso fails again in e2e tests which are unrelated to this change.

Again, this PR does NOT affect the OCP builds and tests. OCP uses RHCOS which does not even have systemd-resolved.service.

elfosardo · 2023-11-20T11:05:40Z

/approve

openshift-ci · 2023-11-20T11:06:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elfosardo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~data/data/bootstrap/baremetal/OWNERS~~ [elfosardo]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JM1 · 2023-11-20T12:21:47Z

/label acknowledge-critical-fixes-only

openshift-ci-robot · 2023-11-20T16:06:32Z

@JM1: Jira Issue OCPBUGS-22453: All pull requests linked via external trackers have merged:

openshift/installer#7634

Jira Issue OCPBUGS-22453 has been moved to the MODIFIED state.


openshift-bot · 2023-11-20T20:27:29Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.15.0-202311201833.p0.gb0f314b.assembly.stream for distgit ose-installer-altinfra.
All builds following this will include this PR.

openshift-ci bot requested review from elfosardo and honza October 26, 2023 14:50

JM1 mentioned this pull request Oct 26, 2023

OCPBUGS-19303: Changed OKD/FCOS workaround to also support Agent-based Installer #7484

Merged

JM1 changed the title ~~[DNM] Fixed systemd-resolved's split dns config in OKD/FCOS~~ OCPBUGS-22453: Fixed systemd-resolved's split dns config in OKD/FCOS Oct 26, 2023

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Oct 26, 2023

openshift-ci bot requested a review from gpei October 26, 2023 19:49

JM1 force-pushed the okd-split-dns-fix-follow-up branch from 036f0bf to 5380ad9 Compare October 27, 2023 08:14

openshift-ci bot assigned vrutkovs Oct 27, 2023

openshift-ci bot added the lgtm label Oct 27, 2023

openshift-ci bot assigned LorbusChris Oct 27, 2023

JM1 mentioned this pull request Oct 27, 2023

[DNM] OKD: Combined test of PR #7484 and PR #7634 #7641

Closed

openshift-ci bot assigned honza Nov 13, 2023

openshift-ci bot added the approved label Nov 20, 2023

openshift-ci bot added the acknowledge-critical-fixes-only label Nov 20, 2023

openshift-merge-bot bot merged commit b0f314b into openshift:master Nov 20, 2023
