Split dns on talos machine config #7287

Open
btrepp opened this issue May 28, 2023 · 15 comments

Comments

@btrepp

btrepp commented May 28, 2023

Feature Request

Allow configuring certain domains to be forwarded to other DNS resolvers.

Description

I've been developing a Tailscale extension to allow Talos nodes to have Tailscale IPs (the long-term goal is to talk to backend services, such as storage, over a Tailscale network).

siderolabs/extensions#154

One of the issues is that it would be great to use Tailscale's MagicDNS, so you can write things like 'nas' in your config files and DNS will point you to the correct Tailscale machine.

Tailscale includes this; however, it tries to overwrite /etc/resolv.conf. This works great if I bind mount it, but when things go wrong, they go really wrong.

  • Ideally, this would be configurable in the machine config, so that Talos stays in control of DNS.
  • A workaround might be to run a DNS server as an extension and configure the machine config to forward to it, much like how Tailscale runs. But if that container is stopped, DNS stops, and the current upgrade path is to stop all services and then pull images, so upgrades wouldn't work.
  • Another option might be to mark some extension services as critical for networking, so they are restarted/stopped at different times and the update can still be performed.

Current workaround

At the moment you can run a DNS server externally and configure it however you wish, but that becomes more external infrastructure to maintain. Alternatively, you can use your Tailscale IPs directly, but then you have to make sure the IPs stay aligned (and if Talos wipes a disk, you get a new IP from Tailscale).

@smira
Member

smira commented May 29, 2023

Long-term I feel we should have system extensions which are critical and run always, and probably have a way to override/inject values into resolv.conf, but many pieces are missing at the moment.

For the registry endpoint, you can use registry mirror config to resolve it to a Tailscale IP, as these are assigned in a static way.

@michaelbeaumont

michaelbeaumont commented Aug 21, 2023

@btrepp Maybe you can clear up my confusion.. I appear to be able to use Split DNS with the extension. However, I'm running Talos in a VM on a host machine that is itself part of the tailnet. Could this be the reason Split DNS works, because DNS queries are forwarded outside of the VM to the host's DNS, which is configured with Split DNS?

Search Domains is the feature that fails, presumably because it requires edits to /etc/resolv.conf, even if it's running in said VM.

I create CP nodes named cp-0 with the tailscale extension and set the Kubernetes endpoint to be cp.ts. I've got CoreDNS running outside of Talos configured to answer with a CNAME pointing to cp-0.my-tailnet.ts.net when queried for cp.ts. This CoreDNS is configured for .ts using Split DNS. Everything seems to work... Is it going to go horribly wrong at some point, assuming I keep the VM on a host in the tailnet?

It's when I configure Search Domains for ts and use cp as the Kubernetes endpoint that something seems wrong, namely that although everything seems Healthy and the node is Ready, the node can't reach the API server at cp. Perhaps I could even configure libvirt's dnsmasq to include the search domain...
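For readers unfamiliar with search domains, here is a minimal sketch of how a conventional stub resolver expands a short name such as cp using a search list. It follows the usual resolv.conf semantics (search list plus an ndots threshold); whether Talos or the Tailscale extension applies the same rules is an assumption, and the function name `candidates` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the fully qualified names a conventional stub resolver
// would try for name, in order, given a search list and an ndots threshold.
// This models common resolv.conf behavior; it is not taken from Talos.
func candidates(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as absolute: no search list is applied.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}

	absolute := name + "."

	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+strings.Trim(domain, ".")+".")
	}

	// Names with at least ndots dots are tried as-is first.
	if strings.Count(name, ".") >= ndots {
		return append([]string{absolute}, expanded...)
	}

	// Short names go through the search list before being tried as-is.
	return append(expanded, absolute)
}

func main() {
	fmt.Println(candidates("cp", []string{"ts"}, 1))
	fmt.Println(candidates("cp.ts", []string{"ts"}, 1))
}
```

With a search list of ts, the short name cp is first tried as cp.ts., which is why the lookup only succeeds if the search list actually reaches whatever resolver the node uses.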

@btrepp
Author

btrepp commented Aug 21, 2023 via email

@github-actions

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 29, 2024
@michaelbeaumont

This would definitely still be a great feature!

@github-actions github-actions bot removed the Stale label Jun 30, 2024
@rgl
Contributor

rgl commented Aug 15, 2024

now that host-dns exists, maybe this is now possible to implement?

@smira
Member

smira commented Aug 15, 2024

It should work in main now with the Tailscale DNS endpoint being the first entry in nameservers and your recursive DNS resolver being the second.
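As a rough illustration of the ordering described here, the sketch below builds an RFC 6902 JSON patch that sets the nameserver list with the Tailscale endpoint first and a recursive resolver second. The path assumes the Talos machine config field machine.network.nameservers; the two addresses are the ones that appear later in this thread, and the helper `nameserverPatch` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON patch operation whose value is a
// list of strings.
type patchOp struct {
	Op    string   `json:"op"`
	Path  string   `json:"path"`
	Value []string `json:"value"`
}

// nameserverPatch returns a JSON patch that sets the nameserver list in
// the given order. The path is an assumption about the machine config
// layout (machine.network.nameservers).
func nameserverPatch(servers ...string) (string, error) {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/machine/network/nameservers",
		Value: servers,
	}}

	out, err := json.Marshal(ops)
	if err != nil {
		return "", err
	}

	return string(out), nil
}

func main() {
	// Tailscale resolver first, recursive resolver second.
	patch, err := nameserverPatch("100.100.100.100", "1.1.1.1")
	if err != nil {
		panic(err)
	}

	fmt.Println(patch)
}
```

Note that an "add" operation requires the parent object (/machine/network) to exist already, so the patch may need adjusting for a config without a network section.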

@rgl
Contributor

rgl commented Aug 15, 2024

does that mean that "Allow configuring certain domains to be forwarded to other DNS resolvers" is in main already (and not tied to tailscale)?

@smira
Member

smira commented Aug 15, 2024

I don't know what you're talking about, sorry. I have no idea about Tailscale, all I said is that split DNS should work in main now.

@rgl
Contributor

rgl commented Aug 15, 2024

I do not know about Tailscale either. Since you were the one who mentioned it, I wanted to clarify whether this feature was tied to Tailscale. From your answer, I will assume it's not tied to Tailscale. :-)

How do I configure this? The 1.8 docs at https://www.talos.dev/v1.8/talos-guides/network/host-dns/ do not seem to mention how to configure this feature.

@smira
Member

smira commented Aug 15, 2024

There is no feature at all; it will just correctly iterate over the configured nameservers if one returns NXDOMAIN/SERVFAIL.

@michaelbeaumont

@smira AFAICT this doesn't happen with NXDOMAIN

if resp != nil && (resp.Rcode == dns.RcodeServerFailure || resp.Rcode == dns.RcodeRefused) {
assuming we're talking about #9179
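To make the consequence of that condition concrete, here is a minimal, self-contained sketch of upstream fallback that moves to the next resolver only on SERVFAIL or REFUSED. The `response` and `upstream` types are hypothetical stand-ins rather than the Talos or miekg/dns types, and treating a transport error or empty reply as "try next" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// DNS response codes (numeric values as assigned in RFC 1035).
const (
	rcodeSuccess       = 0 // NOERROR
	rcodeServerFailure = 2 // SERVFAIL
	rcodeNameError     = 3 // NXDOMAIN
	rcodeRefused       = 5 // REFUSED
)

// response is a hypothetical, stripped-down DNS reply.
type response struct {
	Rcode  int
	Answer string
}

// upstream is a hypothetical resolver that answers a query for name.
type upstream func(name string) (*response, error)

// shouldTryNext mirrors the quoted condition: SERVFAIL and REFUSED fall
// through to the next upstream, NXDOMAIN does not. Falling through on a
// transport error or a nil reply is an assumption of this sketch.
func shouldTryNext(resp *response, err error) bool {
	if err != nil || resp == nil {
		return true
	}

	return resp.Rcode == rcodeServerFailure || resp.Rcode == rcodeRefused
}

// resolve asks each upstream in order and returns the first reply that
// does not trigger a fallback.
func resolve(upstreams []upstream, name string) (*response, error) {
	lastErr := errors.New("no upstreams configured")

	for _, up := range upstreams {
		resp, err := up(name)
		if !shouldTryNext(resp, err) {
			return resp, nil
		}

		if err != nil {
			lastErr = err
		} else if resp != nil {
			lastErr = fmt.Errorf("upstream returned rcode %d", resp.Rcode)
		}
	}

	return nil, lastErr
}

func main() {
	router := func(string) (*response, error) {
		return &response{Rcode: rcodeNameError}, nil
	}
	tailnet := func(string) (*response, error) {
		return &response{Rcode: rcodeSuccess, Answer: "100.90.80.70"}, nil
	}

	// The order of upstreams decides the outcome when the first one
	// answers NXDOMAIN.
	for _, order := range [][]upstream{{router, tailnet}, {tailnet, router}} {
		resp, err := resolve(order, "my-machine.my-network.ts.net")
		fmt.Println(resp.Rcode, resp.Answer, err)
	}
}
```

Under these assumptions, a public resolver that answers NXDOMAIN ahead of the tailnet resolver ends the lookup, which is why the ordering of upstreams matters so much in this thread.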

Is there anything standing in the way of just switching to coredns for node DNS as a separate service?

It's not possible to work around this either, because the order of resolvers doesn't appear to be totally under the user's control:

upstreams, err := safe.ReaderListAll[*network.DNSUpstream](ctx, r)
if err != nil {
	return fmt.Errorf("error getting resolver status: %w", err)
}
addrs, prxs := make([]string, 0, upstreams.Len()), make([]*proxy.Proxy, 0, upstreams.Len())
for it := upstreams.Iterator(); it.Next(); {
	prx := it.Value().TypedSpec().Value.Prx
	addrs = append(addrs, prx.Addr())
	prxs = append(prxs, prx.(*proxy.Proxy)) //nolint:forcetypeassert
}
if ctrl.handler.SetProxy(prxs) {

My router DNS seems to always show up first in the list, probably because it comes from DHCP before the machine config is applied.

@smira
Member

smira commented Sep 4, 2024

I believe DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, easy to fix).

The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

@michaelbeaumont

michaelbeaumont commented Sep 4, 2024

> I believe DNS server shouldn't return NXDOMAIN if it doesn't know about the domain, so the DNS server is wrong (if I'm wrong, easy to fix).

I do agree, just wanted to make it clear it doesn't work with NXDOMAIN, only SERVFAIL.

I think the issue is that Tailscale uses <machine-name>.<network-name>.ts.net as FQDNs but only returns records on its network-internal resolver. Since .ts.net is a real domain, Cloudflare, for example, will return NXDOMAIN. But the network-internal resolver returns the machine IP on the TS overlay network.

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1
;; AUTHORITY SECTION:
ts.net.			300	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.

;; Query time: 20 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2
;; ANSWER SECTION:
my-machine.my-network.ts.net. 600	IN	A	100.90.80.70

;; Query time: 0 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
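The two answers above are the case for per-domain forwarding as requested in this issue: names under ts.net need to go to the tailnet resolver, and everything else to the regular one. A minimal sketch of that routing decision follows; the `route` type and `pickResolver` function are hypothetical and do not reflect an existing Talos API.

```go
package main

import (
	"fmt"
	"strings"
)

// route maps a DNS suffix to the resolver that should answer for it.
type route struct {
	Suffix   string
	Resolver string
}

// pickResolver returns the resolver of the longest matching suffix, or
// def when no route matches. Matching is case-insensitive and only on
// whole labels, so "nots.net" does not match the suffix "ts.net".
func pickResolver(name string, routes []route, def string) string {
	name = strings.ToLower(strings.TrimSuffix(name, "."))
	best, bestLen := def, -1

	for _, r := range routes {
		suffix := strings.ToLower(strings.Trim(r.Suffix, "."))
		if suffix == "" {
			continue
		}

		if name == suffix || strings.HasSuffix(name, "."+suffix) {
			if len(suffix) > bestLen {
				best, bestLen = r.Resolver, len(suffix)
			}
		}
	}

	return best
}

func main() {
	routes := []route{{Suffix: "ts.net", Resolver: "100.100.100.100:53"}}

	fmt.Println(pickResolver("my-machine.my-network.ts.net", routes, "1.1.1.1:53"))
	fmt.Println(pickResolver("example.com", routes, "1.1.1.1:53"))
}
```

Routing by suffix avoids depending on response codes at all, so an NXDOMAIN from a public resolver can never shadow a tailnet name.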

> The DNS servers on initial boot before machine config is applied can be controlled via kernel cmdline, but the machine config overwrites any DNS servers configured by other means.

It doesn't, from my testing.

EDIT: removed irrelevant code refs

What I see:

❯ talosctl get resolverspec -o yaml
metadata:
    namespace: network
    type: ResolverSpecs.net.talos.dev
    id: resolvers
spec:
    dnsServers:
        - fd7a:115c:a1e0::53
        - 192.168.0.1
    layer: configuration
$ dig @fd7a:115c:a1e0::53 my-machine.my-network.ts.net
my-machine.my-network.ts.net. 600	IN	A	100.90.80.70
$ dig @169.254.116.108 my-machine.my-network.ts.net
ts.net.			10	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.
$ dig @192.168.0.1 my-machine.my-network.ts.net
ts.net.			10	IN	SOA	ns1.dnsimple.com. admin.dnsimple.com.

@smira
Member

smira commented Sep 4, 2024

It probably makes sense to create issues with a full description for both, as I don't quite understand your case.

Your tailnet resolver should come before the Cloudflare one.

DNS servers should be completely changeable with machine config.
