Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

singleInFlight can cause all future DNS requests to fail #1449

Closed
jameshartig opened this issue Apr 9, 2023 · 2 comments · Fixed by #1454
Closed

singleInFlight can cause all future DNS requests to fail #1449

jameshartig opened this issue Apr 9, 2023 · 2 comments · Fixed by #1454

Comments

@jameshartig
Copy link
Collaborator

Currently singleInFlight only takes into account the message and not the server address. I see the appeal because servers should return the same answer so why not group them. However, if a server is down then this has the unintentional consequence of causing all future DNS requests to fail.

Contrived example code
package main

import (
	"log"
	"time"

	"github.com/miekg/dns"
)

func main() {
	server := &dns.Server{
		Addr: ":9999",
		Net:  "udp",
		Handler: dns.HandlerFunc(func(w dns.ResponseWriter, r *dns.Msg) {
			m := new(dns.Msg)
			m.SetRcode(r, dns.RcodeSuccess)
			w.WriteMsg(m)
		}),
	}

	cfg := dns.ClientConfig{
		Servers: []string{
			"127.0.0.1:9999",
			"127.0.0.1:100",
		},
		Timeout: 1,
	}

	// let's pretend that the first server is temporarily down for an upgrade or
	// restart by starting the ListenAndServer purposefully late...

	newMsg := func() *dns.Msg {
		m := new(dns.Msg)
		m.SetQuestion("test.test.", dns.TypeA)
		return m
	}

	c := new(dns.Client)
	c.SingleInflight = true
	timeout := time.Duration(cfg.Timeout) * time.Second
	c.DialTimeout = timeout
	c.ReadTimeout = timeout
	c.WriteTimeout = timeout

	query := func() {
		for _, srv := range cfg.Servers {
			_, _, err := c.Exchange(newMsg(), srv)
			if err == nil {
				log.Printf("success from %v\n", srv)
				return
			}
			log.Printf("error from %v: %v\n", srv, err)
		}
	}

	go query()

	go func() {
		panic(server.ListenAndServe())
	}()

	for range time.Tick(500 * time.Millisecond) {
		go query()
	}
}

When running the above code you'll see:

2023/04/08 21:14:32 error from 127.0.0.1:9999: read udp 127.0.0.1:52618->127.0.0.1:9999: i/o timeout
2023/04/08 21:14:32 error from 127.0.0.1:9999: read udp 127.0.0.1:52618->127.0.0.1:9999: i/o timeout
2023/04/08 21:14:32 error from 127.0.0.1:9999: read udp 127.0.0.1:52618->127.0.0.1:9999: i/o timeout
2023/04/08 21:14:33 error from 127.0.0.1:100: read udp 127.0.0.1:52621->127.0.0.1:100: i/o timeout
2023/04/08 21:14:33 error from 127.0.0.1:100: read udp 127.0.0.1:52621->127.0.0.1:100: i/o timeout
2023/04/08 21:14:33 error from 127.0.0.1:9999: read udp 127.0.0.1:52621->127.0.0.1:100: i/o timeout
2023/04/08 21:14:33 error from 127.0.0.1:100: read udp 127.0.0.1:52621->127.0.0.1:100: i/o timeout
2023/04/08 21:14:33 error from 127.0.0.1:9999: read udp 127.0.0.1:52621->127.0.0.1:100: i/o timeout
2023/04/08 21:14:34 error from 127.0.0.1:9999: read udp 127.0.0.1:60130->127.0.0.1:100: i/o timeout
2023/04/08 21:14:34 error from 127.0.0.1:100: read udp 127.0.0.1:60130->127.0.0.1:100: i/o timeout
2023/04/08 21:14:34 error from 127.0.0.1:100: read udp 127.0.0.1:60130->127.0.0.1:100: i/o timeout
2023/04/08 21:14:34 error from 127.0.0.1:9999: read udp 127.0.0.1:60130->127.0.0.1:100: i/o timeout
2023/04/08 21:14:35 error from 127.0.0.1:9999: read udp 127.0.0.1:60134->127.0.0.1:100: i/o timeout
2023/04/08 21:14:35 error from 127.0.0.1:100: read udp 127.0.0.1:60134->127.0.0.1:100: i/o timeout
2023/04/08 21:14:35 error from 127.0.0.1:100: read udp 127.0.0.1:60134->127.0.0.1:100: i/o timeout
2023/04/08 21:14:35 error from 127.0.0.1:9999: read udp 127.0.0.1:60134->127.0.0.1:100: i/o timeout

Every query will forever fail despite only the second server being "down". This is because when it tries the first server a request to the second server is still timing out and then when it times out that goroutine tries the second server. The only way to get out of this is to wait for the second server to start working again or kill the program.

A simple fix would be to include the server's address in the group key. Alternatively, exchangeWithConnContext could check to see if the error returned from Do was for a different endpoint and immediately try again if so? But that could cause the Exchange function to take up to 2*Timeout which isn't ideal.

Finally, when SingleInflight is true, Exchange still calls Dial even if it will end up re-using an existing answer and just throwing away the connection without even using it. Any thoughts on tidying that up at the same time?

@tmthrgd
Copy link
Collaborator

tmthrgd commented Apr 9, 2023

I somewhat feel the whole SingleInflight code is too fragile and we might be better off removing it. It also has issues with contexts IIRC. It's also something callers can implement if they need it.

@miekg
Copy link
Owner

miekg commented Apr 10, 2023 via email

amery added a commit to darvaza-proxy/resolver that referenced this issue Dec 5, 2023
dns.Client.SingleInflight is a no-op.
See github.com/miekg/dns/issues/1449

Signed-off-by: Alejandro Mery <amery@jpi.io>
amery added a commit to darvaza-proxy/resolver that referenced this issue Dec 6, 2023
dns.Client.SingleInflight is a no-op.
See github.com/miekg/dns/issues/1449

Signed-off-by: Alejandro Mery <amery@jpi.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants