
Thanos ruler fails to start with "cannot unmarshal DNS message" errors #4204

Closed
simonpasquier opened this issue May 7, 2021 · 8 comments

@simonpasquier
Contributor

Thanos, Prometheus and Golang version used:

Thanos v0.19.0 and Prometheus v2.26.0.

Object Storage Provider:

N/A

What happened:

The Thanos ruler pods fail to start because Thanos can't resolve the SRV records for the Alertmanager endpoints (the logs show cannot unmarshal DNS message errors). The situation doesn't resolve by itself and the pods keep crashlooping.
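For context, the ruler learns its Alertmanager endpoints through DNS SRV discovery; a typical configuration (illustrative only, not necessarily how the OpenShift deployment wires it up) points the --alertmanagers.url flag at the SRV name that shows up in the logs:

thanos rule --alertmanagers.url=dnssrv+http://_web._tcp.alertmanager-operated.openshift-monitoring.svc

The dnssrv+ prefix is what triggers the SRV lookup that fails here.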

What you expected to happen:

Thanos ruler pods should start successfully and stay running.

How to reproduce it (as minimally and precisely as possible):

The issue can be seen with OpenShift 4.7 and 4.8 (at least) but I couldn't reproduce with vanilla Kubernetes v1.20 (installed with minikube).
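A minimal sketch (a hypothetical standalone program, not part of Thanos) of the same SRV lookup the ruler performs at startup; running it inside a pod on an affected cluster should surface the same error as in the ruler logs below:

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Same SRV name as in the ruler logs; run with GODEBUG=netdns=go to force
	// the pure Go resolver that Thanos used by default at the time.
	_, srvs, err := net.LookupSRV("web", "tcp", "alertmanager-operated.openshift-monitoring.svc")
	if err != nil {
		// On an affected cluster this prints the same
		// "cannot unmarshal DNS message" error seen in the ruler logs.
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, srv := range srvs {
		fmt.Printf("%s:%d\n", srv.Target, srv.Port)
	}
}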

Full logs to relevant components:

Logs

level=info ts=2021-04-26T09:29:07.621451887Z caller=head.go:668 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-04-26T09:29:07.62151742Z caller=head.go:682 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.122µs
level=info ts=2021-04-26T09:29:07.621536492Z caller=head.go:688 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-04-26T09:29:07.621793836Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=8
level=info ts=2021-04-26T09:29:07.622590298Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=8
level=info ts=2021-04-26T09:29:07.622718232Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=2 maxSegment=8
level=info ts=2021-04-26T09:29:07.622806952Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=3 maxSegment=8
level=info ts=2021-04-26T09:29:07.622910514Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=4 maxSegment=8
level=info ts=2021-04-26T09:29:07.623001901Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=5 maxSegment=8
level=info ts=2021-04-26T09:29:07.623086335Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=6 maxSegment=8
level=info ts=2021-04-26T09:29:07.623158403Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=7 maxSegment=8
level=info ts=2021-04-26T09:29:07.623226513Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=8 maxSegment=8
level=info ts=2021-04-26T09:29:07.623239198Z caller=head.go:745 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=54.077µs wal_replay_duration=1.637696ms total_replay_duration=1.724057ms
level=info ts=2021-04-26T09:29:07.624088245Z caller=rule.go:690 msg="a leftover lockfile found and removed"
level=info ts=2021-04-26T09:29:07.624727028Z caller=options.go:27 component=rules protocol=gRPC msg="enabling server side TLS"
level=info ts=2021-04-26T09:29:07.625803059Z caller=options.go:57 component=rules protocol=gRPC msg="server TLS client verification enabled"
level=info ts=2021-04-26T09:29:07.627861625Z caller=rule.go:672 component=rules msg="no supported bucket was configured, uploads will be disabled"
level=info ts=2021-04-26T09:29:07.627906584Z caller=rule.go:675 component=rules msg="starting rule node"
level=info ts=2021-04-26T09:29:07.628666449Z caller=intrumentation.go:60 component=rules msg="changing probe status" status=healthy
level=info ts=2021-04-26T09:29:07.628718519Z caller=http.go:62 component=rules service=http/server component=rule msg="listening for requests and metrics" address=localhost:10902
level=info ts=2021-04-26T09:29:07.628965519Z caller=rule.go:843 component=rules msg="reload rule files" numFiles=0
level=info ts=2021-04-26T09:29:07.629024444Z caller=intrumentation.go:48 component=rules msg="changing probe status" status=ready
level=info ts=2021-04-26T09:29:07.629177617Z caller=grpc.go:116 component=rules service=gRPC/server component=rule msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2021-04-26T09:29:07.630704234Z caller=manager.go:924 component=rules msg="Stopping rule manager..."
level=info ts=2021-04-26T09:29:07.630728465Z caller=manager.go:934 component=rules msg="Rule manager stopped"
level=info ts=2021-04-26T09:29:07.630738094Z caller=manager.go:924 component=rules msg="Stopping rule manager..."
level=info ts=2021-04-26T09:29:07.630745471Z caller=manager.go:934 component=rules msg="Rule manager stopped"
level=warn ts=2021-04-26T09:29:07.630756382Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630772592Z caller=grpc.go:123 component=rules service=gRPC/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630795263Z caller=grpc.go:136 component=rules service=gRPC/server component=rule msg="gracefully stopping internal server"
level=info ts=2021-04-26T09:29:07.630844746Z caller=grpc.go:149 component=rules service=gRPC/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-04-26T09:29:07.630863736Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630874516Z caller=http.go:69 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.632064051Z caller=http.go:88 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.632104514Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-04-26T09:29:07.632180151Z caller=main.go:156 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:156\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371"

Anything else we need to know:

As far as I can tell, this isn't a Thanos bug by itself since the root cause is probably golang/go#36718 (TL;DR: the Go resolver rejects DNS responses that don't comply exactly with the RFCs, e.g. compressed names in SRV resource records). It's also worth noting that standard tools like dig have no issue resolving the records.
We've worked around the issue by switching the DNS resolver to miekgdns (instead of the default native Go resolver), but it might be worth considering making miekgdns the default?
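To illustrate the workaround path (a rough sketch, not the actual Thanos code), the same query issued through github.com/miekg/dns, the library behind the miekgdns resolver option, succeeds where the native resolver fails:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Same SRV query as before, but built and parsed by miekg/dns, which
	// tolerates compressed names in SRV records.
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("_web._tcp.alertmanager-operated.openshift-monitoring.svc"), dns.TypeSRV)
	m.SetEdns0(4096, false) // accept UDP responses larger than 512 bytes

	c := new(dns.Client)
	// Nameserver address taken from the ruler logs; normally it would come from /etc/resolv.conf.
	resp, _, err := c.Exchange(m, "172.30.0.10:53")
	if err != nil {
		log.Fatalf("exchange failed: %v", err)
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("%s:%d\n", srv.Target, srv.Port)
		}
	}
}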

Additional details: https://bugzilla.redhat.com/show_bug.cgi?id=1953518

@wiardvanrij
Member

Thanks for your report. I assume you use the hidden flag query.sd-dns-resolver? Your suggestion is to make miekgdns the default?

I would be in favor of not changing the default DNS resolver. Many users have not changed it, and changing the default could create new issues for them, forcing them to switch the new default back to the resolver they were using before.

However, it might be valuable to unhide the flag and make it better known that this option exists, especially for OpenShift users.

Would love to hear your thoughts 👍

@simonpasquier
Contributor Author

Thanks for your report. I assume you use the hidden flag query.sd-dns-resolver? Your suggestion is to make miekgdns the default?

Yes, my rationale is that the native Go resolver currently has known limitations with SRV records (as already stated in #1015) and that people deploying Thanos on Kubernetes probably need to use miekgdns by default. Having said that, I fully understand that changing the default might trigger issues for another fraction of users.

However, it might be valuable to unhide the flag and make it better known that this option exists, especially for OpenShift users.

That would definitely work for me. FWIW we had to switch the default resolver to miekgdns in OpenShift because the prometheus operator has no support for the sd-dns-resolver flag. Making the flag public would at least make it possible to expose it in the ThanosRuler CRD.
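For illustration, the hidden querier flag discussed above would be set along these lines (a sketch only; the thread does not name the equivalent ruler-side flag, so just the querier form is shown):

thanos query --query.sd-dns-resolver=miekgdns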

@wiardvanrij
Member

wiardvanrij commented May 11, 2021

Just describing the two options that IMO should be discussed:

Question 1:

Should miekgdns be the default DNS resolver instead of the golang resolver?

Pros: seems like better integration, fewer quirks/bugs
Cons: might cause 'breaking changes' for users (shouldn't though..)

Question 2:

Should we keep sd-dns-resolver hidden?

Pros: less config overhead for regular usage
Cons: given that there are some quirks with the golang resolver, one could wonder whether we should be more upfront about this option and maybe even proactively suggest miekgdns?

@simonpasquier
Contributor Author

We've finally identified the change that caused the issue with the Thanos ruler pods. It was the cluster DNS operator that enabled the bufsize plugin for CoreDNS. As a result, the server started to send UDP messages > 512 bytes, which the native Go resolver can't handle (see openshift/cluster-dns-operator#266, golang/go#6464 and golang/go#13561). The plugin configuration was eventually modified in openshift/cluster-dns-operator#276 to avoid breaking clients that use the native Go resolver, but I suspect that other DNS server implementations (e.g. BIND, PowerDNS) can also be configured in ways that trip up the native Go resolver.
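To make the CoreDNS side concrete, the bufsize plugin is a one-line addition to the server block of the Corefile; a minimal sketch (the value and surrounding plugins are illustrative, not the exact OpenShift configuration):

.:53 {
    # bufsize controls the EDNS0 UDP buffer size; depending on the value,
    # responses can exceed the classic 512-byte UDP limit that the native
    # Go resolver expects (the failure mode described above).
    bufsize 512
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
}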

@bwplotka
Member

Nice, thanks for this!

Just to recap: the Miekg resolver was working fine for everyone, but the native Go one was not, right?

In this case, I would indeed unhide it and make it the default, TBH - it has proven to work well for many. We could actually make even better use of it, e.g. finally use the TTL metadata.

@simonpasquier
Contributor Author

Just to recap: the Miekg resolver was working fine for everyone, but the native Go one was not, right?

This is my understanding. I'd be 👍 on your idea @bwplotka

@stale

stale bot commented Sep 11, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 11, 2021
@simonpasquier
Contributor Author

Closed by #4519

squat pushed a commit that referenced this issue Dec 23, 2022
Related to issue: #4204

Problem: in the PR #4519 the
default sd-dns-resolver for querier was set to miekgdns but this change
was never ported to ruler

Solution: this PR brings this default to ruler as well to make it
consistent

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this issue Apr 17, 2023
Related to issue: thanos-io#4204

Problem: in the PR thanos-io#4519 the
default sd-dns-resolver for querier was set to miekgdns but this change
was never ported to ruler

Solution: this PR brings this default to ruler as well to make it
consistent

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>