
Thanos ruler fails to start with "cannot unmarshal DNS message" errors #4204

Closed
simonpasquier opened this issue May 7, 2021 · 8 comments

@simonpasquier
Contributor

Thanos, Prometheus and Golang version used:

Thanos v0.19.0 and Prometheus v2.26.0.

Object Storage Provider:

N/A

What happened:

The Thanos ruler pods fail to start because Thanos can't resolve the SRV records for the Alertmanager endpoints (the logs show cannot unmarshal DNS message errors). The situation doesn't resolve by itself and the pods keep crashlooping.
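For context, the ruler learns its Alertmanager endpoints through DNS SRV discovery; a typical configuration (illustrative only, not necessarily how the OpenShift deployment wires it up) points the --alertmanagers.url flag at the SRV name that shows up in the logs:

thanos rule --alertmanagers.url=dnssrv+http://_web._tcp.alertmanager-operated.openshift-monitoring.svc

The dnssrv+ prefix is what triggers the SRV lookup that fails here.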

What you expected to happen:

Thanos ruler pods should start successfully and stay running.

How to reproduce it (as minimally and precisely as possible):

The issue can be seen with OpenShift 4.7 and 4.8 (at least) but I couldn't reproduce with vanilla Kubernetes v1.20 (installed with minikube).
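A minimal sketch (a hypothetical standalone program, not part of Thanos) of the same SRV lookup the ruler performs at startup; running it inside a pod on an affected cluster should surface the same error as in the ruler logs below:

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Same SRV name as in the ruler logs; run with GODEBUG=netdns=go to force
	// the pure Go resolver that Thanos used by default at the time.
	_, srvs, err := net.LookupSRV("web", "tcp", "alertmanager-operated.openshift-monitoring.svc")
	if err != nil {
		// On an affected cluster this prints the same
		// "cannot unmarshal DNS message" error seen in the ruler logs.
		fmt.Fprintln(os.Stderr, "lookup failed:", err)
		os.Exit(1)
	}
	for _, srv := range srvs {
		fmt.Printf("%s:%d\n", srv.Target, srv.Port)
	}
}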

Full logs to relevant components:

Logs

level=info ts=2021-04-26T09:29:07.621451887Z caller=head.go:668 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2021-04-26T09:29:07.62151742Z caller=head.go:682 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=3.122µs
level=info ts=2021-04-26T09:29:07.621536492Z caller=head.go:688 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2021-04-26T09:29:07.621793836Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=8
level=info ts=2021-04-26T09:29:07.622590298Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=8
level=info ts=2021-04-26T09:29:07.622718232Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=2 maxSegment=8
level=info ts=2021-04-26T09:29:07.622806952Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=3 maxSegment=8
level=info ts=2021-04-26T09:29:07.622910514Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=4 maxSegment=8
level=info ts=2021-04-26T09:29:07.623001901Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=5 maxSegment=8
level=info ts=2021-04-26T09:29:07.623086335Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=6 maxSegment=8
level=info ts=2021-04-26T09:29:07.623158403Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=7 maxSegment=8
level=info ts=2021-04-26T09:29:07.623226513Z caller=head.go:740 component=tsdb msg="WAL segment loaded" segment=8 maxSegment=8
level=info ts=2021-04-26T09:29:07.623239198Z caller=head.go:745 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=54.077µs wal_replay_duration=1.637696ms total_replay_duration=1.724057ms
level=info ts=2021-04-26T09:29:07.624088245Z caller=rule.go:690 msg="a leftover lockfile found and removed"
level=info ts=2021-04-26T09:29:07.624727028Z caller=options.go:27 component=rules protocol=gRPC msg="enabling server side TLS"
level=info ts=2021-04-26T09:29:07.625803059Z caller=options.go:57 component=rules protocol=gRPC msg="server TLS client verification enabled"
level=info ts=2021-04-26T09:29:07.627861625Z caller=rule.go:672 component=rules msg="no supported bucket was configured, uploads will be disabled"
level=info ts=2021-04-26T09:29:07.627906584Z caller=rule.go:675 component=rules msg="starting rule node"
level=info ts=2021-04-26T09:29:07.628666449Z caller=intrumentation.go:60 component=rules msg="changing probe status" status=healthy
level=info ts=2021-04-26T09:29:07.628718519Z caller=http.go:62 component=rules service=http/server component=rule msg="listening for requests and metrics" address=localhost:10902
level=info ts=2021-04-26T09:29:07.628965519Z caller=rule.go:843 component=rules msg="reload rule files" numFiles=0
level=info ts=2021-04-26T09:29:07.629024444Z caller=intrumentation.go:48 component=rules msg="changing probe status" status=ready
level=info ts=2021-04-26T09:29:07.629177617Z caller=grpc.go:116 component=rules service=gRPC/server component=rule msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2021-04-26T09:29:07.630704234Z caller=manager.go:924 component=rules msg="Stopping rule manager..."
level=info ts=2021-04-26T09:29:07.630728465Z caller=manager.go:934 component=rules msg="Rule manager stopped"
level=info ts=2021-04-26T09:29:07.630738094Z caller=manager.go:924 component=rules msg="Stopping rule manager..."
level=info ts=2021-04-26T09:29:07.630745471Z caller=manager.go:934 component=rules msg="Rule manager stopped"
level=warn ts=2021-04-26T09:29:07.630756382Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630772592Z caller=grpc.go:123 component=rules service=gRPC/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630795263Z caller=grpc.go:136 component=rules service=gRPC/server component=rule msg="gracefully stopping internal server"
level=info ts=2021-04-26T09:29:07.630844746Z caller=grpc.go:149 component=rules service=gRPC/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=warn ts=2021-04-26T09:29:07.630863736Z caller=intrumentation.go:54 component=rules msg="changing probe status" status=not-ready reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.630874516Z caller=http.go:69 component=rules service=http/server component=rule msg="internal server is shutting down" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.632064051Z caller=http.go:88 component=rules service=http/server component=rule msg="internal server is shutdown gracefully" err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=info ts=2021-04-26T09:29:07.632104514Z caller=intrumentation.go:66 component=rules msg="changing probe status" status=not-healthy reason="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message"
level=error ts=2021-04-26T09:29:07.632180151Z caller=main.go:156 err="lookup SRV records \"_web._tcp.alertmanager-operated.openshift-monitoring.svc\": lookup _web._tcp.alertmanager-operated.openshift-monitoring.svc on 172.30.0.10:53: cannot unmarshal DNS message\nrule command failed\nmain.main\n\t/go/src/github.com/improbable-eng/thanos/cmd/thanos/main.go:156\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1371"

Anything else we need to know:

As far as I can tell, this isn't a Thanos bug by itself since the root cause is probably golang/go#36718 (TL;DR: the Go resolver rejects DNS responses that don't comply exactly with the RFCs, e.g. compressed names in SRV resource records). It's also worth noting that standard tools like dig have no issue resolving the records.
We've worked around the issue by switching the DNS resolver to miekgdns (instead of the default native Go resolver), but it might be worth considering making miekgdns the default?
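To illustrate the workaround path (a rough sketch, not the actual Thanos code), the same query issued through github.com/miekg/dns, the library behind the miekgdns resolver option, succeeds where the native resolver fails:

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Same SRV query as before, but built and parsed by miekg/dns, which
	// tolerates compressed names in SRV records.
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("_web._tcp.alertmanager-operated.openshift-monitoring.svc"), dns.TypeSRV)
	m.SetEdns0(4096, false) // accept UDP responses larger than 512 bytes

	c := new(dns.Client)
	// Nameserver address taken from the ruler logs; normally it would come from /etc/resolv.conf.
	resp, _, err := c.Exchange(m, "172.30.0.10:53")
	if err != nil {
		log.Fatalf("exchange failed: %v", err)
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("%s:%d\n", srv.Target, srv.Port)
		}
	}
}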

Additional details: https://bugzilla.redhat.com/show_bug.cgi?id=1953518

@wiardvanrij
Member

Thanks for your report. I assume you use the hidden flag query.sd-dns-resolver? Your suggestion is to make miekgdns the default?

I would be in favor of not changing the default DNS resolver. Many users have not changed it, and changing the default could create new issues for them, forcing them to switch the new default back to the resolver they were using before.

However, it might be valuable to unhide the flag and make it better known that this option exists, especially for OpenShift users.

Would love to hear your thoughts 👍

@simonpasquier
Contributor Author

Thanks for your report. I assume you use the hidden flag query.sd-dns-resolver? Your suggestion is to make miekgdns the default?

Yes, my rationale is that the native Go resolver currently has known limitations with SRV records (as already stated in #1015) and that people deploying Thanos on Kubernetes probably need to use miekgdns by default. Having said that, I fully understand that changing the default might trigger issues for another fraction of users.

However, it might be valuable to unhide the flag and make it better known that this option exists, especially for OpenShift users.

That would definitely work for me. FWIW we had to switch the default resolver to miekgdns in OpenShift because the prometheus operator has no support for the sd-dns-resolver flag. Making the flag public would at least make it possible to expose it in the ThanosRuler CRD.
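For illustration, the hidden querier flag discussed above would be set along these lines (a sketch only; the thread does not name the equivalent ruler-side flag, so just the querier form is shown):

thanos query --query.sd-dns-resolver=miekgdns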

@wiardvanrij
Member

wiardvanrij commented May 11, 2021

Just describing the two options that IMO should be discussed:

Question 1:

Should miekgdns be the default DNS resolver instead of the golang resolver?

Pros: seems like better integration, fewer quirks/bugs
Cons: might cause 'breaking changes' for users (shouldn't though..)

Question 2:

Should we keep sd-dns-resolver hidden?

Pros: less config overhead for regular usage
Cons: given that there are some quirks with the golang resolver, one could wonder whether we should be more upfront about this option and maybe even proactively suggest miekgdns?

@simonpasquier
Contributor Author

We've finally identified the change that caused the issue with the Thanos ruler pods. It was the cluster DNS operator that enabled the bufsize plugin for CoreDNS. As a result, the server started to send UDP messages > 512 bytes, which the native Go resolver can't handle (see openshift/cluster-dns-operator#266, golang/go#6464 and golang/go#13561). The plugin configuration was eventually modified in openshift/cluster-dns-operator#276 to avoid breaking clients that use the native Go resolver, but I suspect that other DNS server implementations (e.g. BIND, PowerDNS) can also be configured in ways that trip up the native Go resolver.
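To make the CoreDNS side concrete, the bufsize plugin is a one-line addition to the server block of the Corefile; a minimal sketch (the value and surrounding plugins are illustrative, not the exact OpenShift configuration):

.:53 {
    # bufsize controls the EDNS0 UDP buffer size; depending on the value,
    # responses can exceed the classic 512-byte UDP limit that the native
    # Go resolver expects (the failure mode described above).
    bufsize 512
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
    cache 30
}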

@bwplotka
Member

Nice, thanks for this!

Just to recap: the Miekg resolver was working fine for everyone, but the native Go one was not, right?

In this case, I would indeed unhide it and make it the default, TBH - it has proven to work well for many. We could actually make even better use of it, e.g. finally use the TTL metadata.

@simonpasquier
Contributor Author

Just to recap: the Miekg resolver was working fine for everyone, but the native Go one was not, right?

This is my understanding. I'd be 👍 on your idea @bwplotka

@stale

stale bot commented Sep 11, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 11, 2021
@simonpasquier
Contributor Author

Closed by #4519

squat pushed a commit that referenced this issue Dec 23, 2022
Related to issue: #4204

Problem: in the PR #4519 the
default sd-dns-resolver for querier was set to miekgdns but this change
was never ported to ruler

Solution: this PR brings this default to ruler as well to make it
consistent

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this issue Apr 17, 2023
Related to issue: thanos-io#4204

Problem: in the PR thanos-io#4519 the
default sd-dns-resolver for querier was set to miekgdns but this change
was never ported to ruler

Solution: this PR brings this default to ruler as well to make it
consistent

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>

Signed-off-by: JoaoBraveCoding <jmarcal@redhat.com>