
internal/envoy: Set the dns lookup family on externalName type clusters #2894

Merged
2 commits merged into projectcontour:main on Sep 23, 2020

Conversation

stevesloka
Member

@stevesloka stevesloka commented Sep 9, 2020

Adds a config file option, Cluster.DnsLookupFamily, which allows users to define which DNS lookup family is used for any
externalName type cluster. This ensures that lookups to external resources resolve correctly.

Fixes #2873

Signed-off-by: Steve Sloka slokas@vmware.com
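As merged, the option lives under a new cluster section in Contour's configuration file. A minimal sketch, assuming the key names discussed in this PR (`cluster` / `dns-lookup-family`); supported values are `auto` (the default), `v4`, and `v6`:

```yaml
# contour.yaml — Contour configuration file (typically mounted via ConfigMap)
cluster:
  # DNS lookup family Envoy uses for externalName-type service clusters.
  # auto = try IPv6 first, then fall back to IPv4 (Envoy's default behavior)
  dns-lookup-family: v4
```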

@stevesloka stevesloka added this to the 1.9.0 milestone Sep 9, 2020
@codecov

codecov bot commented Sep 9, 2020

Codecov Report

Merging #2894 into main will increase coverage by 0.02%.
The diff coverage is 81.81%.


@@            Coverage Diff             @@
##             main    #2894      +/-   ##
==========================================
+ Coverage   74.91%   74.94%   +0.02%     
==========================================
  Files          87       87              
  Lines        5582     5604      +22     
==========================================
+ Hits         4182     4200      +18     
- Misses       1310     1314       +4     
  Partials       90       90              
Impacted Files Coverage Δ
cmd/contour/serve.go 1.83% <0.00%> (-0.03%) ⬇️
internal/dag/dag.go 96.84% <ø> (ø)
cmd/contour/servecontext.go 90.24% <100.00%> (+0.86%) ⬆️
internal/dag/httpproxy_processor.go 94.24% <100.00%> (+0.01%) ⬆️
internal/envoy/v2/cluster.go 100.00% <100.00%> (ø)

@jpeach
Contributor

jpeach commented Sep 9, 2020

@stevesloka Can you give more detail about why v4-only is the right approach here? What was the underlying problem?

@stevesloka
Member Author

@jpeach the issue describes the problem in more detail.

The TL;DR is: when configuring a route to an externalName service, requests aren't fulfilled properly and instead return 503 errors. When Envoy does a DNS lookup for some external resources, an IPv6 address is returned for the external resource and proxying fails with a 503. Setting this field ensures an IPv4 address is returned so Envoy proxies correctly. I tested this in a GKE cluster as well as my home lab; both showed the same issue and both were resolved with the same fix.

Chatting with @moderation in an Envoy issue (envoyproxy/envoy#13037 (comment)), he verified that moving to the v3 API (which we haven't done yet) resolved the issue, and that setting the dns lookup family might also help.

@skriss
Member

skriss commented Sep 10, 2020

Based on https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/cluster.proto#envoy-api-enum-cluster-dnslookupfamily, it sounds to me like this should be working properly (i.e. fallback to v4 resolution if v6 fails), so perhaps an Envoy bug?

I guess my only question re: this change is, are there any scenarios where specifying v4-only would cause issues for users by not supporting v6?

@jpeach
Contributor

jpeach commented Sep 10, 2020

> @jpeach the issue describes the problem in more detail.
>
> The TLDR is: When configuring a route to an externalName service, requests don't get fulfilled properly, but return 503 errors. When Envoy does a DNS lookup to some external resources, an ipv6 address is returned for the external resource and proxying fails with a 503 error. Setting this field ensures an IPv4 is returned so Envoy proxies correctly. I tested this in a GKE cluster as well as my home lab, both showed the same issue and was also resolved with the same fix.
>
> Chatting with @moderation in an Envoy issue (envoyproxy/envoy#13037 (comment)), he verified that v3 resolved the issue which we haven't yet moved to, or by setting the dns lookup family might also help.

Thanks for linking the Envoy issue. It's not clear to me from the issue whether the problem was that Envoy couldn't connect to the v6 target for some reason, or whether it hit an internal bug first. Possibly the bug is that it doesn't fall back to v4 on a connection failure?

@moderation

To clarify, I don't think changing to v3 fixed the issue; it was the addition of dns_lookup_family: V4_ONLY. Not sure hard-coding this to V4 makes sense. Although rare, I've heard of V6-only environments. Seems like something that should be configurable based on the environment where Contour is running. I think there could be quite a few permutations here too: a DNS that resolves V6 and V4 addresses but a network that is V4-only, a dual-stack network but with DNS that only resolves V4, etc.
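For reference, dns_lookup_family is a field on Envoy's cluster resource (linked from the Envoy docs above). A minimal sketch of a static cluster definition using it; the cluster name, hostname, and port here are placeholders, not values from this PR:

```yaml
clusters:
- name: external-service            # placeholder name
  connect_timeout: 1s
  type: STRICT_DNS                  # Envoy resolves the hostname itself
  dns_lookup_family: V4_ONLY        # only issue A queries; never AAAA
  load_assignment:
    cluster_name: external-service
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: external.example.com   # placeholder hostname
              port_value: 443
```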

@stevesloka
Member Author

I had some more discussion in the upstream Envoy issue and think this isn't 100% an Envoy issue. The Envoy setting dns_lookup_family determines how Envoy looks up DNS entries. The default, which Contour uses today, is AUTO, meaning Envoy will try to find an IPv6 address first, then fall back to IPv4 addresses.

Since some sites return IPv6 addresses, it's then up to the capabilities of the cluster's network to allow the routing to work. I think this is why we're seeing the errors in the referenced issue.

To fully resolve this, I think we should add a config setting which allows users to determine which value to pass to Envoy, be that ipv4, ipv6, or auto (which should remain the default).

@youngnick @skriss thoughts?
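The mapping proposed above can be sketched as a small validation function. This is a hypothetical illustration in Go (the function name and string return values are assumptions, not Contour's actual code — the real implementation sets go-control-plane's generated Cluster_DnsLookupFamily enum):

```go
package main

import "fmt"

// parseDNSLookupFamily maps a Contour config file value to the name of the
// corresponding Envoy Cluster.DnsLookupFamily enum value. Hypothetical
// sketch only; the shipped code works with generated protobuf enums.
func parseDNSLookupFamily(value string) (string, error) {
	switch value {
	case "v4":
		return "V4_ONLY", nil
	case "v6":
		return "V6_ONLY", nil
	case "auto", "":
		// AUTO remains the default: Envoy prefers IPv6 and falls back to IPv4.
		return "AUTO", nil
	default:
		return "", fmt.Errorf("invalid dns lookup family %q: must be auto, v4, or v6", value)
	}
}

func main() {
	for _, v := range []string{"auto", "v4", "v6", "bogus"} {
		family, err := parseDNSLookupFamily(v)
		fmt.Printf("%q -> %s (err=%v)\n", v, family, err)
	}
}
```

An unset value falling through to AUTO mirrors the "auto should remain the default" point above.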

@youngnick
Member

I think that's the right way to do this, thanks @stevesloka.

@stevesloka
Member Author

OK, this is updated with a new config file option. I named it "cluster", but I'm not sure whether that makes sense. This new setting defines what is used for dns_lookup_family on any outgoing request to an externalName type cluster.

@stevesloka stevesloka force-pushed the externalNameDNSLookup branch from 8c95549 to c4aa289 on September 16, 2020 17:57
@stevesloka stevesloka added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Sep 16, 2020
Member

@skriss skriss left a comment


Approach looks reasonable to me -- it seems unlikely that this would need to be toggled on a per-route basis so a global config file setting makes sense.

Need to update configuration.md as well.

cmd/contour/serve.go (outdated review thread, resolved)
cmd/contour/serve.go (outdated review thread, resolved)
internal/envoy/cluster_test.go (outdated review thread, resolved)
cmd/contour/servecontext_test.go (outdated review thread, resolved)
cmd/contour/servecontext.go (outdated review thread, resolved)

// ClusterConfig holds various configurable Envoy cluster values that can
// be set in the config file.
ClusterConfig `yaml:"cluster,omitempty"`
Member

I don't really have a better suggestion for naming here. I guess this could just be a top-level field in the config file too. 🤷‍♂️

Member Author

Yup that's where I started, but if we ever needed to add another "cluster" type config it's nice to have this section for it.

cmd/contour/serve.go (outdated review thread, resolved)
@skriss
Member

skriss commented Sep 17, 2020

Still need to update site/docs/main/configuration.md too.

@stevesloka stevesloka force-pushed the externalNameDNSLookup branch from 6cadac7 to e2730d0 on September 18, 2020 15:59
@stevesloka
Member Author

@skriss added configuration.md as well. =)

@stevesloka stevesloka requested a review from skriss September 18, 2020 16:01
@skriss
Member

skriss commented Sep 18, 2020

Changes look good; you've got some merge conflicts now.

@stevesloka stevesloka force-pushed the externalNameDNSLookup branch from e2730d0 to efa1a86 on September 18, 2020 18:13
@@ -1,33 +1,27 @@
<p>Packages:</p>
<ul>
<li>
<a href="#projectcontour.io%2fv1">projectcontour.io/v1</a>
<a href="#projectcontour.io%2fv1alpha1">projectcontour.io/v1alpha1</a>
Member Author

I don't know why this file was changed. Maybe another PR didn't update properly?

Member

Hmm, maybe an unstable sort within the doc generator? Looks like it switched the order of v1 and v1alpha1

Member

When I check out your branch and re-run the generate, it undoes this change.

Member

I think I spotted the issue in the API docs generator; TL;DR, it's using a map somewhere along the line, which potentially throws away the sort order.

Member

Maybe just remove this file from the PR since you didn't actually change anything in the API?

Member

@skriss skriss left a comment

LGTM.

It'd be nice to figure out how to resolve the API ref docs sorting issue separately now that we have more than one group; otherwise we're going to get a lot of noise from this going forward. We can try getting a patch upstream (looks like there might be an open PR with a fix that we could make some noise on); otherwise we may need to fork the tool.

@jpeach
Contributor

jpeach commented Sep 20, 2020

Are we sure that this makes sense as a global config option? To me, it seems likely that you may need to set this differently for different targets, which argues for an annotation. I could see the utility of setting a default in addition to that though.

@youngnick
Member

I think this is most likely to be an install-wide problem, caused by the shenanigans around AAAA and A records when you've got V4 and V6 around, so that's why I supported the global config first. I think that if we need more configurability, we can add it later, with this as the default.

@jpeach
Contributor

jpeach commented Sep 21, 2020

> I think this is most likely to be an install-wide problem, caused by the shenanigans around AAAA and A records when you've got V4 and V6 around, so that's why I supported the global config first. I think that if we need more configurability, we can add it later, with this as the default.

It's an external service, managed by an entirely separate org. That separate org is the one publishing DNS, and both the org and the intermediate network paths are factors in determining whether IPv6 connectivity works. It's extremely likely that only one out of many external services would exhibit similar problems. At least spin off an issue for the per-service config.

What was the root cause of the problem that triggered this PR? Was it a local install problem or an external service problem?

@youngnick
Member

You're right, it was an external service problem.

@stevesloka
Member Author

The problem I ran into is that the local cluster wouldn't route IPv6. The external service returned an IPv6 address first, since that's what we configured Envoy to do when using the default of "auto".

That said, this is a local cluster problem, nothing to do with the external one. I could see a separate issue to add per-service configs, but it's not needed today.

@youngnick
Member

So, I've thought about this some more, and I'm still supportive of this config being a global config, for two reasons:

  • it fixes the ExternalService issue as it's supposed to, when the ExternalService referenced has a V6 record and the Envoys can't route there.
  • it can also be used to improve DNS lookup performance in clusters that only do V4. It's pretty common for a large percentage of DNS queries in a V4-only cluster to be spurious AAAA ones, and this is a neat knob to help reduce that a bit.

Adds a config option for DnsLookupFamily allowing users to define what
dns lookup family is used on any cluster that is referenced via an
externalName type cluster. This ensures that lookups to external resources
are resolved correctly.

Fixes projectcontour#2873

Signed-off-by: Steve Sloka <slokas@vmware.com>
@stevesloka stevesloka force-pushed the externalNameDNSLookup branch from efa1a86 to 35e524b on September 23, 2020 15:31
@stevesloka stevesloka merged commit 12ace66 into projectcontour:main Sep 23, 2020
@stevesloka stevesloka deleted the externalNameDNSLookup branch September 23, 2020 18:34
Labels
release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

proxy to externalName service returns 503
5 participants