Terraform 1.3.1 (and 1.3.0) forcing and failing DNS resolution on IPv6 #31935
Comments
Let me add that I tried compiling Terraform to see if the flag Running a build with |
Thanks for the well-written issue. I agree this looks distinct from #31467, especially given that the issue seems to have appeared in v1.3.0. Terraform v1.3.0 was compiled with Go 1.19, whereas v1.2.9 uses Go 1.18, which could have caused some sort of regression, though I'm not sure why yet. When you ran |
go 1.19 |
After reading golang/go#52839, another potentially related Go issue, I gave building it a try. It's odd that an issue reported mainly by macOS users also affects WSL |
Interesting. The Would you mind running |
So here goes:
I realized I had done my tests with the binaries swapped. After rerunning them correctly, the build with CGO_ENABLED=1 runs well and the build with it disabled doesn't run at all. I noticed that your build script |
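A similar cgo-versus-pure-Go comparison can be made without rebuilding anything. Go's `net.Resolver` has a `PreferGo` field that requests the built-in resolver, and any Go binary honors `GODEBUG=netdns=go` or `GODEBUG=netdns=cgo` (and `GODEBUG=netdns=1` prints which resolver was chosen). `PreferGo` only changes anything in a binary built with cgo; with `CGO_ENABLED=0` the pure Go resolver is the only one available. A minimal sketch, where `lookupWith` is a hypothetical helper:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookupWith resolves host with the pure Go resolver (preferGo=true) or the
// platform default, labelling each address with its family.
// Hypothetical helper for illustration only.
func lookupWith(host string, preferGo bool) ([]string, error) {
	r := &net.Resolver{PreferGo: preferGo}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupIPAddr(ctx, host)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(addrs))
	for _, a := range addrs {
		family := "IPv6"
		if a.IP.To4() != nil {
			family = "IPv4"
		}
		out = append(out, family+" "+a.IP.String())
	}
	return out, nil
}

func main() {
	for _, preferGo := range []bool{true, false} {
		addrs, err := lookupWith("registry.terraform.io", preferGo)
		fmt.Printf("PreferGo=%v: %v (err=%v)\n", preferGo, addrs, err)
	}
}
```

If both runs return the same IPv4 and IPv6 addresses, the difference is more likely in how an address is chosen and dialed than in resolution itself.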
The Rather than try to force a resolver choice, using |
Happy to help sort this out. @jbardin
|
At the risk of piling on 😬 I notice that this situation seems a little different than our typical DNS resolution problems on macOS: It seems that the DNS lookup did actually succeed here, because the error message mentions trying to establish a TCP connection to the standard HTTPS port on the CDN we currently use. That means that Terraform did successfully look up a hostname, but apparently the response contained an IPv6 address (an

Virtualization/emulation layers like WSL on Windows and Rosetta on macOS can unfortunately add a bunch of extra unknowns compared to running on the native OS. For Windows users I'd typically recommend using the Windows builds of Terraform, which are designed to run on that platform, rather than the Linux builds, which are intended to run on standard Linux distributions. However, I understand that sometimes it's helpful to be able to use a Linux userspace alongside Terraform, and so if we can I'd like to figure out if there's something different about the WSL userspace compared to a typical Linux system that might be causing this different result, and see if we can adapt to it.

The pure Go resolver that we use on Linux typically ends up making DNS requests to the servers specified in |
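Which server the pure Go resolver talks to comes down to the `nameserver` lines of `/etc/resolv.conf`, the file @pacorreia mentions configuring later in this thread. A simplified sketch of extracting them; `nameservers` is a hypothetical helper, and Go's real parser also handles `options`, `search`, and defaults:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nameservers returns the addresses from "nameserver" lines of a
// resolv.conf-style text, in file order. Text after '#' or ';' is treated
// as a comment. Simplified for illustration.
func nameservers(conf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexAny(line, "#;"); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "nameserver" {
			out = append(out, fields[1])
		}
	}
	return out
}

func main() {
	// Hypothetical contents; 192.0.2.x are documentation addresses.
	conf := "# hypothetical example\nnameserver 192.0.2.1\n; nameserver 192.0.2.53\noptions edns0\n"
	fmt.Println(nameservers(conf)) // [192.0.2.1]
}
```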
Thanks. The nameserver is my home gateway, and both repos, ending in .io and .com, resolve fully and return IPv4 addresses. That's what puzzles me: it's the only tool I have in WSL behaving like this (starting with 1.3.0) |
So I did some extra searching on WSL2 and IPv6 support, and it seems it's not yet fully implemented: the kernel does not have the bits for IPv6 routing. There are other issues related to this, so that could explain why IPv6 isn't working, but it should still allow fallback to IPv4. I will see if I can compile a WSL2 kernel with IPv6 |
I found some DNS-related changes that seem to be new in Go 1.19:
The following upstream issues are related to these:
It's interesting to see that the participants in the issues above say that older versions of Terraform were not previously working in WSL, which seems to be the opposite of what this issue describes. (Admittedly nobody has reported that these changes did fix Terraform, so all we know right now is that some WSL systems cannot resolve names with Terraform v1.2 and earlier, and some WSL systems cannot resolve names with v1.3 and later but do work with v1.2 and earlier. It remains unclear whether these situations are connected.)

I have not yet done anything to confirm this, because I don't currently have access to a Windows system with WSL to test with, but my unsubstantiated theory based on the above is that something on the path between you and our DNS servers was trying to work around the problem that caused golang/go#44135 by omitting the IPv6 records from the response so the packets would be shorter, but now the Go resolver is allowing a longer response size, so that workaround no longer applies and the server is returning both the

I'm not sure yet how best to test this. Perhaps it would be possible to make a custom build of the Go toolchain that omits those particular commits and see if that works better, but that seems pretty finicky, so hopefully we can find a more convenient way to test this theory without creating any custom builds, such as monitoring the DNS requests and responses from both the working and non-working versions using a packet capture tool. |
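To make the EDNS(0) part of this theory concrete: the "EDNS" extension is an OPT pseudo-record in the query's additional section, and its CLASS field advertises the largest UDP response the client will accept (RFC 6891). The sketch below builds such a query by hand. `buildQuery` is a hypothetical helper, not Go's internal code, and 1232 is simply the payload size observed later in this thread:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	typeA   = 1  // A record (IPv4 address)
	typeOPT = 41 // EDNS(0) OPT pseudo-record, RFC 6891
	classIN = 1  // Internet class
)

// appendUint16 appends v in network (big-endian) byte order.
func appendUint16(b []byte, v uint16) []byte {
	return append(b, byte(v>>8), byte(v))
}

// buildQuery encodes a recursive query for name and qtype. When udpSize is
// non-zero it also appends an EDNS(0) OPT record to the additional section,
// advertising that the client accepts UDP responses up to udpSize bytes.
func buildQuery(id uint16, name string, qtype, udpSize uint16) ([]byte, error) {
	arcount := uint16(0)
	if udpSize > 0 {
		arcount = 1
	}
	msg := appendUint16(nil, id)
	msg = appendUint16(msg, 0x0100)  // flags: RD (recursion desired)
	msg = appendUint16(msg, 1)       // QDCOUNT
	msg = appendUint16(msg, 0)       // ANCOUNT
	msg = appendUint16(msg, 0)       // NSCOUNT
	msg = appendUint16(msg, arcount) // ARCOUNT

	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		if len(label) == 0 || len(label) > 63 {
			return nil, errors.New("invalid label in name: " + name)
		}
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0) // zero-length root label ends QNAME
	msg = appendUint16(msg, qtype)
	msg = appendUint16(msg, classIN)

	if udpSize > 0 {
		msg = append(msg, 0)             // NAME: root
		msg = appendUint16(msg, typeOPT) // TYPE
		msg = appendUint16(msg, udpSize) // CLASS carries the UDP payload size
		msg = append(msg, 0, 0, 0, 0)    // TTL: extended RCODE, version 0, flags
		msg = appendUint16(msg, 0)       // RDLENGTH: no EDNS options
	}
	return msg, nil
}

func main() {
	plain, _ := buildQuery(1, "registry.terraform.io", typeA, 0)
	edns, _ := buildQuery(1, "registry.terraform.io", typeA, 1232)
	fmt.Printf("without EDNS: %d bytes, with EDNS: %d bytes\n", len(plain), len(edns))
}
```

This prints `without EDNS: 39 bytes, with EDNS: 50 bytes`; the extra 11 bytes are the OPT record a packet capture would show under "Additional records".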
Today I tried the packet capture technique to quickly disprove my theory above, and I succeeded in disproving it. This does not seem to be the result of a change in the pure Go DNS resolver.

The rest of this is some details about what I did in case anyone wants to poke holes in my methodology. 😀 I downloaded and extracted the official
$ /tmp/terraform12 version
Terraform v1.2.9
on linux_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.1
Your version of Terraform is out of date! The latest version
is 1.3.2. You can update by downloading from https://www.terraform.io/downloads.html
$ /tmp/terraform13 version
Terraform v1.3.2
on linux_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.1
I'm using the
I'm working in a configuration that contains only a requirement for the
With each of those executables in turn, I:
After this I carefully inspected the query and answer packets related to

I can see both versions are sending the "EDNS" extension record, from which I conclude that both versions include net: send EDNS(0) packet length in DNS query. I also carefully compared the packets from both versions byte-for-byte. In both cases Terraform sent queries for both

The responses were not exactly identical, but as far as I can tell they only varied in ways that are reasonable: some of the records had a different TTL in one response than the other, and a couple of the results were returned in a different order.

Based on this, I'm concluding that there is no substantial difference in DNS resolver behavior between the official v1.2.9 and v1.3.2 builds, and therefore the cause for this difference in behavior must lie elsewhere.

I think the next area of interest is whatever logic in the Go network stack selects only one of the many different IPv4 and IPv6 addresses to try to connect to; I'm wondering if the network library is now giving higher preference to the IPv6 addresses than it used to, for some reason. |
Immediately after sending the previous message I realized I have skipped a step: I also intended to compare the results from a |
I have also now poked a hole in my own methodology: by monitoring outgoing packets I've been testing the behavior of I'm going to repeat what I did above while monitoring the communication between Terraform and |
Compiling the kernel with all the IPv6 flags was useless; it's still missing other bits for IPv6 routing |
Okay, some more interesting results now that I'm actually monitoring what I intended to monitor. 🙄 The EDNS extension packet is different in each case:
In the last case, the resolver implementation is the one from my own system's libc, which happens to be Ubuntu glibc 2.31-0ubuntu9.9, so that particular case is likely to vary on other systems with a different libc. I'm not sure whether the Ubuntu 20.04 image for WSL has the same libc, but I'm guessing it probably does, since I expect it is intended to be binary compatible with "normal" Ubuntu 20.04.

I think this puts the EDNS theory back on the table. However, my initial mistake did draw my attention to something I hadn't previously considered: If Ubuntu 20.04 in WSL also uses systemd and also has the

@pacorreia you previously stated that your
If you aren't using (Side note: phew, there are a lot of moving parts here! 😬 ) |
So, in WSL2, systemd isn't running by default unless you tweak some bits to fake it. In this case it definitely relies only on what I've set in resolv.conf; no other process is involved |
Okay, I think I've finally figured out what was going on upstream for these changes in each of the releases relevant to us:
|
Thanks for confirming, @pacorreia! Unfortunately it seems like it's going to be hard for me to successfully reproduce exactly what's true on your system, now that I know that any intermediate DNS resolver can potentially lower the advertised maximum packet size when it forwards a query. Even if I temporarily disabled

Do you think you have the necessary software, time, and expertise to try to reproduce what I was doing on your system? I can't give you exact instructions because the details are pretty fiddly and it would take me all evening to write it out 😬 (and I'm at the end of my work week now anyway), but here's a summary of what I was doing here:
If you're not able to try this for any reason then no worries... we can try to find a different way to investigate this. But if you can capture these packets and share what you learn then I think that'll be the most direct way to prove or disprove my theory without having to make any custom builds of Go and Terraform. I'm about to be away for a long weekend so I'll be quiet for a bit now, but my other colleagues on the team might jump in here if you're able to turn up something which is a good lead to tug on some more. Otherwise, I'll check back in next week. Thanks! |
Many, many thanks for this great work, and yes, I got the idea of what you did and I'll try to follow it. It's already 1:48 AM here; tomorrow I'll try to capture those results and share my conclusions 👌 |
@apparentlymart So I followed your trail and was able to capture traffic and look for the fields you mentioned. Indeed, for 1.2.9 no "Additional records" section appears, but for the 1.3.1 release there is one, and in my case the UDP payload size was 1232. |
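For anyone repeating this capture: the "Additional records" entry is the EDNS(0) OPT pseudo-record, and the advertised size is stored in its CLASS field. A hedged sketch of reading it from a raw DNS message, such as the UDP payload exported from a packet capture. `ednsUDPSize` is a hypothetical helper that only handles the common cases (for simplicity it accepts an OPT record in any section):

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errShort = errors.New("dns: message too short")

// skipName returns the offset just past the (possibly compressed) name at off.
func skipName(msg []byte, off int) (int, error) {
	for {
		if off >= len(msg) {
			return 0, errShort
		}
		n := int(msg[off])
		switch {
		case n == 0:
			return off + 1, nil
		case n&0xC0 == 0xC0: // compression pointer: two bytes, ends the name
			return off + 2, nil
		default:
			off += 1 + n
		}
	}
}

// ednsUDPSize reports the UDP payload size advertised by an EDNS(0) OPT
// record in msg, and whether such a record was present at all.
func ednsUDPSize(msg []byte) (uint16, bool, error) {
	if len(msg) < 12 {
		return 0, false, errShort
	}
	qd := int(binary.BigEndian.Uint16(msg[4:]))
	rrs := int(binary.BigEndian.Uint16(msg[6:])) +
		int(binary.BigEndian.Uint16(msg[8:])) +
		int(binary.BigEndian.Uint16(msg[10:]))
	off := 12
	var err error
	for i := 0; i < qd; i++ { // question: name, QTYPE, QCLASS
		if off, err = skipName(msg, off); err != nil {
			return 0, false, err
		}
		off += 4
		if off > len(msg) {
			return 0, false, errShort
		}
	}
	for i := 0; i < rrs; i++ { // answer, authority, additional records
		if off, err = skipName(msg, off); err != nil {
			return 0, false, err
		}
		if off+10 > len(msg) {
			return 0, false, errShort
		}
		rrType := binary.BigEndian.Uint16(msg[off:])
		class := binary.BigEndian.Uint16(msg[off+2:])
		if rrType == 41 { // OPT: CLASS holds the UDP payload size
			return class, true, nil
		}
		off += 10 + int(binary.BigEndian.Uint16(msg[off+8:]))
	}
	return 0, false, nil
}

func main() {
	// Hypothetical query for example.com with an OPT record advertising 1232.
	query := []byte{
		0xab, 0xcd, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01,
		7, 'e', 'x', 'a', 'm', 'p', 'l', 'e', 3, 'c', 'o', 'm', 0,
		0x00, 0x01, 0x00, 0x01, // QTYPE A, QCLASS IN
		0, 0x00, 0x29, 0x04, 0xd0, 0, 0, 0, 0, 0x00, 0x00, // OPT, 0x04d0 = 1232
	}
	size, ok, err := ednsUDPSize(query)
	fmt.Println(size, ok, err) // 1232 true <nil>
}
```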
I found the reason Terraform was failing for me in WSL2, although it shouldn't fail. By default, WSL2 sets up a NAT interface on the host and shares the host's Internet connection with the VMs. My setup was actually using my home gateway as the nameserver, bypassing the host NAT for WSL. As soon as I changed back to using the host NAT IP address as the nameserver, Terraform 1.3.0 and later versions started working. Still, I'm intrigued why the other configuration breaks the way Go gets and uses the results. I'll run the packet capture with this change and post the results later |
My conclusion is that as long as you use the default configuration, with the nameserver pointing to the host IP, it works. It's interesting that a different configuration breaks the normal behavior. I'll close this issue since it works under WSL2; it just doesn't like a custom DNS configuration |
Thanks for following up, @pacorreia! It does sound strange to me that using your normal resolver would cause different behavior, but I have a guess as to why: perhaps the intermediate resolver normally used in WSL knows that the WSL environment doesn't have a functioning IPv6 interface, and so it locally filters out the

By bypassing that local resolver you allowed Terraform to see that there is an IPv6 address available, and then Terraform tried to connect to it. It isn't clear to me yet why the usual fallback to IPv4 didn't work here, but I suspect it's an artifact of how WSL virtualizes the network connection: perhaps it responds differently than is typical for a slow or non-functional IPv6 connection, which then causes the Go network implementation to treat it as a fatal error rather than falling back to a different address.

I'm glad we have an explanation at least, even if it's an incomplete one, and that you found a configuration that works. Thanks again! |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Terraform Version
Terraform Configuration Files
Debug Output
https://gist.github.com/pacorreia/ad906b63a5884c31c451c7cfc7022042
Expected Behavior
terraform init -upgrade
Initializing the backend...
Initializing provider plugins...
Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.
Terraform has been successfully initialized!
You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.
If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
Actual Behavior
terraform init -upgrade
Initializing the backend...
Initializing provider plugins...
╷
│ Error: Failed to query available provider packages
│
│ Could not retrieve the list of available versions for provider hashicorp/null: could not query provider registry for registry.terraform.io/hashicorp/null: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/null/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
╵
╷
│ Error: Failed to query available provider packages
│
│ Could not retrieve the list of available versions for provider hashicorp/azurerm: could not query provider registry for registry.terraform.io/hashicorp/azurerm: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
Steps to Reproduce
Additional Context
The same system runs perfectly well with Terraform 1.2.9
More details:
Linux 5.10.102.1-microsoft-standard-WSL2 x86_64 GNU/Linux
I already set an IPv4 preference in /etc/gai.conf, without success.
Below is a GIF showing the issue and that I have connectivity:
References
Possibly linked