Terraform 1.3.1 (and 1.3.0) forcing and failing DNS resolution on IPv6 #31935

Closed
pacorreia opened this issue Oct 4, 2022 · 27 comments
Labels: bug, new (new issue not yet triaged), v1.3 (issues, primarily bugs, reported against v1.3 releases)

@pacorreia

Terraform Version

Terraform v1.3.1
on linux_amd64

Terraform Configuration Files

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.25.0"
    }
  }

  required_version = ">= 1.2.9"
}

provider "null" {}
provider "azurerm" {
  features {}
}

variable "test" {
  type = map(string)

  default = {
    test = 1
    key  = 2
  }
}

Debug Output

https://gist.github.com/pacorreia/ad906b63a5884c31c451c7cfc7022042

Expected Behavior

terraform init -upgrade

Initializing the backend...

Initializing provider plugins...

  • Finding hashicorp/azurerm versions matching "3.25.0"...
  • Finding latest version of hashicorp/null...
  • Installing hashicorp/azurerm v3.25.0...
  • Installed hashicorp/azurerm v3.25.0 (signed by HashiCorp)
  • Installing hashicorp/null v3.1.1...
  • Installed hashicorp/null v3.1.1 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Actual Behavior

terraform init -upgrade

Initializing the backend...

Initializing provider plugins...

  • Finding latest version of hashicorp/null...
  • Finding hashicorp/azurerm versions matching "3.25.0"...

    │ Error: Failed to query available provider packages

    │ Could not retrieve the list of available versions for provider hashicorp/null: could not query provider registry for registry.terraform.io/hashicorp/null: the request failed after 2 attempts, please try again later: Get
    │ "https://registry.terraform.io/v1/providers/hashicorp/null/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable


│ Error: Failed to query available provider packages

│ Could not retrieve the list of available versions for provider hashicorp/azurerm: could not query provider registry for registry.terraform.io/hashicorp/azurerm: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable

Steps to Reproduce

  1. Install Terraform 1.3.1 onto WSL 2 (Ubuntu 20.04)
  2. Create a config with one or more providers available in the HashiCorp registry
  3. Run terraform init

Additional Context

The same system runs perfectly well with terraform 1.2.9

More details:
Linux 5.10.102.1-microsoft-standard-WSL2 x86_64 GNU/Linux

I already set an IPv4 preference in /etc/gai.conf, but without success.

Below is a GIF showing the issue and that I have connectivity:
[GIF: tf-dns-issue]

References

Possibly linked

@pacorreia added the bug and new labels Oct 4, 2022
@kmoe added the v1.3 label Oct 4, 2022
@pacorreia
Author

Let me add that I tried compiling Terraform myself to see whether the flag CGO_ENABLED=0 would make a difference.

Running a build with CGO_ENABLED=0 go build . produced a 1.3.1 binary that actually works.

@kmoe
Member

kmoe commented Oct 4, 2022

Thanks for the well-written issue. I agree this looks distinct from #31467, especially given that the issue seems to have appeared in v1.3.0. Terraform v1.3.0 was compiled with Go 1.19, whereas v1.2.9 uses Go 1.18, which could have caused some sort of regression, though I'm not sure why yet.

When you ran CGO_ENABLED=0 go build ., which version of Go did you use?

@pacorreia
Author

go 1.19

@pacorreia
Author

After reading golang/go#52839, another potentially related Go issue, I tried building Terraform myself.

It's odd that an issue reported mainly by macOS users also affects WSL.

@kmoe
Member

kmoe commented Oct 4, 2022

Interesting. The CGO_ENABLED=0 go build . command should have produced a v1.3.1 binary identical (w.r.t. CGO at least...) to the released binary.

Would you mind running GODEBUG=netdns=cgo+2 terraform init using both of your v1.3.1 binary versions and pasting the first few lines of the output?

@pacorreia
Author

pacorreia commented Oct 4, 2022

So here it goes:

  • CGO_ENABLED=0
Initializing the backend...

Initializing provider plugins...
- Finding latest version of hashicorp/null...
go package net: confVal.netCgo = true  netGo = true
go package net: built with netgo build tag; using Go's DNS resolver
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
- Finding hashicorp/azurerm versions matching "3.25.0"...
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/null: could not query provider registry for registry.terraform.io/hashicorp/null: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/null/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
╵

╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/azurerm: could not query provider registry for registry.terraform.io/hashicorp/azurerm: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
  • CGO_ENABLED=1
Initializing the backend...

Initializing provider plugins...
- Finding hashicorp/azurerm versions matching "3.25.0"...
go package net: confVal.netCgo = true  netGo = false
go package net: using cgo DNS resolver
go package net: hostLookupOrder(registry.terraform.io) = cgo
go package net: hostLookupOrder(registry.terraform.io) = cgo
- Finding latest version of hashicorp/null...
go package net: hostLookupOrder(registry.terraform.io) = cgo
go package net: hostLookupOrder(registry.terraform.io) = cgo
go package net: hostLookupOrder(releases.hashicorp.com) = cgo
- Installing hashicorp/azurerm v3.25.0...
go package net: hostLookupOrder(releases.hashicorp.com) = cgo
- Installed hashicorp/azurerm v3.25.0 (signed by HashiCorp)
go package net: hostLookupOrder(registry.terraform.io) = cgo
go package net: hostLookupOrder(releases.hashicorp.com) = cgo
- Installing hashicorp/null v3.1.1...
go package net: hostLookupOrder(releases.hashicorp.com) = cgo
- Installed hashicorp/null v3.1.1 (signed by HashiCorp)
  • With binary published on github:
Initializing the backend...

Initializing provider plugins...
- Finding hashicorp/azurerm versions matching "3.25.0"...
go package net: confVal.netCgo = true  netGo = true
go package net: built with netgo build tag; using Go's DNS resolver
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
- Finding latest version of hashicorp/null...
go package net: hostLookupOrder(registry.terraform.io) = files,dns
go package net: hostLookupOrder(registry.terraform.io) = files,dns
╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/azurerm: could not query provider registry for registry.terraform.io/hashicorp/azurerm: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
╵

╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/null: could not query provider registry for registry.terraform.io/hashicorp/null: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/null/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable

I realized I had run my tests with the binaries switched, so I have now redone the runs correctly: the build with CGO_ENABLED=1 runs well, and the build with it disabled doesn't work at all.

I noticed that your build script build.sh has export CGO_ENABLED=0 at line 36, but that has been there for quite some time.

@jbardin
Member

jbardin commented Oct 4, 2022

The build.sh script is an old artifact, and probably should be removed. The linux build process however still uses CGO_ENABLED=0 by default. Unlike some platforms like solaris and darwin, linux operating systems do not have a universally portable libc implementation, so building with cgo by default for linux is probably not an option.

Rather than try to force a resolver choice, using GODEBUG=netdns=1 with cgo enabled may show why the cgo resolver must be picked over the netgo resolver (which should still be the default on linux even with cgo). Once we know the local system configuration which causes the change in behavior, we may be able to make a decision on how to support this configuration moving forward.

@jbardin jbardin closed this as completed Oct 4, 2022
@jbardin jbardin reopened this Oct 4, 2022
@pacorreia
Author

pacorreia commented Oct 4, 2022

Happy to help sort this out.

@jbardin, here is the result of using GODEBUG=netdns=1:

Initializing the backend...

Initializing provider plugins...
- Reusing previous version of hashicorp/null from the dependency lock file
go package net: built with netgo build tag; using Go's DNS resolver
- Reusing previous version of hashicorp/azurerm from the dependency lock file
╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/null: could not query provider registry for registry.terraform.io/hashicorp/null: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/null/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable
╵

╷
│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider hashicorp/azurerm: could not query provider registry for registry.terraform.io/hashicorp/azurerm: the request failed after 2 attempts, please try again later: Get
│ "https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions": dial tcp [2a04:4e42:86::561]:443: connect: network is unreachable

@apparentlymart
Contributor

At the risk of piling on 😬 I notice that this situation seems a little different than our typical DNS resolution problems on macOS:

It seems that the DNS lookup did actually succeed here, because the error message mentions trying to establish a TCP connection to the standard HTTPS port on the CDN we currently use. That means that Terraform did successfully look up a hostname, but apparently the response contained an IPv6 address (an AAAA record) which Terraform then attempted to connect to. Since Terraform implements the fast fallback algorithm, I assume this means that the DNS result only included an AAAA record and not also an A record, which is strange.

Virtualization/emulation layers like WSL on Windows and Rosetta on macOS can unfortunately add a bunch of extra unknowns compared to running on the native OS. For Windows users I'd typically recommend using the Windows builds of Terraform which are designed to run on that platform, rather than the Linux builds which are intended to run on standard Linux distributions. However, I understand that sometimes it's helpful to be able to use a Linux userspace alongside Terraform, and so if we can I'd like to figure out if there's something different about the WSL userspace compared to a typical Linux system that might be causing this different result, and see if we can adapt to it.

The pure Go resolver that we use on Linux typically ends up making DNS requests to the servers specified in /etc/resolv.conf, so it might be interesting to look in there and see if the DNS server addresses listed in that file seem realistic and whether you can replicate the results Terraform is seeing by querying those DNS servers directly. (For example, if you have the "host" tool installed then you could run host -v registry.terraform.io 8.8.8.8 to query the nameserver at 8.8.8.8. Put each of the nameserver addresses from /etc/resolv.conf in place of that IP address to see if they are all returning consistent results and if you see both A (IPv4) and AAAA (IPv6) records.)

@pacorreia
Author

pacorreia commented Oct 4, 2022

Thanks. The nameserver is my home gateway, and both hosts, the ones ending in .io and .com, resolve fully and return IPv4 addresses.

That's what puzzles me: it's the only tool I have in WSL that behaves like this (starting with 1.3.0).

[image]

@pacorreia
Author

So I did some extra searching on WSL2 and IPv6 support, and it seems it's not yet fully implemented: the kernel does not have the bits for IPv6 routing.

There are other issues related to this, so that could explain why IPv6 isn't working. Still, it should allow fallback to IPv4.

microsoft/WSL#5855

I will try to see if I can compile a WSL2 kernel with IPv6.

@apparentlymart
Contributor

apparentlymart commented Oct 5, 2022

I found some DNS-related changes that seem to be new in Go 1.19:

  • net: send EDNS(0) packet length in DNS query: this seems relatively innocuous, but I suppose it's plausible that advertising a larger supported response packet size invites some resolvers to return different answers that wouldn't have otherwise fit in the older size. AAAA records are larger than A records.

    This change links to a page about DNS flag day 2020. That page is no longer available, but it formerly linked to IPv6, Large UDP Packets and the DNS which discusses that smaller DNS packet sizes do indeed cause fragmentation for IPv6-related requests, so it seems plausible that some DNS servers would answer differently depending on the packet size in order to sidestep that overhead.

    However, I think this change was also backported into a Go 1.18 patch release, so our later Terraform v1.2 series releases may also have this change. (I haven't checked.)

  • net: increase maximum accepted DNS packet to 1232 bytes seems to be a successor to the previous one that wasn't backported to Go 1.18, because it was apparently more risky.

    This one specifically mentions that it aims to alter behavior for WSL, although of course it intends to make the behaviour better on WSL rather than worse, but it may have had a similar unintended side-effect of making DNS servers answer differently when allowed to send larger packets.

The following upstream issues are related to these:

It's interesting to see that the participants in the issues above say that older versions of Terraform were not previously working in WSL, which seems to be the opposite of what this issue is representing. (Admittedly nobody has reported that these changes did fix Terraform, so all we know right now is that some WSL systems cannot DNS on Terraform v1.2 and earlier, and some WSL systems cannot DNS on v1.3 and later but do work with v1.2 and earlier. It remains unclear whether these situations are connected.)

I have not yet done anything to confirm this, because I don't currently have access to a Windows system with WSL to test with, but my unsubstantiated theory based on the above is that something on the path between you and our DNS servers was trying to work around the problem that caused golang/go#44135 by omitting the IPv6 records from the response so the packets would be shorter, but now the Go resolver is allowing a longer response size and so that workaround no longer applies and so the server is returning both the A records and the AAAA records. Then for some reason not yet explained the Go network stack is preferring to use the IPv6 address instead of the IPv4 address, which fails because your system has a non-functional IPv6 setup (which is true for all WSL, according to microsoft/WSL#5855).

I'm not sure yet how best to test this. Perhaps it would be possible to make a custom build of the Go toolchain that omits those particular commits and see if that works better, but that seems pretty finicky and so hopefully we can find a more convenient way to test this theory without creating any custom builds, such as monitoring the DNS requests and responses from both the working and non-working versions using a packet capture tool.

@apparentlymart
Contributor

Today I tried the packet capture technique to try to quickly disprove my above theory, and I succeeded in disproving it. This does not seem to be the result of a change in the pure Go DNS resolver.

The rest of this is some details about what I did in case anyone wants to poke holes in my methodology. 😀


I downloaded and extracted the official .zip archives for two Terraform versions:

$ /tmp/terraform12 version
Terraform v1.2.9
on linux_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.1

Your version of Terraform is out of date! The latest version
is 1.3.2. You can update by downloading from https://www.terraform.io/downloads.html

$ /tmp/terraform13 version
Terraform v1.3.2
on linux_amd64
+ provider registry.terraform.io/hashicorp/null v3.1.1

I'm using the linux_amd64 builds, which is the same platform used under WSL. However, I'm testing this on the Ubuntu system I use for my everyday Terraform work and not in WSL. For my purposes here I don't think this matters, because I'm primarily interested in which DNS queries Terraform is sending and the DNS resolver code in the linux_amd64 builds is the same regardless of whether running in Linux on real hardware or Linux in WSL 2.

I'm working in a configuration that contains only a requirement for the hashicorp/null provider, purely so that terraform init will have a reason to contact Terraform Registry.

With each of those executables in turn, I:

  • Cleared my local DNS cache by restarting my system's local DNS resolver (systemd-resolved).
  • Deleted any existing .terraform directory to ensure I'm starting from a clean slate.
  • Started capturing any outgoing ethernet packets from my system where the protocol is either TCP or UDP and the destination port is 53 (which is the well-known port for DNS).
  • Ran terraform init and watched it download and install hashicorp/null.
  • Halted the packet capture.

After this I carefully inspected the query and answer packets related to registry.terraform.io from both versions, and looked for any differences.

I can see both versions are sending the "EDNS" extension record, from which I conclude that both versions include net: send EDNS(0) packet length in DNS query.

I also carefully compared the packets from both versions byte-for-byte. In both cases Terraform sent queries for both IN A registry.terraform.io and IN AAAA registry.terraform.io. The queries are byte-for-byte identical aside from the first two bytes, which represent the DNS transaction ID and are therefore expected to vary.

The responses were not exactly identical but as far as I can tell they only varied in ways that are reasonable: some of the records had a different TTL in one response than the other, and a couple of the results were returned in a different order.

Based on this, I'm concluding that there is no substantial difference in DNS resolver behavior between the official v1.2.9 and v1.3.2 builds, and therefore the cause for this difference in behavior must lie elsewhere. I think the next area of interest is whatever logic in the Go network stack selects only one of the many different IPv4 and IPv6 addresses to try to connect to; I'm wondering if the network library is now giving higher preference to the IPv6 addresses than it used to, for some reason.

@apparentlymart
Contributor

Immediately after sending the previous message I realized I have skipped a step: I also intended to compare the results from a CGO_ENABLED=1 build with the official builds. I will do that now to see how that affects things, before I start studying the interactions between the address selection algorithm and the resolver.

@apparentlymart
Contributor

I have also now poked a hole in my own methodology: by monitoring outgoing packets I've been testing the behavior of systemd-resolved rather than of the resolver implementation inside the Terraform executable.

I'm going to repeat what I did above while monitoring the communication between Terraform and systemd-resolved instead. In other words, I'm going to monitor local loopback instead of my real network interface, because my /etc/resolv.conf contains 127.0.0.53.

@pacorreia
Author

Compiling the kernel with all the IPv6 flags was useless; it's still missing other bits needed for IPv6 routing.

@apparentlymart
Contributor

Okay, some more interesting results now that I'm actually monitoring what I intended to monitor. 🙄

The EDNS extension packet is different in each case:

  • Official Terraform v1.2.9: not present at all
  • Official Terraform v1.3.2: present, and advertises maximum response packet size 1232
  • Locally-built Terraform v1.3.2 with CGO_ENABLED=1: present, and advertises maximum response packet size 1200

In the last case, the resolver implementation is the one from my own system's libc, which happens to be Ubuntu glibc 2.31-0ubuntu9.9. So that particular case is likely to vary on other systems with different libc. I'm not sure if the Ubuntu 20.04 image for WSL has the same libc, but I'm guessing probably so since I expect they intend to be binary compatible with "normal" Ubuntu 20.04.

I think this puts the EDNS theory back on the table again.

However, my initial mistake did draw my attention to something I didn't previously consider: systemd-resolved itself seems to send its outgoing requests always with a fixed EDNS maximum packet size of 512 bytes, regardless of what the original client requested. That means that the final authoritative nameserver for registry.terraform.io is still seeing the same maximum packet size regardless of which of these builds I use; the difference is only visible on the first hop between Terraform and my local recursive resolver.

If Ubuntu 20.04 in WSL also uses systemd and also has the systemd-resolved address in /etc/resolv.conf then I think that would remove this theory from consideration.

@pacorreia you previously stated that your /etc/resolv.conf contains your local network's gateway address as the configured nameserver, which suggests that systemd-resolved is not involved in your case. Can you confirm? If you are using systemd-resolved then I would expect /etc/resolv.conf to be the one generated by systemd-resolved itself, which has a bunch of specific commentary at the start and then points to a loopback address, like this:

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad

If you aren't using systemd-resolved then I think my previous theory still remains valid, because the behavior of your local network's resolver is not visible to me and so I can't confirm whether it just passes on the EDNS message generated by Terraform or if, like systemd-resolved, it's replacing that with its own smaller value.

(Side note: phew, there are a lot of moving parts here! 😬 )

@pacorreia
Author

pacorreia commented Oct 7, 2022

So, on WSL2 there's no working systemd by default unless you tweak some bits to fake it. In this case it definitely relies only on what I've set in resolv.conf, with no other process involved.

@apparentlymart
Contributor

Okay, I think I've finally figured out what was going on upstream for these changes in each of the releases relevant to us:

  • net: increase maximum accepted DNS packet to 1232 bytes landed in Go 1.18, and changed the DNS client to silently accept 1232 bytes. This means that Go 1.18 stopped returning an error if an incorrect DNS server returned an oversize response, but it didn't actually advertise that it could accept larger responses.

    This change is included in all of the Terraform v1.2.x series releases, because we adopted Go 1.18 for v1.2.0; Terraform v1.2.9's official packages were built with Go 1.18.1.

  • net: send EDNS(0) packet length in DNS query landed in Go 1.19, and made the DNS client now actually announce that it can accept responses up to 1232 bytes in size. This doesn't change what the DNS client will accept in response, but it is now valid for a compliant nameserver to return larger response packets, which may make some resolvers now return different answers than before.

    This change was new in Terraform v1.3.0, because we adopted Go 1.19 for the official builds in the v1.3.x series. (Terraform v1.3.2 in particular is built with Go 1.19.1.)

@apparentlymart
Contributor

Thanks for confirming, @pacorreia!

Unfortunately it seems like it's going to be hard for me to successfully reproduce exactly what's true on your system, now that I know that any intermediate DNS resolver can potentially lower the advertised maximum packet size when it forwards a query. Even if I temporarily disabled systemd-resolved on my system, I'm going to be using a different chain of upstream resolvers than you'd be using on your system and so anything I might observe would only tell me how my upstream DNS servers behave, and not how yours do. 😖

Do you think you have the necessary software, time, and expertise to try to reproduce what I was doing on your system?

I can't give you exact instructions because the details are pretty fiddly and it would take me all evening to write it out 😬 (and I'm at the end of my work week now anyway), but here's a summary of what I was doing here:

  • I used ngrep (ngrep -d lo -O FILENAME.pcap "" "port 53", running as root) to capture port 53 packets to the given file in "pcap" format. I used a different filename for each of the three Terraform executables I tried so I'd be able to compare them.

    (Switching from my ethernet interface to lo in this command line is what I did when I realized I was monitoring systemd-resolved instead of Terraform directly, but since you aren't using resolved I expect you'll actually want to monitor whichever of your interfaces your default route is associated with; on WSL that may be a "tun" or "tap" or other sort of virtualization to bridge out to the real network card in your Windows system, but I dunno the details of how that works.)

  • I used Wireshark to load each of those .pcap files. Wireshark contains IP, UDP, and DNS packet decoders so this makes it easier to peep at the details of the packets and see what's going on, rather than having to manually decode the raw bytes.

  • Broadly speaking, I was just looking at the requests and associated responses for IN A registry.terraform.io and IN AAAA registry.terraform.io to see if anything seemed materially different between them.

    On my system I was primarily concerned with the differences in the requests, and I suppose it would be interesting to first confirm whether you see the same differences as I did above when you try the same two official executables and your local build with CGO_ENABLED=1. The value I was comparing is "UDP payload size" under "Additional records", although for v1.2.9 I expect you won't see "Additional records" appear at all.

    A screenshot of the field I described above in Wireshark, in case you're able to see images

    Assuming you do see similar differences between the query packets, I'd be interested to hear if you see differences in the response packets too. In particular, I'm curious to know whether you see any answers for the IN AAAA registry.terraform.io question in each of the cases: my hypothesis above is that when you run with v1.2.9 you'll either see no response at all to that question or the response will be different somehow, causing the Go resolver to prefer to use the IPv4 addresses.

If you're not able to try this for any reason then no worries... we can try to find a different way to investigate this. But if you can capture these packets and share what you learn then I think that'll be the most direct way to prove or disprove my theory without having to make any custom builds of Go and Terraform.
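
If Wireshark isn't available, the field described above can also be read straight from the raw DNS payload bytes. The following is only a rough sketch under some assumptions: advertisedUDPSize, skipName, and sampleQuery are hypothetical helpers, and the sample message is hand-built for illustration rather than taken from a real capture. It walks the question section and then the resource records, and reports the CLASS field of the first OPT record it finds, which is the advertised "UDP payload size".

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// skipName returns the offset just past the domain name that starts at off,
// or -1 if the message ends before the name does.
func skipName(msg []byte, off int) int {
	for off < len(msg) {
		l := int(msg[off])
		switch {
		case l == 0: // root label terminates the name
			return off + 1
		case l&0xC0 == 0xC0: // compression pointer: two bytes, ends the name
			if off+2 > len(msg) {
				return -1
			}
			return off + 2
		default: // ordinary label: length byte plus that many characters
			off += 1 + l
		}
	}
	return -1
}

// advertisedUDPSize reports the UDP payload size announced by the first
// EDNS(0) OPT record in a raw DNS message, and whether one was present.
func advertisedUDPSize(msg []byte) (uint16, bool) {
	if len(msg) < 12 {
		return 0, false
	}
	questions := int(binary.BigEndian.Uint16(msg[4:]))
	records := int(binary.BigEndian.Uint16(msg[6:])) + // answers
		int(binary.BigEndian.Uint16(msg[8:])) + // authority
		int(binary.BigEndian.Uint16(msg[10:])) // additional

	off := 12
	for i := 0; i < questions; i++ {
		off = skipName(msg, off)
		if off < 0 || off+4 > len(msg) {
			return 0, false
		}
		off += 4 // QTYPE + QCLASS
	}
	for i := 0; i < records; i++ {
		off = skipName(msg, off)
		if off < 0 || off+10 > len(msg) {
			return 0, false
		}
		typ := binary.BigEndian.Uint16(msg[off:])
		class := binary.BigEndian.Uint16(msg[off+2:])
		rdlen := int(binary.BigEndian.Uint16(msg[off+8:]))
		if typ == 41 { // OPT: CLASS holds the UDP payload size
			return class, true
		}
		off += 10 + rdlen
	}
	return 0, false
}

// sampleQuery hand-builds an AAAA query for registry.terraform.io, optionally
// with an OPT record announcing 1232 bytes. Synthetic data, not a capture.
func sampleQuery(withOPT bool) []byte {
	q := []byte{0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}
	q = append(q, "\x08registry\x09terraform\x02io\x00"...)
	q = append(q, 0x00, 0x1c, 0x00, 0x01) // QTYPE=AAAA, QCLASS=IN
	if withOPT {
		q[11] = 1 // ARCOUNT = 1
		q = append(q, 0x00, 0x00, 0x29, 0x04, 0xd0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
	}
	return q
}

func main() {
	for _, withOPT := range []bool{false, true} {
		size, ok := advertisedUDPSize(sampleQuery(withOPT))
		fmt.Printf("OPT present=%v, advertised UDP payload size=%d\n", ok, size)
	}
}
```

To use it on real traffic, the bytes passed to advertisedUDPSize would need to be the DNS payload only, with the Ethernet, IP, and UDP headers already stripped.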

I'm about to be away for a long weekend so I'll be quiet for a bit now, but my other colleagues on the team might jump in here if you're able to turn up something which is a good lead to tug on some more. Otherwise, I'll check back in next week. Thanks!

@pacorreia
Author

Many, many thanks for the great work you did, and yes, I got the idea of what you did and will try to follow it. It's already 1:48 AM here, so tomorrow I'll try to capture those results and share my conclusions 👌

@pacorreia
Author

@apparentlymart So I followed your trail and was able to capture traffic and look for the fields you mentioned.

Indeed, for 1.2.9 no "Additional records" section appears, but for the 1.3.1 release there is one, and in my case the UDP payload size was 1232.

About the responses, there's a clear difference:
1.2.9:
image

1.3.1:
image

@pacorreia
Author

Found the reason why Terraform was failing for me with WSL2, although it shouldn't.

By default, WSL2 sets up a NAT interface on the host and shares the host's Internet connection with the VMs.
The default nameserver in WSL2 is the host NAT interface IP, for example 172.29.97.1.

My setup, however, was using my home gateway as the nameserver, bypassing the host NAT for WSL.

As soon as I changed back to using the host NAT IP address as the nameserver, Terraform 1.3.0 and later versions started working.

Still, I'm intrigued as to why the other way breaks how Go gets and uses the results.

I'll run the packet capture with this new change and post the results later.

@pacorreia
Author

My conclusion is that as long as the nameserver is left at the default configuration, pointing to the host IP, it will work.

It's interesting that using a different config breaks the normal behavior.

I'll close this issue, as it works under WSL2; it just does not like a custom DNS configuration.

@apparentlymart
Contributor

Thanks for following up, @pacorreia!

It does sound strange to me that using your normal resolver would cause different behavior but I have a guess as to why: perhaps the intermediate resolver normally used in WSL knows that the WSL environment doesn't have a functioning IPv6 interface and so it locally filters out the AAAA records in the response to pretend to all software running inside WSL that there are no IPv6 addresses on the internet.

By bypassing that local resolver you allowed Terraform to see that there is an IPv6 address available and then Terraform tried to connect to it.

It isn't clear to me yet why the usual fallback to IPv4 didn't work here, but I suspect that's probably an artifact of how WSL virtualizes the network connection: it is perhaps responding in a different way than is typical for a slow or non-functional IPv6 connection, which is then causing the Go network implementation to treat it as a fatal error rather than falling back to a different address.

I'm glad we have an explanation at least, even if it's an incomplete one, and that you found a configuration that works. Thanks again!

@pacorreia
Author

@apparentlymart

For my own curiosity, I grabbed the packets again.
With the old nameserver and tf 1.3.1, here are the DNS requests:
image

With the working nameserver, same tf version:
image

@github-actions
Contributor

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022