Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Waiting for proxmox_virtual_environment_vm's ipv4_addresses does not really work #100

Closed
otopetrik opened this issue Jul 27, 2022 · 5 comments · Fixed by #115 or #345
Closed

Waiting for proxmox_virtual_environment_vm's ipv4_addresses does not really work #100

otopetrik opened this issue Jul 27, 2022 · 5 comments · Fixed by #115 or #345
Labels
🐛 bug Something isn't working

Comments

@otopetrik
Copy link
Contributor

Describe the bug
Waiting for any IP address is not enough. The fastest available one (IPv6 link-local) is usually not enough.
Even in case of actual IPv4 and IPv6 addresses, (random) one of them would be acquired first, making it impossible to reliably connect/provision.

To Reproduce
Steps to reproduce the behavior:

  1. Clone VM from a template that supports both IPv4 and IPv6
  2. Fail to connect for provisioning
  3. See error:
proxmox_virtual_environment_vm.example: Still creating... [40s elapsed]
╷
│ Error: Error in function call
│ 
│   on main.tf line 51, in resource "proxmox_virtual_environment_vm" "example":
│   51:     host = element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0)
│     ├────────────────
│     │ self.ipv4_addresses is list of list of string with 2 elements
│     │ self.network_interface_names is list of string with 2 elements
│ 
│ Call to function "element" failed: cannot use element function with an empty list.
  1. Discover that
    if nic.IPAddresses == nil || (nic.IPAddresses != nil && len(*nic.IPAddresses) == 0) {
    waits for any (IPv4 or IPv6) address on non-loopback interface
  2. Realize that provisioning step
    host = element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0)
    has to pick either IPv4 or IPv6 and then needs provider to wait until agent reports suitable address.

Expected behavior
The following line

host = element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0)

should work...

Screenshots
excerpt from terraform.tfstate

           "ipv4_addresses": [
              [
                "127.0.0.1"
              ],
              []
            ],
            "ipv6_addresses": [
              [
                "::1"
              ],
              [
                "fe80::c88a:70ff:fe37:225f"
              ]
            ],

At this time, the machine can be connected only over IPv6, and using link-local address. Chances are even in IPv6-only scenario, there is a router between the machine running terraform and the VM.

Additional context
In

func (c *VirtualEnvironmentClient) WaitForNetworkInterfacesFromVMAgent(ctx context.Context, nodeName string, vmID int, timeout int, delay int, waitForIP bool) (*VirtualEnvironmentVMGetQEMUNetworkInterfacesResponseData, error) {
the waitForIP boolean is not enough.
There should more complex condition.
Probably adding something like the following to proxmox_virtual_environment_vm's agent block:

agent {
  enabled = true
  expect_networking {
    ipv4 = true # will wait for IPv4 address even though it already has IPv6 address
    ipv6_link = false # will wait for IPv6, and consider fe80::/10 good enough
    ipv6 = false # will wait for IPv6 address not in fe80::/10
  }
}

(defaulting only ipv4 to true seems reasonable)
The provider should wait until all condition marked true are met.

Even better solution would be to repeat expect_networking block per-interface:

agent {
  enabled = true
  expect_networking {
    interface = "eth0"
    ipv4 = true
  }
  expect_networking {
    interface = "eth2"
    ipv4 = false
    ipv6 = true
  }
}

This example would result in waiting until eth0 has an IPv4 address and eth2 has non-link-local IPv6 address (even in if eth1 interface already has IPv4 and IPv6 addresses).
This could be useful for VMs that have static addresses on some interfaces (e.g. router VMs,...)

Omitting interface field should allow any interface to fulfil the condition imposed by the block.
Then the first example would be a bit stricter - it would mean that there has to be one interface that has both IPv4 and IPv6 addresses.
Weaker condition, requiring IPv4 and IPv6 addresses, but optionally on different interfaces would look like this:

agent {
  enabled = true
  expect_networking {
    ipv4 = true
    ipv6 = false
  }
  expect_networking {
    ipv4 = false
    ipv6 = true
  }
}

(VM with single network interface that has both IPv4 and IPv6 would also fulfil this condition)

@otopetrik otopetrik added the 🐛 bug Something isn't working label Jul 27, 2022
@otopetrik
Copy link
Contributor Author

otopetrik commented Jul 30, 2022

I thought about it a bit more, and it looks like it would be better to have waiting requirements on interfaces themselves, something like:

resource "proxmox_virtual_environment_vm" "example_vm" {
  network_device {
    wait_for_ipv4 = [ ... ]
    wait_for_ipv6 = [ ... ]
  }
  network_device {
    wait_for_ipv4 = [ ... ]
    wait_for_ipv6 = [ ... ]
  }
}

Instead of simple true or false, there should be a list of rules.
Each rule consists of list of blocks - optional reject blocks at the beginning of the list, followed by optional accept blocks.
Rule is satisfied by given IP address if:

  1. no reject blocks rejects it
  2. if there are accept blocks, at least one has to accept it.

In other words, rule is satisfied by given IP address if:

  • the list of blocks is empty
  • no reject block rejects the IP address and there are no accept blocks
  • no reject block rejects the IP address and one of the accept blocks accepts the IP address.

Each rule is a requirement for one IP address of the interface. Once an IP address is used to satisfy a rule,
it is not considered for satisfying following rules.

Empty list of rules means "do not wait for address" on this interface:

wait_for_ipv4 = [] # do not wait for ipv4 address

A list of rules containing empty list the only rule means "wait for any address":

wait_for_ipv4 = [
  [], # rule accepts any IP address
]

Accept blocks (IPv4)

  • exact IPv4 address, examples:
    • "192.168.0.1" - mainly to be used as check, that static IP configuration inside VM matches what was configured using cloud-init)
  • network, within which the IPv4 address must lie, examples:
    • "192.168.0.0/16" - require IPv4 address to lie in that network
    • "169.254.0.0/16" - require link-local autoconfigured IPv4 address
  • any IPv4 address
    • "0.0.0.0/0"
  • alias
    • "any" same as "0.0.0.0/0"
    • "private" same as having following blocks "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"
    • "link" same as "169.254.0.0/16"

Accept blocks (IPv6)

  • exact IPv6 address, examples:
    • "fda2:e0a5:c45:ffff::1" - mainly to be used as check, that static IP configuration inside VM matches what was configured using cloud-init)
  • network, within which the IPv6 address must lie, examples:
    • "fda2:e0a5:c45:ffff::/64" - require IPv4 address to lie in that network
    • "fe80::/64" - require link-local autoconfigured IPv6 address
    • "fc00::/7" - require unique private IPv6 address (RFC4193)
  • any IPv6 address
    • "::"
  • alias
    • "any" same as "::"
    • "private" same as "fc00::/7"
    • "link" same as "fe80::/64"

Reject blocks

  • negated accept block by prefixing "!"
    • "!192.168.0.1" rule rejects this specific IP address
    • "!192.168.0.0/16" rule rejects IP address from that network
    • "!private" same as having following blocks "!10.0.0.0/8", "!172.16.0.0/12", "!192.168.0.0/16" (reject IP address in any of RFC1918 ranges)
  • "!*" reject anything that would be accepted by an accept block of previous rules
    • Like repeating accept blocks from all previous rules but negated
    • Main use-case is to reject any additional IPv6 link-local or private addresses when requiring global IPv6 address in last rule of the rule list

Note that "any" is not that useful, it is automatically implied in rule list that contains only reject blocks.

Defaults

Default should be:

  network_device {
    wait_for_ipv4 = [
      [], # wait for any IPv4 address
    ]
    wait_for_ipv6 = [ ] # do not wait for IPv6 address
  }

Examples

Wait for both IPv4 and IPv6 addresses that imply some any connectivity

  network_device {
    wait_for_ipv4 = [
      [ "!link" ],  # wait for IPv4 address except 192.254.0.0/16
   ]
    wait_for_ipv6 = [
      [ "!link" ],  # wait for IPv6 address except fe80::/64
    ]
  }

Wait for private IPv6 addresses

  network_device {
    wait_for_ipv6 = [
      [ "private" ],  # wait for IPv6 address from "fc00::/7"
    ]
  }

Wait for global IPv6 addresses

  network_device {
    wait_for_ipv6 = [
      [ "!link", "!private" ],  # wait for IPv6 address except from "fe80::/64" or "fc00::/7"
    ]
  }

or

  network_device {
    wait_for_ipv6 = [
      [ "link" ],
      [ "!private", "!*" ],  # wait for IPv6 address except from "fc00::/7" and anything that would match the earlier rules
    ]
  }

Wait for two different private IPv6 addresses
(only useful if IPv6 temporary addresses are disabled)

  network_device {
    wait_for_ipv6 = [
      [ "private" ], # the IPv6 address that matches this rule will not be considered for the following rule...
      [ "private" ],
    ]
  }

Wait for two different private IPv6 addresses out of three possible networks, and a global IPv6
(e.g. at least two out of three routers in a network are working, and there is global connectivity)

  network_device {
    wait_for_ipv6 = [
      [ "fdf2:f3bf:406c::/48", "fd8e:eb98:4b15::/48", "fd00:83ec:7b96::/48" ],
      [ "fdf2:f3bf:406c::/48", "fd8e:eb98:4b15::/48", "fd00:83ec:7b96::/48" ],
      [ "!link", "!*" ],
    ]
  }

Details

Matching network_device configuration with qemu guest agent output would have to be done using MAC addresses.
If user changes MAC address inside VM, then wait_for_ipv4/wait_for_ipv6 would have to be set to empty list for that interface, to avoid waiting forever of interface with MAC address that will not appear.

If agent.enabled is false, it will obviously not wait for anything.

It is possible to wait for:

  • IPv4
  • IPv6
  • IPv4 and IPv6

It is not possible to wait for IPv4 or IPv6. I have not found a use case that would actually require it.

Possible additional use

It could be useful to provide outputs waited_for_ipv4 and waited_for_ipv6.
Their order would match network_device declarations - no need to search for "eth0" or "ens18", etc...
For each interface the list would contain one IP address per rule (in order).

resource "proxmox_virtual_environment_vm" "example_vm" {
  network_device {
    wait_for_ipv6 = [
      [ "link" ],
      [ "private" ],
      [ "!*" ],
    ]
  }
}
locals {
  link_ipv6_address = proxmox_virtual_environment_vm.example_vm.waited_for_ipv6[0][0]
  private_ipv6_address = proxmox_virtual_environment_vm.example_vm.waited_for_ipv6[0][1]
  global_ipv6_address = proxmox_virtual_environment_vm.example_vm.waited_for_ipv6[0][2]
  global_ipv6_address_another_way = proxmox_virtual_environment_vm.example_vm.waited_for_ipv6[0][-1]
}

Given that last wait_for rule would generally match the most non-local (either private or global) IPv6 address, using
waited_for_ipv6[0][-1] as default ip address to use for external connections makes sense. Maybe even using waited_for_ipv6[0][-2] for configuring "internal" links between VMs.

e.g. In production environment [-1] is global, [-2] is private. In dev/test [-1] is private [-2] is link-local.

Note that qemu geust agent does not provide information about which IP addresses are temporary, therefore using waited_for_ipv6 would be reliable only if temporary addresses are disabled in the VM. Or if qemu guest agent is extended.

Why the long comment ?

Before attempting to write a patch, I'm looking for a feedback.
Is there a use case I have missed ? Is there a flaw in my reasoning ? etc...

@bpg
Copy link
Owner

bpg commented Aug 2, 2022

Thanks @otopetrik, this is really informative. Let me think about this problem for a little bit.

@bpg
Copy link
Owner

bpg commented Aug 8, 2022

I'm not sure if the full-featured set of rules per interface and rule matching logic is the best way to go. It also seems waiting for either (not both) of IPv4 or IPv6 is a valid use case during VM provisioning -- the first address returned can be used to establish a control connection to the VM for further customization.

Though I agree that a simple wait for any address is not enough. We may need to make sure the address is routable, as well as we may need to filter out some addresses.

I was looking around how other Terraform provides approached this problem, it looks like vsphere provider might be a good example. Something like combination of VM options wait_for_guest_net_routable and ignored_guest_ips might be sufficient to cover majority of use cases. Implementation is in WaitForGuestNet

But anyway, this is just my opinion, I don't have IPv6 set up in my proxmox environments, so can't really test the use cases you are describing and probably am missing some important details.

@otopetrik
Copy link
Contributor Author

I'm not sure if the full-featured set of rules per interface and rule matching logic is the best way to go. It also seems waiting for either (not both) of IPv4 or IPv6 is a valid use case during VM provisioning -- the first address returned can be used to establish a control connection to the VM for further customization.

Even assuming link-local addresses are ignored, it would work fully only in IPv4-only networks or IPv6-only networks.

In mixed networks, there is no way to know, if DHCPv4 server response will be faster than DHCPv6/RA response.
One of ipv4_addresses/ipv6_addresses outputs for each interface will be an empty list.

For provisioning, it would likely be possible to work around it using conditional expressions.
But it would still mean that either IPv4 address or IPv6 address of the VM is unavailable to the rest of the Terraform configuration.

Though I agree that a simple wait for any address is not enough. We may need to make sure the address is routable, as well as we may need to filter out some addresses.

I was looking around how other Terraform provides approached this problem, it looks like vsphere provider might be a good example. Something like combination of VM options wait_for_guest_net_routable and ignored_guest_ips might be sufficient to cover majority of use cases. Implementation is in WaitForGuestNet

That seems to use routing table from inside the VM. Qemu guest agent provides only MAC address, interface name, and list of IP addresses. There is not a portable way to get default gateway address, and running guest-exec to get default gateway would be fragile.

But something like configurable wait_for_guest_net_timeout seems like a good idea. Currently even Ctrl-C does not work (it must be pressed second time).

But anyway, this is just my opinion, I don't have IPv6 set up in my proxmox environments, so can't really test the use cases you are describing and probably am missing some important details.

I have noticed this bug on an IPv4-only network, when VM provisioning sometimes failed. I thought that ignoring the link-local address would not fix the issue completely.

But it looks like filtering 169.254.0.0/16 and fe80::/10 might be close enough for most cases.
And implementing a complex rule set can be postponed until it is absolutely needed.

It fixes the immediate bug - it should be possible to provision VM templates with IPv4 and IPv6 enabled:

  • on IPv4-only or IPv6-only networks it should just work
  • on IPv4+IPv6 networks it should be possible using conditional expressions, but it would not be possible to depend on ipv4_addresses or ipv6_addresses in the remainder of the configuration.

@bpg bpg closed this as completed in #115 Aug 15, 2022
otopetrik added a commit to otopetrik/terraform-provider-proxmox that referenced this issue May 23, 2023
VM can get IPv6 link-local address faster than a DHCP server response,
that results in 'ipv4_addresses' output being an empty list.
It is then impossible to provision the VM using 'connection.host' field
derived from 'self.ipv4_addresses'.

Once again change waiting for IP address to wait for better address than
IPv4 link-local addresses and IPv6 link-local addresses.

Should not break bpg#182, because it requires only one GlobalUnicast address per VM.
@otopetrik
Copy link
Contributor Author

After long time updated to a newer version, the bug reappeared.

Looks like the #182 reintroduced it.

It completely ignores the result of ip.IsGlobalUnicast call.

The missingIP is initialized to false, and after the #182 changes it is set to true only if a interface has no addresses at all.

  • Loopback always has an ip address, therefore it does not change missingIP to true
  • eth0 usually gets link-local first IPv6, at that point it no longer sets missingIP to true

Variable missingIP stays false, expression !missingIP is true, and condition

if hasAnyGlobalUnicast || !missingIP {

is true, because of the right side of || operator is true.
All without any dependency on whether there is any IP address, that actually satisfies IsGlobalUnicast or not.

Relevant code from terraform.tfstate

"ipv4_addresses": [
  [
    "127.0.0.1"
  ],
  []
],
"ipv6_addresses": [
  [
    "::1"
  ],
  [
    "fe80::acd:efff:feab:cdef"
  ]
],
"network_interface_names": [
  "lo",
  "eth0"
],

At this point terraform fails again on:

host = element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0)

as described in the original issue.

bpg pushed a commit that referenced this issue May 23, 2023
…#345)

fix: Wait for 'net.IsGlobalUnicast' IP address, again (#100)

VM can get IPv6 link-local address faster than a DHCP server response,
that results in 'ipv4_addresses' output being an empty list.
It is then impossible to provision the VM using 'connection.host' field
derived from 'self.ipv4_addresses'.

Once again change waiting for IP address to wait for better address than
IPv4 link-local addresses and IPv6 link-local addresses.

Should not break #182, because it requires only one GlobalUnicast address per VM.
@ghost ghost mentioned this issue May 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐛 bug Something isn't working
Projects
None yet
2 participants