No ip addresses are provided with enabled agent #776

Open
BaldFabi opened this issue Dec 6, 2023 · 16 comments
Labels
acknowledged 🐛 bug Something isn't working

Comments

@BaldFabi

BaldFabi commented Dec 6, 2023

Describe the bug
I'm trying to clone a VM from a template with the agent enabled. I found issue #100, which looks related to my problem.
It looks like the provider doesn't wait long enough (or something like that), because the IP is displayed in the Proxmox GUI in the VM summary. The IP is obviously not available to Proxmox instantly, only a couple of seconds after the VM has started.

To Reproduce
Steps to reproduce the behavior:

  1. Create a template which has the qemu agent already installed
  2. Create a config that clones that created template with an enabled agent directive
  3. The clone will fail and no ip is saved in the state file. But next to the vm the ip is displayed (as shown in the screenshot)
resource "proxmox_virtual_environment_vm" "machinexyz" {
  name      = "machinexyz"
  node_name = "server01"

  operating_system {
    type = "l26"
  }

  on_boot = true

  clone {
    vm_id = 912
  }

  agent {
    enabled = true
  }

  memory {
    dedicated = 4096
  }

  cpu {
    cores = 4
    type  = "x86-64-v2-AES"
  }

  disk {
    datastore_id = "pool1"
    size         = 20
    interface    = "scsi0"
  }

  connection {
    type     = "ssh"
    user     = "root"
    password = local.root_password
    host     = self.ipv4_addresses[0]
  }
}

Expected behavior
The provider should wait for the defined (or default) value of the timeout option.

Screenshots
[Screenshot: IP in the Proxmox GUI]

The error

╷
│ Error: Attempt to index null value
│
│   on machine.tf line 45, in resource "proxmox_virtual_environment_vm" "machine":
│   45:     host     = self.ipv4_addresses[0]
│     ├────────────────
│     │ self.ipv4_addresses is null
│
│ This value is null, so it does not have any indices.

Additional context

  • It's a single instance Proxmox server
  • Provider version 0.39.0
  • Terraform version v1.5.7
  • OS: MacOS
@bpg bpg added the 🐛 bug Something isn't working label Dec 6, 2023
@BaldFabi
Author

BaldFabi commented Dec 10, 2023

I just tried some things and found out that a previous warning I also had is the reason for this. My template had the iothread option set on the hard disk.
After removing it, the ipv4_addresses attribute wasn't null anymore.
It's a little bit weird that the warning causes this.

Edit: And at the moment I don't have a clue how the ipv4_addresses is structured.

@bpg
Owner

bpg commented Dec 12, 2023

> I just tried some things and found out that a previous warning I also had is the reason for this. My template had the iothread option set on the hard disk.

That could be related to #360; changing disk attributes while cloning might not always work as expected.
But I'm also curious what the "previous warning" was that you also saw. Do you have it captured somewhere, by any chance?

Also, you may want to skip the disk block in the clone if you just want to use the disk from the template.
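
For illustration only, a clone that reuses the template's disk unchanged might look roughly like the sketch below. It reuses the names from the original report (machinexyz, server01, template 912) and simply leaves out the disk block; everything else about the VM is an assumption and should be adapted.

```hcl
resource "proxmox_virtual_environment_vm" "machinexyz" {
  name      = "machinexyz"
  node_name = "server01"

  clone {
    vm_id = 912 # template that already has qemu-guest-agent installed
  }

  agent {
    enabled = true
  }

  # No disk block here: the intent is to use the disk from the template as-is,
  # instead of changing disk attributes while cloning.
}
```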

> Edit: And at the moment I don't have a clue how the ipv4_addresses is structured.

You can check your local terraform.tfstate, for my test VM it is

            "ipv4_addresses": [
              [
                "127.0.0.1"
              ],
              [
                "192.168.3.205"
              ]
            ],

@vrcdx64

vrcdx64 commented Dec 12, 2023

Hello,

  • Proxmox single node 8.0.3
  • Provider version 0.40.0
  • Terraform 1.6.5

I'm learning Terraform and I have exactly the same problem as described by the author. My template, with qemu-agent enabled, doesn't use iothread (the default value is false).

If I check the values of ipv4_addresses, ipv6_addresses or network_interface_names in Terraform after the error, the lists are empty. But in the Proxmox web UI the values are there. I've tried adjusting the agent timeout value, but the default is 15m, which should be enough.

I'll see if I can dig into the problem. Don't hesitate to ask if I can help troubleshoot.

@BaldFabi
Author

I just reran Terraform with the iothread attribute set on the template to trigger the warning again:

╷
│ Warning: the VM startup task finished with a warning, task log:
│
│       | WARN: iothread is only valid with virtio disk or virtio-scsi-single controller, ignoring
│       | TASK WARNINGS: 1
│
│   with proxmox_virtual_environment_vm.machine,
│   on machine.tf line 1, in resource "proxmox_virtual_environment_vm" "machine":
│    1: resource "proxmox_virtual_environment_vm" "machine" {
│
╵

> Also, you may want to skip the disk block in the clone if you just want to use the disk from the template.

But if I skip the disk block the disk wouldn't be cloned right?
Otherwise I would have to recreate the template each time I want to provision a new vm or skip this step and provision and install it via iso.

> You can check your local terraform.tfstate, for my test VM it is
>
>             "ipv4_addresses": [
>               [
>                 "127.0.0.1"
>               ],
>               [
>                 "192.168.3.205"
>               ]
>             ],

That's a good hint. I did a rerun without the iothread attribute to avoid the warning.
The ipv4_addresses in my state file are now like yours. Wouldn't it make sense to purge 127.0.0.1 and simplify the slice to be one dimensional?

@otopetrik
Contributor

> Wouldn't it make sense to purge 127.0.0.1 and simplify the slice to be one dimensional?

Probably not. There are use cases for VMs with multiple interfaces (router, internal cluster networks, etc...), and even some use cases for one interface to have multiple addresses (high-availability using virtual ip).

The provider waits for one "reasonable" IP address (i.e. better than link-local). This fixes the original issue, where the link-local IPv6 address was obtained faster than the IPv4 address from the DHCP server.

In cases where waiting for multiple interfaces/addresses is required, it should be possible to delay starting the qemu-guest-agent inside the VM until all addresses are obtained (by modifying guest agent's systemd unit dependencies).

ipv4_addresses data is taken directly from qemu-guest-agent, which reports all interfaces (including loopback) and uses the names used by the system inside the VM (i.e. not "net0" but "eth0", "eno1", "enp5s0", and likely even language-specific names in the case of Windows VMs).

Using fixed index like self.ipv4_addresses[0] does not really work.

Using element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0) should work - assuming that the interface inside the VM is "eth0", and not enp5s0 or similar.
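
As a rough sketch, the same lookup could be written as an output and wrapped in Terraform's built-in try(), so that it evaluates to null instead of raising an error while the guest agent has not reported any interfaces yet. The interface name "eth0", the local value, and the output name are assumptions for this example; the resource name is taken from the original report.

```hcl
locals {
  # Assumption: the interface name as seen inside the guest.
  guest_interface = "eth0"
}

output "machinexyz_ipv4" {
  # index() finds the position of the interface name, the inner element()
  # selects that interface's address list, the outer element() takes its
  # first address. try() returns null if any step fails, for example when
  # network_interface_names is still empty or ipv4_addresses is null.
  value = try(
    element(
      element(
        proxmox_virtual_environment_vm.machinexyz.ipv4_addresses,
        index(
          proxmox_virtual_environment_vm.machinexyz.network_interface_names,
          local.guest_interface
        )
      ),
      0
    ),
    null
  )
}
```

This only hides the failure; it does not make the provider wait any longer for the agent.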

It might be useful to add something like ipv4_addresses_by_device[], which would use mac addresses of configured network devices to find matching ip addresses from qemu-guest-agent output, then ipv4_addresses_by_device[0] would really mean ipv4 address of 'net0' network device of the VM.

(Changing behavior of existing ipv4_addresses is probably not a good idea. It would break existing configurations and it can be useful to have access to IP addresses assigned to non-hardware interfaces - e.g. VPN, PPPoE,...)

@Sorixelle

Sorixelle commented Feb 10, 2024

I ran into this issue today, where the state refresh was timing out, and ipv4_addresses etc. were empty after an apply. The issue turned out to be that I had not granted the Proxmox user I configured the provider with the VM.Monitor privilege, which seems to be required to be able to retrieve this information. Just dropping this one in here, in case anyone runs into the same problem.
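
A minimal sketch of how such a privilege might be granted through the provider itself, assuming the proxmox_virtual_environment_role resource with its role_id and privileges arguments. The role name is hypothetical, and the role would still have to be assigned to the provider's user or API token along with whatever other privileges the setup needs.

```hcl
# Hypothetical role; the role_id is an arbitrary name chosen for this example.
resource "proxmox_virtual_environment_role" "terraform_vm_monitor" {
  role_id = "terraform-vm-monitor"

  privileges = [
    "VM.Monitor", # reported above as required to retrieve the agent's network data
  ]
}
```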

I wonder, could this be handled better? The API route being called was returning a 403 response in this case, so it would be possible for the provider to catch this case and show an error message to the user. If that's desired, I can open a separate issue to track that.

@bpg
Owner

bpg commented Feb 14, 2024

Hi @Sorixelle! 👋🏼

Thanks for sharing your use case. That's a good suggestion; the provider can definitely handle this type of error better.
Please go ahead and open a separate issue for this enhancement, much appreciated!

@bpg-autobot
Contributor

bpg-autobot bot commented Aug 13, 2024

Marking this issue as stale due to inactivity in the past 180 days. This helps us focus on the active issues. If this issue is reproducible with the latest version of the provider, please comment. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

@bpg-autobot bpg-autobot bot added the stale label Aug 13, 2024
@bpg bpg added acknowledged and removed stale labels Aug 13, 2024
@Moortu

Moortu commented Oct 7, 2024

I was experiencing this issue.

I followed the advice above:

  • my Proxmox user had the VM.Monitor permission
  • I disabled iothread

When I ran the Terraform script a second time, after it had created the VMs but failed to get the IP addresses, it did find the ipv4_addresses.

The problem for me was that I had enabled UEFI and TPM but didn't have an efi_disk.
This gave warnings and it used efivars, but apparently it was not a blocker.

After I added the efi_disk there were no warnings, and it did get the ipv4_addresses and everything.
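
For reference, a UEFI/TPM VM with an explicit efi_disk might be declared roughly as follows. This is a sketch under assumptions: the block and attribute names (bios, efi_disk, tpm_state, type, version) are as I understand the provider's documentation and should be checked against the provider version in use; the resource name, node, and datastore are borrowed from the original report.

```hcl
resource "proxmox_virtual_environment_vm" "machinexyz" {
  name      = "machinexyz"
  node_name = "server01"

  bios = "ovmf" # UEFI firmware

  # With UEFI enabled but no efi_disk, the comment above reports startup
  # warnings and a fallback to efivars.
  efi_disk {
    datastore_id = "pool1"
    type         = "4m"
  }

  tpm_state {
    datastore_id = "pool1"
    version      = "v2.0"
  }

  agent {
    enabled = true
  }
}
```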

repo: https://github.com/Moortu/terraform-proxmox-talos-k8s

@neutralalice

> Using element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0) should work - assuming that the interface inside the VM is "eth0", and not enp5s0 or similar.

I gave this a shot, and found that on first apply, I'd get Call to function "index" failed: cannot search an empty list. While every apply after is fine.

@Moortu

Moortu commented Oct 22, 2024

>> Using element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0) should work - assuming that the interface inside the VM is "eth0", and not enp5s0 or similar.
>
> I gave this a shot, and found that on first apply, I'd get Call to function "index" failed: cannot search an empty list. While every apply after is fine.

I had this as well.
This seems to indicate a wrong configuration in my experience.
Do you use secure boot?

Or are you getting any warnings in your log?

@neutralalice

neutralalice commented Oct 22, 2024

>>> Using element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0) should work - assuming that the interface inside the VM is "eth0", and not enp5s0 or similar.
>>
>> I gave this a shot, and found that on first apply, I'd get Call to function "index" failed: cannot search an empty list. While every apply after is fine.
>
> I had this as well. This seems to indicate a wrong configuration in my experience. Do you use secure boot?
>
> Or are you getting any warnings in your log?

Your setup is actually similar to mine: I'm downloading Talos with the qemu-guest-agent extension from Image Factory.

No UEFI/secure boot, and no warnings that appear related to me other than the error from the function call above.

I see that in your setup you actually have a 5-second wait for outputs, so I'll give that a try.

Edit: That didn't work.

@Moortu

Moortu commented Oct 23, 2024

>>>> Using element(element(self.ipv4_addresses, index(self.network_interface_names, "eth0")), 0) should work - assuming that the interface inside the VM is "eth0", and not enp5s0 or similar.
>>>
>>> I gave this a shot, and found that on first apply, I'd get Call to function "index" failed: cannot search an empty list. While every apply after is fine.
>>
>> I had this as well. This seems to indicate a wrong configuration in my experience. Do you use secure boot?
>> Or are you getting any warnings in your log?
>
> Your setup is actually similar to mine. Downloading talos with the qemu-guest-agent extension from image factory.
>
> no uefi/secure boot. no warnings that appear related to me other than the error for the above function call
>
> I see in your setup, you actually have a 5-second wait for outputs, so I'll give that a try.
>
> Edit: That didn't work.

Any warning from the bpg provider or Talos will most likely cause the problem, even if it seems unrelated.

Can you share your repo?

@neutralalice

neutralalice commented Oct 23, 2024

https://github.com/neutralalice/talos-on-proxmox

I went ahead and tried all the other things listed and still get the issue on first apply.

I'm not yet at the point where nodes are joining together; this is just populating them on proxmox.

With TF_LOG=WARN, the only error I see, other than the error from the element function call in the output, is:

2024-10-23T23:39:14.737+0100 [WARN] Provider "provider[\"registry.opentofu.org/bpg/proxmox\"]" produced an unexpected new value for module.control_planes.proxmox_virtual_environment_vm.node[2], but we are tolerating it because it is using the legacy plugin SDK.

Edit: @Moortu I had another chance to look at this, and it ended up being strictly because the qemu guest agent had not yet started reporting the interfaces. By upping the output delay to 10 seconds, I would usually get the IPv4 address but not the IPv6 address. With 15-20 seconds, I get all of the IPv4 addresses and most (but not always all!) of the IPv6 addresses. 25 seconds seems to be enough for me to always get the IP addresses on first apply.

@Moortu

Moortu commented Oct 26, 2024

> https://github.com/neutralalice/talos-on-proxmox
>
> I went ahead and tried all the other things listed and still get the issue on first apply.
>
> I'm not yet at the point where nodes are joining together; this is just populating them on proxmox.
>
> With TF_LOG=WARN The only error I see other than the element output function call error is.
>
> 2024-10-23T23:39:14.737+0100 [WARN] Provider "provider[\"registry.opentofu.org/bpg/proxmox\"]" produced an unexpected new value for module.control_planes.proxmox_virtual_environment_vm.node[2], but we are tolerating it because it is using the legacy plugin SDK.
>
> Edit: @Moortu I had another chance to look at this and it ended up being strictly because the qemu guest agent had not yet started reporting out the interfaces. By upping the time delay of output to 10 seconds, I would usually get the IPv4 address, but not the ipv6 address. By increasing it to 15-20s, I get all of the ipv4 and most (but not always all!) of the ipv6 address. 25 seconds seems to be enough time for me to always get the ip addresses on first apply.

that agent timeout behaved strangely for me.
I would keep it at a few minutes at least.
default is 15m
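
As a sketch of what that looks like in a configuration, the agent block accepts a timeout alongside enabled. The "15m" value is the default mentioned in this thread; the resource name, node, and template ID are borrowed from the original report and are otherwise placeholders.

```hcl
resource "proxmox_virtual_environment_vm" "machinexyz" {
  name      = "machinexyz"
  node_name = "server01"

  clone {
    vm_id = 912
  }

  agent {
    enabled = true
    # How long the provider waits for the guest agent to report network
    # interfaces. A very low value can expire before the agent has started.
    timeout = "15m"
  }
}
```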

@neutralalice

> that agent timeout behaved strangely for me. I would keep it at a few minutes at least. default is 15m

Yeah, this does seem to be it, with no need for a sleep timer. I had the timeout really low because I used to load the guest agent via cloud-init, before Image Factory came along, so I set a really low timeout to compensate. It looks like the sleep timer resource gave just enough buffer to get a qemu response.

Moving on from that, I wonder if there is something in this function that could make waiting for the interface IP addresses more robust. As mentioned, without a sleep timer after resource creation, I only seem to get an IPv4 address and a link-local IPv6 address, but I'm actually only interested in the global IPv6 address, since the rest of my network runs IPv6 (mostly non-dual-stack).
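
On the configuration side, one way to pick out non-link-local IPv6 addresses from whatever the agent has reported is sketched below. It does not change how long the provider waits, so the list can still be empty on a first apply. The resource address proxmox_virtual_environment_vm.node (a single instance, no count) and the local and output names are hypothetical, and startswith() needs a reasonably recent Terraform or OpenTofu release.

```hcl
locals {
  # ipv6_addresses is a list of per-interface lists; flatten it into one list.
  # try() falls back to an empty list if the attribute is null.
  all_ipv6 = try(flatten(proxmox_virtual_environment_vm.node.ipv6_addresses), [])

  # Drop the loopback address (::1) and link-local addresses starting with fe80:.
  non_link_local_ipv6 = [
    for addr in local.all_ipv6 : addr
    if addr != "::1" && !startswith(lower(addr), "fe80:")
  ]
}

output "node_ipv6" {
  value = local.non_link_local_ipv6
}
```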


7 participants