
Bunch of (possible) problems with qemu-guest-agent #669

Closed
ratiborusx opened this issue Oct 30, 2023 · 4 comments · Fixed by #670
Labels
🐛 bug Something isn't working

Comments

@ratiborusx

Describe the bug

  1. If the agent is not installed and "agent.enabled = true" is set, VM resource creation takes noticeably longer for VMs using such settings. It looks like the provider tries to get a response from the agent and hangs for the "agent.timeout" value.
  2. The same thing happens on any plan update (terraform plan/apply).

Here I created a VM ("debian-test-vm") with "agent.enabled = true", "agent.timeout = 30s", "reboot = false" and no agent actually installed inside the guest.
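For context, the relevant part of the VM resource looked roughly like this (a minimal sketch reconstructed from the settings above; everything else is omitted, the full resource is posted later in this thread):

resource "proxmox_virtual_environment_vm" "px_vm" {
  # ...name, clone, cpu, memory, disk, initialization, etc. omitted...

  reboot = false

  agent {
    enabled = true    # agent expected by Proxmox and the provider, but not actually installed in the guest
    trim    = true
    timeout = "30s"   # plan/apply stalls for roughly this long per affected VM
  }
}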
Then I just ran plan generation, which hung for that value (plus some overhead, I guess):

ratiborus@HOMEWORLD:~/WORKSPACE/terraform_yandex/proxmox$ time terraform plan
module.proxmox.data.proxmox_virtual_environment_vms.template_vms: Reading...
module.proxmox.data.proxmox_virtual_environment_nodes.cluster_nodes: Reading...
module.proxmox.proxmox_virtual_environment_pool.templates: Refreshing state... [id=templates]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["astra-1.7.4-base"]: Refreshing state... [id=local:iso/astra-1.7.4-base.img]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["debian-12"]: Refreshing state... [id=local:iso/debian-12.img]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["ubuntu-22.04"]: Refreshing state... [id=local:iso/ubuntu-22.04.img]
module.proxmox.data.proxmox_virtual_environment_nodes.cluster_nodes: Read complete after 0s [id=nodes]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv1"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv3"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv2"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.data.proxmox_virtual_environment_vms.template_vms: Read complete after 0s [id=d4042208-f38b-480e-a15e-97bee8a9829c]
module.proxmox.proxmox_virtual_environment_vm.px_template["ubuntu-22.04"]: Refreshing state... [id=1011]
module.proxmox.proxmox_virtual_environment_vm.px_template["debian-12"]: Refreshing state... [id=1021]
module.proxmox.proxmox_virtual_environment_vm.px_template["astra-1.7.4-base"]: Refreshing state... [id=1001]
module.proxmox.proxmox_virtual_environment_vm.px_vm["debian-test-vm"]: Refreshing state... [id=104]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

real    0m35.491s
user    0m0.658s
sys     0m0.051s

Here's a new plan output, now with the QEMU guest agent disabled ("agent.enabled = false") and the same other settings:

ratiborus@HOMEWORLD:~/WORKSPACE/terraform_yandex/proxmox$ time terraform plan
module.proxmox.data.proxmox_virtual_environment_nodes.cluster_nodes: Reading...
module.proxmox.data.proxmox_virtual_environment_vms.template_vms: Reading...
module.proxmox.proxmox_virtual_environment_pool.templates: Refreshing state... [id=templates]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["astra-1.7.4-base"]: Refreshing state... [id=local:iso/astra-1.7.4-base.img]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["debian-12"]: Refreshing state... [id=local:iso/debian-12.img]
module.proxmox.proxmox_virtual_environment_file.px_cloud_image["ubuntu-22.04"]: Refreshing state... [id=local:iso/ubuntu-22.04.img]
module.proxmox.data.proxmox_virtual_environment_nodes.cluster_nodes: Read complete after 0s [id=nodes]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv3"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv1"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.proxmox_virtual_environment_file.cloud_config_userdata_raw_file["prox-srv2"]: Refreshing state... [id=local:snippets/userdata-proxmox-raw-file.yml]
module.proxmox.data.proxmox_virtual_environment_vms.template_vms: Read complete after 0s [id=2fe38a78-765e-444e-b75e-de6e2537c556]
module.proxmox.proxmox_virtual_environment_vm.px_template["ubuntu-22.04"]: Refreshing state... [id=1011]
module.proxmox.proxmox_virtual_environment_vm.px_template["debian-12"]: Refreshing state... [id=1021]
module.proxmox.proxmox_virtual_environment_vm.px_template["astra-1.7.4-base"]: Refreshing state... [id=1001]
module.proxmox.proxmox_virtual_environment_vm.px_vm["debian-test-vm"]: Refreshing state... [id=104]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

real    0m1.148s
user    0m0.539s
sys     0m0.089s
  3. Sometimes I get such freezes for the "data.proxmox_virtual_environment_vms" data source even if it's empty or does not contain the affected VMs (filtered out by tags). I think this data source first "reads" all VMs and then filters them by the listed tags, and when it tries to access the affected VMs it hangs for the "agent.timeout" value. I could not reproduce it today though, so maybe it was my post-midnight madness...
  4. From what I read and tested (both with the provider and by manually cloning my templates via Proxmox's GUI), it looks like the Shutdown and Reboot functionality as a whole is provided by the QEMU Guest Agent, and if it's not installed in the guest VM then Proxmox's Shutdown/Reboot task just hangs for some time until it gets killed. I may be wrong about that behavior to some extent, because that is what I saw yesterday, but today I could not reproduce it with manually cloned VMs using the Reboot/Shutdown buttons in the GUI - it worked. But I still had these problems with resources created via the provider.
    Here's an example where I tried to create 3 VMs from scratch (not tainted already-existing resources) with "agent.enabled = true", "agent.timeout = 30s" and "reboot = true" (according to Proxmox's GUI the process got stuck at the Reboot task):
module.proxmox.proxmox_virtual_environment_vm.px_vm["ubuntu-test-vm"]: Still creating... [20m1s elapsed]
module.proxmox.proxmox_virtual_environment_vm.px_vm["astra-test-vm"]: Still creating... [20m1s elapsed]
module.proxmox.proxmox_virtual_environment_vm.px_vm["debian-test-vm"]: Still creating... [20m1s elapsed]
╷
│ Error: error waiting for VM reboot: context error while waiting for task "UPID:prox-srv1:00285BC0:054CCE16:653FF494:qmreboot:105:root@pam:" to complete: context deadline exceeded
│
│   with module.proxmox.proxmox_virtual_environment_vm.px_vm["astra-test-vm"],
│   on ../modules/proxmox-base/main.tf line 22, in resource "proxmox_virtual_environment_vm" "px_vm":
│   22: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: error waiting for VM reboot: context error while waiting for task "UPID:prox-srv1:00285B7E:054CCC0E:653FF48F:qmreboot:104:root@pam:" to complete: context deadline exceeded
│
│   with module.proxmox.proxmox_virtual_environment_vm.px_vm["ubuntu-test-vm"],
│   on ../modules/proxmox-base/main.tf line 22, in resource "proxmox_virtual_environment_vm" "px_vm":
│   22: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵
╷
│ Error: error waiting for VM reboot: context error while waiting for task "UPID:prox-srv1:00285BC1:054CCE1C:653FF494:qmreboot:106:root@pam:" to complete: context deadline exceeded
│
│   with module.proxmox.proxmox_virtual_environment_vm.px_vm["debian-test-vm"],
│   on ../modules/proxmox-base/main.tf line 22, in resource "proxmox_virtual_environment_vm" "px_vm":
│   22: resource "proxmox_virtual_environment_vm" "px_vm" {
│
╵

Expected behavior

  • "agent.timeout" should not slow down terraform plan/apply in case "agent.enabled = true" and agent is not installed inside guest VM.
  • This timeout is a major headache for me and i do not see any value in it - can we even somehow make it useful, what's it purpose? I think about just setting it to "0s" or "1s" and that's it.
  • Maybe there should be some kind of arguments compatibility check and if incompatible one's are selected then error should be given? Or maybe if "agent.enabled = false" then instead of Reboot/Shutdown provider should use only Stop/Start? Obviously it won't help in case of agent not being installed inside guest machine but that's up to user to know. Or maybe we should just mention that in documentation but it's still not very clear to me.

Additional context

  • Single or clustered Proxmox: 3 node cluster with local storage, PVE 8.0.3
  • Provider version (ideally it should be the latest version): 0.36.0
  • Terraform version: 1.5.7
  • OS (where you run Terraform from): Ubuntu 22.04 (WSL2)
  • If necessary I could repeat some actions and provide parts of the actual configuration, but I'm talking here about 2-3 arguments, so it should be easily reproducible
@ratiborusx ratiborusx added the 🐛 bug Something isn't working label Oct 30, 2023
@otopetrik
Contributor

1. In case agent is not installed and 'agent.enabled = true' i get increased time for VM resource creation which uses such settings. It looks like provider tries to receive a response from an agent and hangs for "agent.timeout" value.

That is correct, and intentional. The provider waits for the guest agent to provide the VM's IP addresses, which are exported via the ipv4_addresses attribute and others.

By setting agent.enabled = true, the user tells Proxmox (and the provider) to expect the agent to be running inside the VM. There is no easy way to distinguish between "the VM has not yet finished booting and has not yet started the agent" and "the agent is not even installed in the VM". The provider has to rely on the user to tell it the truth.

2. Same things would happen on any plan update (terraform plan/apply command).

Also correct. Other resources can depend on those IP addresses. Terraform reads the resource state to determine whether its attributes have changed (and whether it is necessary to modify the configuration of dependent resources).
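For example (a hypothetical sketch reusing the reporter's resource names, placed next to the VM resource), an output or another resource can consume the agent-reported addresses, which is why the provider re-queries the agent during refresh:

output "debian_test_vm_ipv4_addresses" {
  # lists of IPv4 addresses per network interface, as reported by the QEMU guest agent
  value = proxmox_virtual_environment_vm.px_vm["debian-test-vm"].ipv4_addresses
}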

3. Sometimes i get such freezes for "data.proxmox_virtual_environment_vms" resource even if it's empty or does not contains affected VMs (filtered out by tags). I think that this data source first "reads" all VMs and then filters them out by listed tags and when it tries to access affected VMs it hangs for "agent.timeout" value. I could not reproduce it today though, so maybe it was my post-midnight madness...

It looks like the provider does only one API call to get the list of all nodes, and then one API call per Proxmox node to get the list of all its VMs, details here. Maybe some VM is locked by another action (a Shutdown/Reboot waiting to time out...), and Proxmox takes a while to respond with the list of all VMs?
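For reference, a tag-filtered lookup with this data source looks roughly like this (a sketch based on my reading of the provider documentation; the tag value is illustrative):

data "proxmox_virtual_environment_vms" "template_vms" {
  # only VMs carrying this tag end up in the result; the listing itself still
  # enumerates all VMs on all nodes before filtering
  tags = ["template"]
}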

4. From what i read and tested (both with provider and manually doing clones of my templates via Proxmox's GUI) it looks like Shutdown and Reboot functionality as a whole provided by Qemu Guest Agent and if its not installed in the guest VM then Proxmox's Shutdown/Reboot task just hangs for some time until it gets killed. I may be wrong about that behavior to some extent because yesterday that was what i saw but today i could not reproduce it with manually cloned VMs and using Reboot/Shutdown buttons in GUI - it worked. But i still had these problems with resources created via provider.

Proxmox behavior depends on the agent.enabled value:
If the VM is configured with the agent disabled, then Shutdown and Reboot are handled using ACPI events (the guest VM performs a clean shutdown or reboot). They work with agent.enabled = false (unless the guest OS is specifically configured to ignore ACPI events).
If the VM is configured with the agent enabled, then Proxmox tells the agent to shut down or reboot from within the VM. If the agent is not running, Proxmox keeps the VM 'locked' until the operation times out. As long as the VM is locked, attempting other actions (like Stop or Reset) does not work; they fail with a 'VM locked' error.
(This might be considered a bug in Proxmox - the shutdown command sent to the agent does not return a confirmation, but a better behavior might be to try a ping first and, depending on the result, use either the guest agent or ACPI for the shutdown itself.)

If it is not possible to log into the VM to shut it down cleanly (e.g. because of errors in the cloud-init configuration), then there is the option to use the 'Monitor' tab of the VM details and run the command 'quit', which will forcibly stop the selected VM (with all the data safety of pulling the power cord from a computer).

There are additional advantages to running the qemu-guest-agent - see Proxmox docs1 and docs2 - but the main purpose of using the guest agent with the Terraform provider is access to the IP addresses assigned to the VM, which can then be used by other Terraform resources.

If the agent is not installed inside the guest, then agent.enabled must stay false (the default).
With the possible exception of a VM whose cloud-init configuration will install and start the guest agent.

A VM with agent.enabled = false works normally, but it will not provide its IP addresses for use by dependent resources, and it will not perform a filesystem sync before backup. The Shutdown/Reboot buttons in Proxmox will still perform an orderly shutdown.

Expected behavior

* "agent.timeout" should not slow down terraform plan/apply in case "agent.enabled = true" and agent is not installed inside guest VM.

There just is not a way to distinguish a not-installed agent from a not-yet-started agent. The user just has to tell the provider the truth, or suffer the consequences.

* This timeout is a major headache for me and i do not see any value in it - can we even somehow make it useful, what's it purpose? I think about just setting it to "0s" or "1s" and that's it.

The rather large default timeout is there because it is possible for a VM with the agent installed to take a rather long time to boot (and start the agent). Consider a VM that performs a long disk check on boot (because of an unclean shutdown); with spinning HDDs it could take 10 minutes before the disk check finishes and the guest agent finally starts.
If the user has not installed the agent, then with agent.enabled = false (which is actually the default!) the provider skips asking the agent for IP addresses.
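In other words, when the guest image has no agent, the safe configuration is simply to leave the agent disabled. A minimal sketch (the resource name is illustrative; omitting the agent block entirely has the same effect, since enabled defaults to false):

resource "proxmox_virtual_environment_vm" "example" {
  # ...other arguments as usual...

  agent {
    enabled = false # the default; the provider will not wait for agent-reported IP addresses
  }
}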

* Maybe there should be some kind of arguments compatibility check and if incompatible one's are selected then error should be given? Or maybe if "agent.enabled = false" then instead of Reboot/Shutdown provider should use only Stop/Start? Obviously it won't help in case of agent not being installed inside guest machine but that's up to user to know. Or maybe we should just mention that in documentation but it's still not very clear to me.

Is there any compatibility issue with 'reboot = true' when the agent.enabled setting matches the state of the guest agent running inside the VM?
(I do not remember ever creating a VM with 'reboot = true', so I really have no idea)

If the VM has agent.enabled = true (same as in the Proxmox GUI: Options - QEMU Guest Agent - Enabled / Use QEMU Guest Agent), then the "Shutdown" and "Reboot" operations are performed using the agent. This is true for both the Proxmox GUI and the provider. (If there is no agent running in the VM, the operation will time out...)

If the VM has agent.enabled = false, then "Shutdown" and "Reboot" are done using ACPI (and the guest agent does not have to be installed in the VM).

Using "Stop" instead of "Shutdown" (or "Reset" instead of "Reboot") are not reasonable substitutes. Guest operating system would not have the option to cleanly shutdown, and data loss would be almost certain.

@ratiborusx
Author

ratiborusx commented Oct 31, 2023

@otopetrik
Thank you for this very detailed explanation, I think it resolves most of my misunderstandings on the issue!

That is correct, and intentional. Provider waits for guest agent to provide VM IP addresses, which are exported using ipv4_addresses attribute and others.

Yeah, I completely forgot about the possibility to retrieve the IP addresses of the VMs, even though I planned to use that functionality myself! It just hasn't made it into my module yet, and that's probably why I missed the point here.

If agent is not installed inside the guest, then agent.enabled must stay false. (the default)
With the possible exception of VM with cloud-init configuration which will install and start the guest agent.

I thought so. That's basically how I deploy qemu-guest-agent - by using cloud-init. Only one of my cloud images has the agent preinstalled. I was testing out some people's requests that sometimes they would not want cloud-init to be employed.
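Roughly, the idea is a user-data snippet like the following (a simplified, illustrative sketch rather than my exact file; the resource name, datastore and node are placeholders):

resource "proxmox_virtual_environment_file" "cloud_config_userdata_agent" {
  content_type = "snippets"
  datastore_id = "local"
  node_name    = "prox-srv1"

  source_raw {
    file_name = "userdata-install-agent.yml"
    data      = <<-EOF
      #cloud-config
      packages:
        - qemu-guest-agent
      runcmd:
        # make sure the agent is running right after cloud-init installs it
        - systemctl enable --now qemu-guest-agent
    EOF
  }
}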

Is there any compatibility issue with 'reboot = true' when the agent.enabled setting matches the state of the guest agent running inside the VM?
(I do not remember ever creating a VM with 'reboot = true', so I really have no idea).

There's no issue when it's running and enabled. I still have issues when the agent is not running inside the guest, I disable it in the resource's arguments, and I use "reboot = true". I just tested it right now to be sure. If I understood correctly what you said, in such a configuration (agent not installed and disabled) reboot should still be possible.

If VM has agent.enabled = false then "Shutdown" and "Reboot" are done using ACPI (and guest agent does not have to be installed in the VM).

(screenshots attached)

I use that reboot option to make sure that all cloud-init provisioning is done completely and correctly and that the deployed guest VM won't have any issues after a reboot. Better to know sooner rather than later. Also, Ubuntu for example, even with relatively fresh images, installs plenty of updates and usually requests a reboot, which then stays pending in the system.
Thanks again for that thorough explanation, it makes most if not all things concerning this issue clear to me!

@bpg I'm sorry for the inconvenience if my comment reopens the issue and I'm unsuccessful in closing it again.

@ratiborusx
Author

@otopetrik
I misread your comment about the "reboot" option, so I updated my answer. I still get issues with it if the agent is not present in the guest VM. Maybe you could point out what's wrong on my side?
Here's my "basic vm" resource:

resource "proxmox_virtual_environment_vm" "px_vm" {
  for_each    = var.config_px_vm

  name        = each.key
  description = each.value.description
  tags        = sort(concat(each.value.tags, ["terraform"]))
  pool_id     = each.value.pool_id

  node_name     = each.value.node_name
  vm_id         = each.value.vm_id
  migrate       = each.value.migrate
  on_boot       = each.value.on_boot
  started       = each.value.started
  reboot        = each.value.reboot
  scsi_hardware = each.value.scsi_hardware
  boot_order    = each.value.boot_order

  agent {
    enabled = each.value.agent_enabled
    trim    = true
    timeout = "30s"
  }

  clone {
    node_name    = each.value.clone_node_name
    retries      = 3
    vm_id        = proxmox_virtual_environment_vm.px_template[each.value.clone_vm_id].id
  }

  cpu {
    architecture = each.value.cpu_arch
    type         = each.value.cpu_type
    cores        = each.value.cores
    sockets      = each.value.sockets
    numa         = each.value.numa
  }

  memory {
    dedicated    = each.value.memory
    floating     = each.value.memory
  }

  disk {
    datastore_id = "vzdata"
    discard      = "on"
    file_format  = "qcow2"
    interface    = "scsi0"
    iothread     = true
    size         = 50
    ssd          = true
  }

  initialization {
    datastore_id = "vzdata"
    interface    = "ide2"
    ip_config {
      ipv4 {
        address  = each.value.ipv4_address
        gateway  = each.value.ipv4_address == "dhcp" ? null : each.value.ipv4_gateway
      }
    }
    user_data_file_id = each.value.cicustom_userdata == false ? null : proxmox_virtual_environment_file.cloud_config_userdata_raw_file[each.value.clone_node_name].id
  }

  network_device {
    bridge = "vmbr0"
  }

  operating_system {
    type = each.value.os_type
  }

  serial_device {
    device = "socket"
  }
  # lifecycle {
  #   ignore_changes = [
  #     initialization[0].user_account,
  #     initialization[0].user_data_file_id
  #   ]
  # }
}

And here's a map of values I'm iterating over:

locals {
  config_px_vm = {
    "astra-test-vm" = {
      description       = "Managed by Terraform"
      tags              = ["astra-1.7.4-base"]
      node_name         = "prox-srv1"
      migrate           = true
      on_boot           = true
      started           = true
      reboot            = true
      cicustom_userdata = false
      agent_enabled     = false
      scsi_hardware     = "virtio-scsi-single"
      boot_order        = ["scsi0"]
      clone_node_name   = "prox-srv1"
      clone_vm_id       = "astra-1.7.4-base"
      cpu_arch          = "x86_64"
      cpu_type          = "host"
      cores             = 1
      sockets           = 2
      numa              = true
      memory            = 4096
      ipv4_address      = "10.177.144.224/24"
      ipv4_gateway      = "10.177.144.254"
      os_type = "l26"
    }
    "ubuntu-test-vm" = {
      description       = "Managed by Terraform"
      tags              = ["ubuntu-22.04"]
      node_name         = "prox-srv1"
      migrate           = true
      on_boot           = true
      started           = true
      reboot            = true
      cicustom_userdata = false
      agent_enabled     = false
      scsi_hardware     = "virtio-scsi-single"
      boot_order        = ["scsi0"]
      clone_node_name   = "prox-srv1"
      clone_vm_id       = "ubuntu-22.04"
      cpu_arch          = "x86_64"
      cpu_type          = "host"
      cores             = 1
      sockets           = 2
      numa              = true
      memory            = 4096
      ipv4_address      = "dhcp"
      os_type           = "l26"
    }
    "debian-test-vm" = {
      description       = "Managed by Terraform"
      tags              = ["debian-12"]
      node_name         = "prox-srv1"
      migrate           = true
      on_boot           = true
      started           = true
      reboot            = true
      cicustom_userdata = false
      agent_enabled     = false
      scsi_hardware     = "virtio-scsi-single"
      boot_order        = ["scsi0"]
      clone_node_name   = "prox-srv1"
      clone_vm_id       = "debian-12"
      cpu_arch          = "x86_64"
      cpu_type          = "host"
      cores             = 1
      sockets           = 2
      numa              = true
      memory            = 4096
      ipv4_address      = "dhcp"
      os_type           = "l26"
    }
  }
}

Resulting in:

module.proxmox.proxmox_virtual_environment_vm.px_vm["ubuntu-test-vm"]: Still creating... [3m40s elapsed]
module.proxmox.proxmox_virtual_environment_vm.px_vm["astra-test-vm"]: Still creating... [3m40s elapsed]
module.proxmox.proxmox_virtual_environment_vm.px_vm["debian-test-vm"]: Still creating... [3m40s elapsed]

(screenshot attached)

@otopetrik
Contributor

Thanks for exploring the reboot behavior!
I have not used the reboot option, but looking at the sources, it causes the provider to trigger the VM reboot very soon after starting the VM, see here.
The 'vmStart' call seems to return once Proxmox shows the VM in the "running" state, which just means that qemu is running. The guest can be in a very early boot stage - e.g. in the bootloader or just starting the Linux kernel.

If the VM is configured with agent.enabled = true, there is no agent running this early to receive the request, so the reboot request is missed.
If the VM is configured with agent.enabled = false, it is possible that the OS or kernel is not yet in a state to receive the ACPI event, and the reboot request is missed.

In any case, I see no reason to attempt to reboot the VM this early in the boot process, and I have no real idea about the intended use case.
The only possibility I can think of is causing a reboot after cloning a running VM including its RAM contents, which seems very scary (and hopefully not even possible).

To reboot VM after cloud-init, consider using package_reboot_if_required or power_state.

Reboot from inside the VM can ensure that all the configuration is actually applied.
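A cloud-init user-data snippet along these lines, delivered through the provider's snippet file resource, makes the guest reboot itself once cloud-init has finished. This is only an illustrative sketch (resource name, node and datastore are placeholders), not a tested configuration:

resource "proxmox_virtual_environment_file" "userdata_with_reboot" {
  content_type = "snippets"
  datastore_id = "local"
  node_name    = "prox-srv1"

  source_raw {
    file_name = "userdata-reboot-example.yml"
    data      = <<-EOF
      #cloud-config
      package_update: true
      # lighter alternative: package_reboot_if_required: true
      # (reboots only when an installed update asks for it)
      power_state:
        mode: reboot
        message: rebooting after cloud-init provisioning
        condition: true
    EOF
  }
}

The resulting file id would then be passed to initialization.user_data_file_id, as in the resource posted earlier in this thread.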

It does not seem like the reboot behavior can even be fixed; at best it would be really fragile:

Making it (somewhat) work with agent.enabled = false would likely require adding a delay_before_reboot option to specify a delay in seconds before attempting the reboot. But with no way to be sure that the VM has actually finished configuring, the reboot could happen during VM configuration.

Making it work with agent.enabled = true could possibly be implemented by repeatedly pinging the agent and triggering the reboot only once the agent starts responding to pings. Some Linux distributions start the agent immediately as part of the qemu-guest-agent package installation (i.e. early in the cloud-init process), while others require the user to start it (e.g. using runcmd late in the cloud-init process). So a running agent is not a guarantee that the VM has actually finished configuring; the reboot could still happen during VM configuration (before the user's runcmd commands have finished).

Without knowing the original purpose of the reboot option, it might not be a good idea to remove it outright.
But marking it deprecated in the documentation could be a good start.

If you think the reboot option should behave differently (or be removed or deprecated), please create a separate issue, or even better a pull request.
