
Starting a VM without qemu-guest-agent installed sits on 'still creating' #449

Closed
si458 opened this issue Jul 30, 2023 · 8 comments
Labels
🐛 bug Something isn't working

Comments

@si458

si458 commented Jul 30, 2023

Describe the bug
If you create a VM with an ISO image instead of cloud-init, the VM sits on 'still creating' even though it's started.

To Reproduce
Steps to reproduce the behavior:

  1. Create a VM with an ISO image instead of cloud-init.

Minimal Terraform configuration that reproduces the issue:

resource "proxmox_virtual_environment_vm" "talos" {
  count = 3
  agent { enabled = true }
  bios       = "ovmf"
  boot_order = ["scsi0", "ide2"]
  cdrom {
    enabled   = true
    file_id   = "ISOImages:iso/talos-amd64-v1.4.6.iso"
    interface = "ide2"
  }
  cpu {
    cores = 4
    type  = "x86-64-v2-AES"
  }
  disk {
    datastore_id = "local-zfs"
    discard      = "on"
    file_format  = "raw"
    interface    = "scsi0"
    iothread     = true
    size         = 32
    cache        = "writeback"
  }
  disk {
    datastore_id = "local-zfs"
    discard      = "on"
    file_format  = "raw"
    interface    = "scsi1"
    iothread     = true
    size         = 128
    cache        = "writeback"
  }
  efi_disk {
    datastore_id = "local-zfs"
    file_format  = "raw"
    type         = "4m"
  }
  machine = "q35"
  memory {
    dedicated = 8192
    floating  = 8192
  }
  name = "talos-${count.index + 1}"
  network_device {
    bridge  = "vmbr168"
    vlan_id = 251
  }
  node_name = "pve1"
  on_boot   = false
  operating_system { type = "l26" }
  serial_device {}
  scsi_hardware = "virtio-scsi-single"

}

Expected behavior
The VM is started, and then Terraform moves on to the next step.


  • Provider version (ideally it should be the latest version): 0.27.0
  • Terraform version: 1.5.4
  • OS (where you run Terraform from): Windows
  • Debug logs (TF_LOG=DEBUG terraform apply):
si458 added the 🐛 bug label Jul 30, 2023
@si458
Author

si458 commented Jul 30, 2023

Setting $env:TF_LOG="debug" reveals that the provider is checking whether the guest agent replies to network-get-interfaces.
This always returns 500 because the guest agent isn't installed.
A better way to check whether the VM is running is the status/current API:
https://pve.proxmox.com/pve-docs/api-viewer/index.html#/nodes/{node}/qemu/{vmid}/status/current

EDIT: maybe add an option to say "don't check the status of the VM with qemu-guest-agent, use the VM status/current API instead"?
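For comparison, status/current reports whether QEMU considers the VM running, independent of any guest agent. Below is a minimal Go sketch of interpreting its response. It assumes the usual Proxmox JSON envelope with a top-level "data" object and looks only at its "status" field; the type and function names are hypothetical, and fetching and authentication are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statusCurrentResponse models only the part of the
// GET /nodes/{node}/qemu/{vmid}/status/current response used here.
// Proxmox wraps API payloads in a top-level "data" object.
type statusCurrentResponse struct {
	Data struct {
		Status string `json:"status"` // e.g. "running" or "stopped"
	} `json:"data"`
}

// isVMRunning reports whether a status/current response body says the VM
// is running. It does not depend on qemu-guest-agent inside the guest.
func isVMRunning(body []byte) (bool, error) {
	var resp statusCurrentResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decoding status/current response: %w", err)
	}
	return resp.Data.Status == "running", nil
}

func main() {
	// Trimmed example payload with made-up values.
	body := []byte(`{"data":{"status":"running","vmid":100}}`)
	running, err := isVMRunning(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("running:", running)
}
```

Note that "running" here only means the QEMU process is up; it says nothing about whether the guest OS has finished booting or obtained an IP address.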

@si458
Author

si458 commented Jul 31, 2023

Found the issue, I think?
You are checking for network interfaces because I have the agent set and enabled, BUT this doesn't mean the guest agent is actually running!

if started {
	if vmConfig.Agent != nil && vmConfig.Agent.Enabled != nil && *vmConfig.Agent.Enabled {
		resource := VM()
		// ... (rest of the provider code not quoted)
	}
}

You can check whether qemu-guest-agent is actually running by calling https://IP:8006/api2/json/nodes/{NODE}/qemu/{ID}/agent/info, which returns 200 if it is running or 500 if it is not.

@otopetrik
Contributor

> Expected behavior
> the VM to be started and then move onto next step

For VM creation to continue without waiting for qemu-guest-agent (when no qemu-guest-agent is actually installed), just set agent { enabled = false }.

It is generally a bad idea to configure a VM with the guest agent enabled and not actually install and enable it inside the VM.

> you are checking for network interfaces because i have agent set and enabled, BUT this doesnt mean the guest agent is actually running!

Observed Proxmox behavior for a VM with agent { enabled = true } suggests that Proxmox expects the guest agent to be running during normal operation (i.e. configured to start at boot).

If the guest agent is enabled in the VM configuration, Proxmox uses the guest agent to cleanly shut down the VM instead of the default method (ACPI?).

Shutting down such a VM from Proxmox is a bit tricky:

  • The "shutdown" command in the Proxmox UI uses the guest agent (and takes a lock on the VM in Proxmox), but there is no guest agent inside the VM listening, so the VM has no idea it should shut down and just keeps running.
  • Following it with an attempted unclean shutdown using "stop" in the Proxmox UI fails, because "stop" cannot get the lock ("shutdown" still holds it).

The remaining option is to use the Monitor card in the Proxmox UI to terminate the VM directly through the QEMU monitor interface.

The only case where setting agent { enabled = true } for a VM without a guest agent might make sense is a VM cloned from a template that uses cloud-init configuration to install the guest agent.
And even then, debugging the cloud-init configuration requires using "stop", not "shutdown", to avoid the shutdown issue above.

> you can check if qemu-guest-agent is actually running by calling https://IP:8006/api2/json/nodes/{NODE}/qemu/{ID}/agent/info which will return 200 if running or 500 if not running

There does not seem to be any point in time at which a "not running" response can be read as "qemu-guest-agent is not installed and will never be started, so just continue".

A VM with qemu-guest-agent installed (and enabled) can take a long time after boot before qemu-guest-agent actually starts (e.g. fsck takes a long time to check the disk, or cloud-init is installing many packages before installing qemu-guest-agent).

Currently this provider waits for (at least) one "usable" IP address.
It is sometimes useful to configure the VM to delay starting qemu-guest-agent inside the VM until all expected IP addresses are obtained (e.g. both one usable IPv4 and one usable IPv6 address). That ensures both IPv4 and IPv6 addresses are available for use in further Terraform resources (see the attributes ipv4_addresses and ipv6_addresses of proxmox_virtual_environment_vm).

I have no experience with Talos Linux, but it looks like using the nocloud image for the template VM should allow configuring the "Talos machine config" YAML file via cloud-init (which can be generated in Terraform) in cloned VMs.
Including the qemu-guest-agent extension there might then be similar to installing qemu-guest-agent via cloud-init on a classic Linux distribution.

In that case the provider would still wait until qemu-guest-agent is started, but qemu-guest-agent should get installed and started automatically.

@si458
Author

si458 commented Jul 31, 2023

The issue with Talos is that qemu-guest-agent isn't included in the OS and won't be included even in their new 1.5 release (I know, stupid 🤦).
So you have to enable the extension to get it, which can only happen once the VM is up and running the installer and the config for that VM has been applied.

I will have a look at the nocloud image though, as I've never seen that option before. Thanks @otopetrik!

I have always used the ISO installer and waited for the VM's IP, which I statically assign, to reply to ping before continuing with the setup, etc.

@bpg
Owner

bpg commented Aug 2, 2023

@otopetrik thanks for the detailed response!

@si458 I think we can close this ticket as "not a bug"?

@si458
Author

si458 commented Aug 2, 2023

@bpg it can be closed as 'not a bug', but I will open another issue for something I came across when using @otopetrik's suggestion of smbios, as that is a feature request!

bpg closed this as not planned Aug 2, 2023
@mritalian
Contributor

Sorry, to be clear: it's expected that the provider will hang if you set agent = true but no agent is running? Just want to confirm

@bpg
Owner

bpg commented Sep 20, 2023

@mritalian Yes, this is the current behaviour. The provider polls the agent endpoint until the agent starts responding. I believe it times out after 20 minutes by default.


4 participants