
Changing a disk's datastore when HA is enabled causes various problems #507

Closed
tseeker opened this issue Aug 20, 2023 · 1 comment · Fixed by #508
Labels
🐛 bug Something isn't working

Comments

@tseeker
Contributor

tseeker commented Aug 20, 2023

Describe the bug

I have some VMs that have been created on the Proxmox cluster and added as HA resources. I am trying to modify the datastore for some disks on these VMs.

When attempting this, the plugin is tripped up by the fact that a shutdown command sent to an HA-managed VM returns immediately, before the VM has actually shut down. As a result:

  • the plugin moves on to moving the disk;
  • the Proxmox cluster's HA manager then tries to shut down the VM, which is now locked by the disk move operation;
  • after a few failed attempts, the Proxmox cluster sets the VM's HA state to error.

On some of the VMs the disk was not fully moved (a copy was created on the new datastore, but the VM's disk configuration was left unchanged); on others, there is a residual Unused disk for which no RADOS image actually exists.
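The race could be avoided by polling the VM's actual status until it reports stopped, instead of trusting the return of the shutdown call. A minimal sketch, assuming a hypothetical `get_status` callable that stands in for a Proxmox status query (e.g. `GET /nodes/{node}/qemu/{vmid}/status/current`); none of these names are the provider's actual code:

```python
import itertools
import time

def wait_for_stopped(get_status, timeout=30.0, poll_interval=0.0):
    """Poll the VM status until it reports 'stopped' or the timeout expires.

    `get_status` is a hypothetical callable standing in for a Proxmox API
    status query; for an HA-managed VM the shutdown request returns
    immediately, so only this polling confirms the VM is really down.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "stopped":
            return True
        time.sleep(poll_interval)
    return False

# Simulate an HA-managed VM: the shutdown request returned at once, but the
# VM only reaches 'stopped' after the HA manager acts on it.
statuses = itertools.chain(["running", "running", "running"],
                           itertools.repeat("stopped"))
assert wait_for_stopped(lambda: next(statuses)) is True
```

Only once this wait succeeds is it safe to start the disk move, since the move operation takes a lock that would otherwise conflict with the HA manager's own shutdown attempts.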

To Reproduce
Steps to reproduce the behavior:

  1. Create a Terraform resource for a few VMs, each with one disk on some datastore.
  2. Manually add the VMs as HA resources.
  3. Update the Terraform resource to change the disks' datastore.
  4. Run Terraform.
  5. Watch the fireworks on the Proxmox cluster.

Expected behavior
The VMs should shut down, the disks should be moved, and the VMs should then restart.
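The expected sequence can be sketched as an orchestration where each step blocks until the previous state change is confirmed. This is purely illustrative (the `vm` object, its methods, and `FakeVM` are hypothetical stand-ins, not the provider's API):

```python
def relocate_disk(vm, target_datastore):
    """Sketch of the expected sequence: stop, move, restart.

    `vm` is a hypothetical wrapper around Proxmox API calls; the key point
    is that the disk move only starts after the stopped state is confirmed,
    rather than trusting that the shutdown call implies a stopped VM.
    """
    vm.shutdown()
    if not vm.wait_for_status("stopped"):   # crucial for HA-managed VMs
        raise TimeoutError("VM did not stop in time")
    vm.move_disk(target_datastore)          # VM is now safely stopped
    vm.start()

class FakeVM:
    """Records the call order so the sequencing can be checked offline."""
    def __init__(self):
        self.log = []
    def shutdown(self):
        self.log.append("shutdown")
    def wait_for_status(self, status):
        # Pretend the HA manager eventually brings the VM to this state.
        self.log.append(f"wait:{status}")
        return True
    def move_disk(self, datastore):
        self.log.append(f"move:{datastore}")
    def start(self):
        self.log.append("start")

vm = FakeVM()
relocate_disk(vm, "ceph-pool-2")
assert vm.log == ["shutdown", "wait:stopped", "move:ceph-pool-2", "start"]
```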

Screenshot

[screenshot: lock-failure]

The third line here is unrelated, but the other lines clearly show the problem.

Additional context

  • The issue doesn't occur every time. Running the operation on multiple VMs in parallel increases the chances of triggering it.
  • Some of the additional mess with the disk's configuration may be my fault, as I tried to set the HA state to disabled while the operation was still running.
  • I am using the latest version plus my own HA-support branch, but I also reproduced the issue with a manually managed HA resource.
  • My datastores are all Ceph-based, although I don't believe this has any influence.
@tseeker
Contributor Author

tseeker commented Aug 20, 2023

I can confirm that the subsequent disk misconfiguration was caused by my attempts at disabling the HA resource. If left untouched, the plugin exits with an error when trying to start the VM, as the HA resource's error state prevents the start.
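Since an HA resource stuck in the error state makes the start fail anyway, the provider could fail fast with a clearer message. A hypothetical guard (the function name and message are illustrative, not provider code):

```python
def ensure_startable(ha_state):
    """Refuse to start a VM whose HA resource is in the 'error' state,
    since the Proxmox cluster would reject the start request anyway.

    `ha_state` is assumed to come from a prior HA resource status query.
    """
    if ha_state == "error":
        raise RuntimeError(
            "HA resource is in the 'error' state; clear it before "
            "attempting to start the VM"
        )

ensure_startable("started")  # a healthy state passes silently
try:
    ensure_startable("error")
except RuntimeError as exc:
    print("refused:", exc)
```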
