
Changing a disk's datastore when HA is enabled causes various problems #507

Closed
tseeker opened this issue Aug 20, 2023 · 1 comment · Fixed by #508
Labels
🐛 bug Something isn't working

Comments

@tseeker
Contributor

tseeker commented Aug 20, 2023

Describe the bug

I have some VMs that have been created on the Proxmox cluster and added as HA resources. I am trying to modify the datastore for some disks on these VMs.

When attempting this, the plugin is tripped up by the fact that a shutdown command sent to an HA-managed VM returns immediately, before the VM has actually shut down. As a result:

  • the plugin moves on to moving the disk;
  • the Proxmox cluster's HA manager then tries to shut down the VM, which is now locked by the disk move operation;
  • after a few failed attempts, the Proxmox cluster sets the VM's HA state to error.

On some of the VMs the disk was not fully moved (a copy was created on the new datastore, but the VM's disk configuration was left unchanged); on others, there is a residual Unused disk for which no RADOS image actually exists.
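The race could be avoided by polling the VM's actual status until it reports stopped, instead of trusting the return of the shutdown call. A minimal sketch, assuming a hypothetical `get_status` callable that stands in for a Proxmox status query (e.g. `GET /nodes/{node}/qemu/{vmid}/status/current`); none of these names are the provider's actual code:

```python
import itertools
import time

def wait_for_stopped(get_status, timeout=30.0, poll_interval=0.0):
    """Poll the VM status until it reports 'stopped' or the timeout expires.

    `get_status` is a hypothetical callable standing in for a Proxmox API
    status query; for an HA-managed VM the shutdown request returns
    immediately, so only this polling confirms the VM is really down.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "stopped":
            return True
        time.sleep(poll_interval)
    return False

# Simulate an HA-managed VM: the shutdown request returned at once, but the
# VM only reaches 'stopped' after the HA manager acts on it.
statuses = itertools.chain(["running", "running", "running"],
                           itertools.repeat("stopped"))
assert wait_for_stopped(lambda: next(statuses)) is True
```

Only once this wait succeeds is it safe to start the disk move, since the move operation takes a lock that would otherwise conflict with the HA manager's own shutdown attempts.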

To Reproduce
Steps to reproduce the behavior:

  1. Create a Terraform resource for a few VMs, each with one disk on some datastore.
  2. Manually add the VMs as HA resources.
  3. Update the Terraform resource to change the disks' datastore.
  4. Run Terraform.
  5. Watch the fireworks on the Proxmox cluster.

Expected behavior
The VMs should shut down, the disks should be moved, and the VMs should then restart.
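The expected sequence can be sketched as an orchestration where each step blocks until the previous state change is confirmed. This is purely illustrative (the `vm` object, its methods, and `FakeVM` are hypothetical stand-ins, not the provider's API):

```python
def relocate_disk(vm, target_datastore):
    """Sketch of the expected sequence: stop, move, restart.

    `vm` is a hypothetical wrapper around Proxmox API calls; the key point
    is that the disk move only starts after the stopped state is confirmed,
    rather than trusting that the shutdown call implies a stopped VM.
    """
    vm.shutdown()
    if not vm.wait_for_status("stopped"):   # crucial for HA-managed VMs
        raise TimeoutError("VM did not stop in time")
    vm.move_disk(target_datastore)          # VM is now safely stopped
    vm.start()

class FakeVM:
    """Records the call order so the sequencing can be checked offline."""
    def __init__(self):
        self.log = []
    def shutdown(self):
        self.log.append("shutdown")
    def wait_for_status(self, status):
        # Pretend the HA manager eventually brings the VM to this state.
        self.log.append(f"wait:{status}")
        return True
    def move_disk(self, datastore):
        self.log.append(f"move:{datastore}")
    def start(self):
        self.log.append("start")

vm = FakeVM()
relocate_disk(vm, "ceph-pool-2")
assert vm.log == ["shutdown", "wait:stopped", "move:ceph-pool-2", "start"]
```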

Screenshot

[screenshot: lock-failure]

The third line here is unrelated, but the other lines clearly show the problem.

Additional context

  • The issue doesn't occur every time. Running the operation on multiple VMs in parallel increases the chances of triggering it.
  • Some of the additional mess with the disk's configuration may be my fault, as I tried to set the HA state to disabled while the operation was still running.
  • I am using the latest version plus my own HA-support branch, but I also reproduced the issue with a manually managed HA resource.
  • My datastores are all Ceph-based, although I don't believe this has any influence.
@tseeker
Contributor Author

tseeker commented Aug 20, 2023

I can confirm that the subsequent disk misconfiguration was caused by my attempts at disabling the HA resource. If left untouched, the plugin exits with an error when trying to start the VM, as the HA resource's error state prevents the start.
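Since an HA resource stuck in the error state makes the start fail anyway, the provider could fail fast with a clearer message. A hypothetical guard (the function name and message are illustrative, not provider code):

```python
def ensure_startable(ha_state):
    """Refuse to start a VM whose HA resource is in the 'error' state,
    since the Proxmox cluster would reject the start request anyway.

    `ha_state` is assumed to come from a prior HA resource status query.
    """
    if ha_state == "error":
        raise RuntimeError(
            "HA resource is in the 'error' state; clear it before "
            "attempting to start the VM"
        )

ensure_startable("started")  # a healthy state passes silently
try:
    ensure_startable("error")
except RuntimeError as exc:
    print("refused:", exc)
```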
