
libvirt: can't create workers with non-default storage path #308

Closed · mrogers950 opened this issue Sep 21, 2018 · 10 comments · Fixed by #1628

mrogers950 (Contributor) commented Sep 21, 2018

If your libvirt default storage pool's path is not /var/lib/libvirt/images, the libvirt-machine-controller fails to create the workers:

$ oc logs pod/clusterapi-controllers-85f6bfd9d5-6rbb8 -n openshift-cluster-api -c libvirt-machine-controller
...
I0921 18:24:47.612590       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:47.614725       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base
I0921 18:24:48.016159       1 controller.go:79] Running reconcile Machine for worker-2fp6s
I0921 18:24:48.023462       1 actuator.go:70] Checking if machine worker-2fp6s for cluster dev exists.
I0921 18:24:48.023638       1 logs.go:41] [DEBUG] Check if a domain exists
I0921 18:24:48.029976       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.030852       1 controller.go:123] reconciling machine object worker-2fp6s triggers idempotent create.
I0921 18:24:48.033465       1 actuator.go:46] Creating machine "worker-2fp6s" for cluster "dev".
I0921 18:24:48.036047       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.036107       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-2fp6s for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:48.038373       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base

I worked around it with a bind mount. If the path isn't configurable, it would be handy if it were.
(Also, there's a typo, "Coud", in the error message.)

wking (Member) commented Sep 23, 2018

I hit this too, and I'm trying to track down where the /var/lib/libvirt/images path is coming from. On my host:

$ sudo ls /var/lib/libvirt/images/
$ virsh -c qemu+tcp://192.168.122.1/system pool-list
 Name                 State      Autostart 
-------------------------------------------
 default              active     yes       

$ virsh -c qemu+tcp://192.168.122.1/system pool-dumpxml default
<pool type='dir'>
  <name>default</name>
  <uuid>c20a2154-aa60-44cf-bf37-cd8b7818a4e4</uuid>
  <capacity unit='bytes'>105554829312</capacity>
  <allocation unit='bytes'>44038131712</allocation>
  <available unit='bytes'>61516697600</available>
  <source>
  </source>
  <target>
    <path>/home/trking/VirtualMachines</path>
    <permissions>
      <mode>0777</mode>
      <owner>114032</owner>
      <group>114032</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
  </target>
</pool>

$ ls /home/trking/VirtualMachines/
bootstrap  bootstrap.ign  coreos_base  master0  master-0.ign  worker.ign

In the installer, we have settings for the pool and volume, which we currently hard-code to default and coreos_base. We set the volume, so hard-coding that shouldn't be a problem. We don't set the pool when we create the volume, so we get the Terraform provider's default, which is the pool named default. So far, so good. Since #205 we push those values into the cluster for the machine-config-operator to pick up (openshift/machine-config-operator#47). Our ImagePool and ImageVolume settings are recent (#271), but the MCO doesn't seem to be looking at either the old QCOWImagePath or the new Image* properties (at least as of openshift/machine-config-operator@d948fb8baa63a).

Then the chain of custody gets fuzzy for me.

On the other end, the /var/lib/... path is the actuator default for baseVolumePath, but I'm not clear on whether baseVolumePath plays into our chain.

From your logs:

I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base

we see that by the time we got here, we had default as the poolName (correct) and /var/lib/libvirt/images/coreos_base as the baseVolumeID (questionable). Then we look up the pool, and then we die trying to find the volume in that pool. We should be looking up the volume by coreos_base, e.g. with virsh:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

The issue is probably that the lookup also works when you happen to use the correct full path:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol /home/trking/VirtualMachines/coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

So can we connect the dots between our config and the busted baseVolumeID? The actuator is getting the value from the machine-provider config. Who writes that config? Maybe the machine-API operator, using this template? A short-term patch is probably updating that template to use just coreos_base. A long-term fix is probably updating something (that same template?) to use a value pulled (possibly indirectly) from the cluster config the installer is pushing.

wking (Member) commented Sep 23, 2018

Possible fix in openshift/machine-api-operator#70.

nhosoi commented Jan 14, 2019

I ran into a similar issue with v0.9.1.

As an experiment, I replaced the hardcoded path "/var/lib/libvirt/images" with my storage path (/home/VMpool) here:
https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L74

With that change, my worker node image and its ignition file are placed in the storage path:

# ls /home/VMpool
ntest0-base  ntest0-master-0  ntest0-master.ign  ntest0-worker-0-4vlw2  ntest0-worker-0-4vlw2.ignition

But the worker node failed to start with this error:
W0113 17:45:53.363756 1 controller.go:183] unable to create machine ntest0-worker-0-4vlw2: ntest0/ntest0-worker-0-4vlw2: error creating libvirt machine: error creating domain Failed to setDisks: Can't retrieve volume /var/lib/libvirt/images/ntest0-worker-0-4vlw2

Obviously, it expects to find the worker node image ntest0-worker-0-4vlw2 in /var/lib/libvirt/images instead of in my storage path /home/VMpool... But I cannot find the place where /var/lib/libvirt/images is hardcoded or expected as a default path in the installer. Do you have any idea how I can work around this issue?

Thanks!

wking (Member) commented Jan 14, 2019

But I cannot find the place /var/lib/libvirt/images is hardcoded...

openshift/cluster-api-provider-libvirt#45 (the successor to openshift/machine-api-operator#70 linked above).

nhosoi commented Jan 14, 2019

Thank you, @wking.

I hope openshift/cluster-api-provider-libvirt#45 is going to be merged and cluster-api-provider-libvirt will be rebuilt soon...

zeenix mentioned this issue Mar 25, 2019
steven-ellis commented

This is still an outstanding issue. I'm using #1371 to bootstrap on libvirt and just hit this storage issue.

markmc (Contributor) commented Apr 15, 2019

Looks like this is the issue: https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L71

Volume: &libvirtprovider.Volume{
        PoolName:     "default",
        BaseVolumeID: fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID),
},

i.e. when the installer generates the provider spec for machines, it guesses what volume ID libvirt will have generated for the base image.

The base image is created here: https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf#L5

module "volume" {
  source = "./volume"

  cluster_id = "${var.cluster_id}"
  image      = "${var.os_image}"
}

and the volume ID is referenced as ${module.volume.coreos_base_volume_id}.
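
To make the mismatch concrete, here is a minimal Go sketch (not installer code; the clusterID and pool-target values are just the examples from this thread) of why the guessed ID only matches the real one when the pool target is the default path:

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	clusterID := "dev"                           // example cluster ID from the logs above
	poolTarget := "/home/trking/VirtualMachines" // example pool target from the pool-dumpxml above

	// What the installer guesses the base-image volume ID will be:
	guessed := fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID)

	// What libvirt actually generates for a dir-type pool: the file path
	// under the pool's target directory.
	actual := filepath.Join(poolTarget, clusterID+"-base")

	fmt.Println(guessed == actual) // false unless the pool target is /var/lib/libvirt/images
}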

Probably the easiest solution is to allow configuring the volume in the provider spec with a name rather than a volume ID.

i.e. right now we require:

      volume:
        poolName: default
        baseVolumeID: /var/lib/libvirt/images/coreos_base

but there's no reason to require the volume key if we just have the pool and the volume name. This should be sufficient:

      volume:
        poolName: default
        baseVolumeName: coreos_base

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case, rather than the virStorageVolLookupByKey() it does now.
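
For illustration, a minimal sketch of that name-based lookup with a key-based fallback, using the github.com/libvirt/libvirt-go bindings; the lookupBaseVolume helper and its arguments are hypothetical, not the actuator's actual code:

package main

import (
	"fmt"

	libvirt "github.com/libvirt/libvirt-go"
)

// lookupBaseVolume resolves the base volume by name within the pool when a
// name is given, and falls back to the key-based lookup for old-style
// baseVolumeID specs.
func lookupBaseVolume(conn *libvirt.Connect, poolName, baseVolumeName, baseVolumeID string) (*libvirt.StorageVol, error) {
	if baseVolumeName != "" {
		pool, err := conn.LookupStoragePoolByName(poolName)
		if err != nil {
			return nil, fmt.Errorf("can't find storage pool %q: %v", poolName, err)
		}
		defer pool.Free()
		// virStorageVolLookupByName works regardless of the pool's target path.
		return pool.LookupStorageVolByName(baseVolumeName)
	}
	// virStorageVolLookupByKey needs the full, host-specific volume key.
	return conn.LookupStorageVolByKey(baseVolumeID)
}

func main() {
	conn, err := libvirt.NewConnect("qemu:///system")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	vol, err := lookupBaseVolume(conn, "default", "coreos_base", "")
	if err != nil {
		panic(err)
	}
	defer vol.Free()

	name, _ := vol.GetName()
	fmt.Println("found base volume:", name)
}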

zeenix (Contributor) commented Apr 15, 2019

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case rather than the virStorageVolLookupByKey() it does now

Correct. This was done in openshift/cluster-api-provider-libvirt#45, which I've finally rebased and reworked a bit. I'm going to test it today and create a new PR.

steven-ellis commented

Looks like the fix is now in openshift/cluster-api-provider-libvirt#144. I'm re-testing on my local environment off master.

zeenix (Contributor) commented May 21, 2019

@steven-ellis The libvirt actuator bit, yes, but the installer part is still not merged because CI is flaky: #1628
