
libvirt: can't create workers with non-default storage path #308

Closed · mrogers950 opened this issue Sep 21, 2018 · 10 comments · Fixed by #1628

mrogers950 (Contributor) commented Sep 21, 2018

If your libvirt default storage pool's path is not /var/lib/libvirt/images, the libvirt-machine-controller fails to create the workers:

$ oc logs pod/clusterapi-controllers-85f6bfd9d5-6rbb8 -n openshift-cluster-api -c libvirt-machine-controller
...
I0921 18:24:47.612590       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:47.614725       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base
I0921 18:24:48.016159       1 controller.go:79] Running reconcile Machine for worker-2fp6s
I0921 18:24:48.023462       1 actuator.go:70] Checking if machine worker-2fp6s for cluster dev exists.
I0921 18:24:48.023638       1 logs.go:41] [DEBUG] Check if a domain exists
I0921 18:24:48.029976       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.030852       1 controller.go:123] reconciling machine object worker-2fp6s triggers idempotent create.
I0921 18:24:48.033465       1 actuator.go:46] Creating machine "worker-2fp6s" for cluster "dev".
I0921 18:24:48.036047       1 logs.go:41] [INFO] Created libvirt client
I0921 18:24:48.036107       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-2fp6s for pool default from the base volume /var/lib/libvirt/images/coreos_base
E0921 18:24:48.038373       1 actuator.go:50] Coud not create libvirt machine: error creating volume: Can't retrieve volume /var/lib/libvirt/images/coreos_base

I worked around it with a bind mount. If the path isn't configurable, it would be handy if it were.
(Also, there's a typo, "Coud", in the error message.)

wking (Member) commented Sep 23, 2018

I hit this too, and I'm trying to track down where the /var/lib/libvirt/images path is coming from. On my host:

$ sudo ls /var/lib/libvirt/images/
$ virsh -c qemu+tcp://192.168.122.1/system pool-list
 Name                 State      Autostart 
-------------------------------------------
 default              active     yes       

$ virsh -c qemu+tcp://192.168.122.1/system pool-dumpxml default
<pool type='dir'>
  <name>default</name>
  <uuid>c20a2154-aa60-44cf-bf37-cd8b7818a4e4</uuid>
  <capacity unit='bytes'>105554829312</capacity>
  <allocation unit='bytes'>44038131712</allocation>
  <available unit='bytes'>61516697600</available>
  <source>
  </source>
  <target>
    <path>/home/trking/VirtualMachines</path>
    <permissions>
      <mode>0777</mode>
      <owner>114032</owner>
      <group>114032</group>
      <label>system_u:object_r:virt_image_t:s0</label>
    </permissions>
  </target>
</pool>

$ ls /home/trking/VirtualMachines/
bootstrap  bootstrap.ign  coreos_base  master0  master-0.ign  worker.ign

In the installer, we have settings for the pool and volume, which we currently hard-code to default and coreos_base. We set the volume, so hard-coding that shouldn't be a problem. We don't set the pool when we create the volume, so we get the Terraform provider's default, which is the pool named default. So far, so good. Since #205 we push those values into the cluster for the machine-config-operator to pick up (openshift/machine-config-operator#47). Our ImagePool and ImageVolume settings are recent (#271), but the MCO doesn't seem to be looking at either the old QCOWImagePath or the new Image* properties (at least as of openshift/machine-config-operator@d948fb8baa63a).

Then the chain of custody gets fuzzy for me.

On the other end, the /var/lib/... path is the actuator default for baseVolumePath, but I'm not clear on whether baseVolumePath plays into our chain.

From your logs:

I0921 18:24:47.612648       1 logs.go:41] [DEBUG] Create a libvirt volume with name worker-b9f4v for pool default from the base volume /var/lib/libvirt/images/coreos_base

we see that by the time we got here, we had default as the poolName (correct) and /var/lib/libvirt/images/coreos_base as the baseVolumeID (questionable). Then we look up the pool, and then we die trying to find the volume in that pool. We should be looking up the volume by coreos_base, e.g. with virsh:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

The issue is probably that the lookup also works when you happen to use the correct full path:

$ virsh -c qemu+tcp://192.168.122.1/system vol-info --vol /home/trking/VirtualMachines/coreos_base --pool default
Name:           coreos_base
Type:           file
Capacity:       16.00 GiB
Allocation:     1.55 GiB

So can we connect the dots between our config and the busted baseVolumeID? The actuator is getting the value from the machine-provider config. Who writes that config? Maybe the machine-API operator, using this template? A short-term patch is probably updating that template to use just coreos_base. A long-term fix is probably updating something (that same template?) to use a value pulled (possibly indirectly) from the cluster config the installer is pushing.

wking (Member) commented Sep 23, 2018

Possible fix in openshift/machine-api-operator#70.

nhosoi commented Jan 14, 2019

I ran into a similar issue with v0.9.1.

As an experiment, I replaced the hardcoded path "/var/lib/libvirt/images" with my storage path (/home/VMpool) here:
https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L74

With that change, my worker node image and its ignition file are placed in the storage path:

# ls /home/VMpool
ntest0-base  ntest0-master-0  ntest0-master.ign  ntest0-worker-0-4vlw2  ntest0-worker-0-4vlw2.ignition

But the worker node failed to start with this error:
W0113 17:45:53.363756 1 controller.go:183] unable to create machine ntest0-worker-0-4vlw2: ntest0/ntest0-worker-0-4vlw2: error creating libvirt machine: error creating domain Failed to setDisks: Can't retrieve volume /var/lib/libvirt/images/ntest0-worker-0-4vlw2

Obviously, it expects to find the worker node image ntest0-worker-0-4vlw2 in /var/lib/libvirt/images instead of in my storage path /home/VMpool... But I cannot find the place where /var/lib/libvirt/images is hardcoded or expected as a default path in the installer. Do you have any idea how I can work around this issue?

Thanks!

wking (Member) commented Jan 14, 2019

But I cannot find the place /var/lib/libvirt/images is hardcoded...

openshift/cluster-api-provider-libvirt#45 (the successor to openshift/machine-api-operator#70 linked above).

nhosoi commented Jan 14, 2019

Thank you, @wking.

I hope openshift/cluster-api-provider-libvirt#45 is going to be merged and cluster-api-provider-libvirt will be rebuilt soon...

zeenix mentioned this issue Mar 25, 2019
steven-ellis commented

This is still an outstanding issue. I'm using #1371 to bootstrap on libvirt and just hit this storage issue.

markmc (Contributor) commented Apr 15, 2019

Looks like this is the issue: https://github.com/openshift/installer/blob/master/pkg/asset/machines/libvirt/machines.go#L71

Volume: &libvirtprovider.Volume{
        PoolName:     "default",
        BaseVolumeID: fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID),
},

i.e. when the installer generates the provider spec for machines, it guesses what volume ID libvirt will have generated for the base image.

The base image is created here: https://github.com/openshift/installer/blob/master/data/data/libvirt/main.tf#L5

module "volume" {
  source = "./volume"

  cluster_id = "${var.cluster_id}"
  image      = "${var.os_image}"
}

and the volume ID is referenced as ${module.volume.coreos_base_volume_id}.
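
To make the mismatch concrete, here is a minimal Go sketch (not installer code; the clusterID and pool-target values are just the examples from this thread) of why the guessed ID only matches the real one when the pool target is the default path:

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	clusterID := "dev"                           // example cluster ID from the logs above
	poolTarget := "/home/trking/VirtualMachines" // example pool target from the pool-dumpxml above

	// What the installer guesses the base-image volume ID will be:
	guessed := fmt.Sprintf("/var/lib/libvirt/images/%s-base", clusterID)

	// What libvirt actually generates for a dir-type pool: the file path
	// under the pool's target directory.
	actual := filepath.Join(poolTarget, clusterID+"-base")

	fmt.Println(guessed == actual) // false unless the pool target is /var/lib/libvirt/images
}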

Probably the easiest solution is to allow configuring the volume in the provider spec with a name rather than a volume ID.

i.e. right now we require:

      volume:
        poolName: default
        baseVolumeID: /var/lib/libvirt/images/coreos_base

but there's no reason to require the volume key if we just have the pool and the volume name. This should be sufficient:

      volume:
        poolName: default
        baseVolumeName: coreos_base

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case, rather than the virStorageVolLookupByKey() it does now.
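
For illustration, a minimal sketch of that name-based lookup with a key-based fallback, using the github.com/libvirt/libvirt-go bindings; the lookupBaseVolume helper and its arguments are hypothetical, not the actuator's actual code:

package main

import (
	"fmt"

	libvirt "github.com/libvirt/libvirt-go"
)

// lookupBaseVolume resolves the base volume by name within the pool when a
// name is given, and falls back to the key-based lookup for old-style
// baseVolumeID specs.
func lookupBaseVolume(conn *libvirt.Connect, poolName, baseVolumeName, baseVolumeID string) (*libvirt.StorageVol, error) {
	if baseVolumeName != "" {
		pool, err := conn.LookupStoragePoolByName(poolName)
		if err != nil {
			return nil, fmt.Errorf("can't find storage pool %q: %v", poolName, err)
		}
		defer pool.Free()
		// virStorageVolLookupByName works regardless of the pool's target path.
		return pool.LookupStorageVolByName(baseVolumeName)
	}
	// virStorageVolLookupByKey needs the full, host-specific volume key.
	return conn.LookupStorageVolByKey(baseVolumeID)
}

func main() {
	conn, err := libvirt.NewConnect("qemu:///system")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	vol, err := lookupBaseVolume(conn, "default", "coreos_base", "")
	if err != nil {
		panic(err)
	}
	defer vol.Free()

	name, _ := vol.GetName()
	fmt.Println("found base volume:", name)
}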

zeenix (Contributor) commented Apr 15, 2019

Of course, the actuator needs a patch to do virStorageVolLookupByName() in this case rather than the virStorageVolLookupByKey() it does now

Correct. This was done in openshift/cluster-api-provider-libvirt#45, which I've finally rebased and reworked a bit. I'm going to test it today and create a new PR.

steven-ellis commented

Looks like the fix is now in openshift/cluster-api-provider-libvirt#144. I'm re-testing on my local environment off master.

zeenix (Contributor) commented May 21, 2019

@steven-ellis The libvirt actuator bit, yes, but the installer part is still not merged because CI is flaky: #1628
