
Provide cross-architecture Ignition QEMU support #928

Open
jlebon opened this issue Feb 28, 2020 · 37 comments
Labels
jira for syncing to jira

Comments

@jlebon
Member

jlebon commented Feb 28, 2020

This is split out from discussions that started in #656.

To summarize the problem statement:

  1. The root issue is that Ignition today does not require a config. So to do its job, it needs to be able to query whether there is a config. In the case of user-data over the network for example, it can simply wait until networking comes online and GET the URL. If it's a 404, it can move on.
  2. There is no surefire way to know when the kernel has finished discovering all attached storage devices. This intuitively makes sense because many types of storage device are designed to be hotpluggable, and the architecture reflects this. Additionally, storage device hierarchy may be complex in some environments, resulting in a long discovery phase as more nested devices come online. The fw_cfg device works around this because it's memory-mapped, but it's only available on a subset of architectures.
  3. Therefore, in the case of an Ignition config stored on a storage device, Ignition cannot tell the difference between "config was not provided" and "config was provided but the device hasn't yet come online".
  4. It's also worth noting that a lot of this pain comes from wanting to avoid extra supporting tools for launching VMs.

One suggestion then is to change the semantics of Ignition slightly on QEMU: we don't require an Ignition config, but we require the device through which we can determine whether an Ignition config was provided. So e.g. we can require a CD drive, even if there is no "CD" inserted (not actually suggesting this though, see below). Then we can afford to just wait forever until whatever device we want shows up. In a way, this is consistent with waiting until networking comes up.

Note that unlike most cloud services that have a metadata service for SSH keys, on QEMU there is no other way to configure the guest anyway. Therefore, while there might be some use cases for booting a QEMU guest without a config, they're likely very rare. (And again, note that this change doesn't make Ignition technically require a config, but simply the device itself.)

So, what's the best way to implement this? Three primary criteria:

  1. it should be supported across architectures
  2. it should be easily identifiable as "the Ignition device" from the guest side
  3. it should be the minimal amount of setup/hassle for users

So for example, while virtio-serial is AFAIK supported across all arches, and can have custom names, it would require a shim on the host side to communicate with (unless something like what's suggested in #656 (comment) is implemented).

A CD drive would also work, but it requires users to create an ISO first, which is annoying.

What I'm playing with right now is a virtio-blk device with a serial of ignition:

host$ echo '{"foo": "bar"}' > config.ign
host$ cosa run -- -drive file=config.ign,if=none,format=raw,readonly=on,id=ignition -device virtio-blk,serial=ignition,drive=ignition
guest$ cat /dev/disk/by-id/virtio-ignition
{"foo": "bar"}

This more or less fits all three criteria by not requiring much preparation, being supported across architectures, and being easily identifiable from the guest side. The CLI is clearly less elegant than -fw_cfg, though ideally you'd be copy-pasting it anyway.

Of course, if you have other candidate devices that fit the criteria, definitely comment!
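A guest-side consumer of the device above could be sketched like this (a hypothetical illustration, not Ignition's actual implementation; it assumes QEMU rounds the raw backing file up to the device's sector-aligned size, so reads past the config are NUL-padded):

```shell
#!/bin/sh
# Hypothetical sketch: read the Ignition config off the virtio-blk device
# attached with serial "ignition", stripping the trailing NUL padding that
# comes from the raw backing file being rounded up to the block size.
read_ignition_config() {
  dev=$1
  tr -d '\0' < "$dev"
}

# Example (using the by-id symlink from the QEMU invocation above):
#   read_ignition_config /dev/disk/by-id/virtio-ignition
```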

@jlebon
Member Author

jlebon commented Feb 28, 2020

A follow-up question: even if we do something like this, clearly we have to keep supporting fw_cfg for a while to make transitioning easier, though should we eventually consider it deprecated? I personally lean towards "yes" because having a consistent mechanism across architectures would be really nice.

(And in fact, this could generalize to more hypervisors than just QEMU; basically anything that supports block devices with custom ids).

@cgwalters
Member

Right, there's no concerns with "did the device appear" with a qemu platform ID; we know to wait for it.

A virtio disk seems fine to me, though I would say we should e.g. error out fatally if the device is writable just to avoid mistakes in that?
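That "fail fatally if writable" guard could look roughly like this (a sketch only; the kernel exposes a per-device read-only flag in sysfs, which is 1 for drives attached with readonly=on, and the SYSFS variable is parameterized purely so the logic can be exercised without a real block device):

```shell
#!/bin/sh
# Hypothetical sketch of the writable-device guard. The sysfs root is
# parameterized for testing; in a real guest it would just be /sys.
SYSFS=${SYSFS:-/sys}

device_is_readonly() {
  # $1 is a kernel block device name, e.g. "vdb"
  [ "$(cat "$SYSFS/block/$1/ro" 2>/dev/null)" = "1" ]
}

# e.g.:
#   device_is_readonly vdb || { echo "ignition device is writable" >&2; exit 1; }
```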

@jlebon
Member Author

jlebon commented Feb 28, 2020

I'm going to reply to this comment here:

Additionally, there has been literally no effort in investigating how to do this in a non-hackish way with platform-specific capabilities of qemu-system-s390x (I guess because, to the best of my knowledge, #825 hasn't been properly prioritized & planned on the roadmap of openshift-s390x development team).

Right, I think we should exercise due diligence here and discuss with s390x SMEs before going with this approach. (Though note there is also ppc64le.)

@jlebon
Member Author

jlebon commented Feb 28, 2020

A virtio disk seems fine to me, though I would say we should e.g. error out fatally if the device is writable just to avoid mistakes in that?

Yup, seems reasonable to me.

@Prashanth684
Contributor

This sounds like a very reasonable approach. I did talk to some QEMU s390x SMEs this morning, and they said they would have to investigate the possibility of a fw_cfg equivalent, so it doesn't look like there will be a solution for s390x/ppc64le anytime soon. I will also experiment with this on s390x and ppc64le. Thanks!

@jlebon
Member Author

jlebon commented Feb 28, 2020

On libvirt:

virt-install ... --disk path=$PWD/config.ign,format=raw,readonly=on,serial=ignition

Oh hey! That's shorter than --qemu-commandline="-fw_cfg name=opt/com.coreos/config,file=$PWD/config.ign". 🎉

@Prashanth684
Contributor

Prashanth684 commented Feb 28, 2020

Hmm, I was using the coreos-installer with the kernel command line coreos.inst.install_dev=vda along with this, and it gave me:

[   10.759067] coreos-installer[812]: Mounting tmpfs
[   10.762663] coreos-installer[812]: Downloading install image
[   11.801143] coreos-installer[812]: 15%
[   12.840823] coreos-installer[812]: 30%
[   13.891982] coreos-installer[812]: 46%
[   14.951579] coreos-installer[812]: 61%
[   16.003197] coreos-installer[812]: 77%
[   17.067637] coreos-installer[812]: 93%
[   18.081391] coreos-installer[812]: Wiping /dev/vda
[   18.085491] coreos-installer[812]: Writing disk image
[   18.085717] coreos-installer[812]: Extracting disk image
[   37.943034] coreos-installer[812]: Mounting tmpfs
[   77.120206] coreos-installer[812]: Checking that no-one is using this disk right now ... FAILED
[   77.123679] coreos-installer[812]: This disk is currently in use - repartitioning is probably a bad idea.
[   77.124350] coreos-installer[812]: Umount all file systems, and swapoff all swap partitions on this disk.
[   77.133128] coreos-installer[812]: Use the --no-reread flag to suppress this check.
[   77.135027] coreos-installer[812]: sfdisk: Use the --force flag to overrule all checks.
[   77.153159] coreos-installer[812]: dd: /dev/vda: cannot seek: Invalid argument
[   77.158735] coreos-installer[812]: failed to write image to zFCP SCSI disk
Usage: /usr/libexec/coreos-installer [options]

should we have a separate disk for this?

@bgilbert
Contributor

@Prashanth684 Maybe I'm missing something; is that relevant to this issue?

I agree that if we hard-require a userdata device we can avoid the race condition. It's a tempting idea, but I'm not 100% sold yet that it's the right approach.

@jlebon Did you have a chance to look at PCI devices at all?

@Prashanth684
Contributor

Prashanth684 commented Feb 28, 2020

@Prashanth684 Maybe I'm missing something; is that relevant to this issue?

Ah, correct, this use case doesn't matter here. My bad, thanks for clarifying.

@darkmuggle
Contributor

From my vantage point -- I like the idea of a disk, simply because the model is easy to implement. It's:

  • cross platform
  • cross distro

For some distributions, and even some versions of QEMU, the firmware interface (or kernel support for it) might be missing. IMHO, the disk approach generalizes the requirements better.

@lucab
Contributor

lucab commented Feb 29, 2020

@jlebon thanks for summing up the whole topic!

Bunch of random comments:

  • the virtio-serial path is IMHO not nice to port/automate due to the shim requirement.
  • the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.
  • this is not very different from the current virtualbox provider, which however uses a GPT disk. See #629 (providers/virtualbox: investigate using GuestProperties) for more.
  • I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

@bgilbert
Contributor

bgilbert commented Mar 2, 2020

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

@berrange

berrange commented Mar 2, 2020

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

Perhaps you can rely on the timeout for the general, fully cross-arch portable case, but have a side channel to let you optimize specific cases: e.g. a kernel command-line arg can tell you it definitely exists, and thus should be waited for with no timeout, and/or on x86 & aarch64 an SMBIOS field can tell you it definitely exists. This is essentially the approach that cloud-init takes: it looks at some well-known defaults, but a kernel command-line arg can give it an explicit place to look.

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

To identify which disk to use, you'll need some identifier, which comes down to a choice between disk serial string, or filesystem uuid or filesystem label. If you want raw JSON on disk, then disk serial is your required identifier.

@crawford
Contributor

crawford commented Mar 5, 2020

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

@Prashanth684
Contributor

Prashanth684 commented Mar 5, 2020

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

I did have a chat with some QEMU folks in IBM and it looks like they have to refactor certain drivers and define a new transport to achieve this. I have captured this in a BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1810678

@berrange

berrange commented Mar 5, 2020

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

QEMU provides multiple host-guest bridges for passing data back and forth. There is virtio-serial and virtio-vsock. The vsock device actually comes from VMware; QEMU merely added a virtio transport for VMware's existing device type here. The problem with any device like this is that it requires waiting for the guest OS to probe the hardware, initialize the device, and expose it to userspace. Any other device QEMU might implement is going to have similar issues in the guest OS with device initialization.

If ignition isn't able to wait for these device init actions to take place, then the only option left is to rely on well defined memory regions initialized by the firmware such as fw_cfg and SMBIOS, none of which have 100% platform portability.

@jlebon
Member Author

jlebon commented Mar 5, 2020

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Can you expand on this? Do you mean automatically figuring out which platform we're running on?

But again, note that this change isn't making the Ignition config required. It's only making the medium on which to check for a config required. I know this seems like a silly distinction, but I think it's important.

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

There are multiple bridges available. However, the subset of those which aren't a pain to use is much smaller. (See some of the criteria in #928 (comment)).

@crawford
Contributor

crawford commented Mar 5, 2020

The problem with any type of device like this is that it will require waiting for the guest OS to probe and hardware, initialize the device and expose it to userspace.

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time. This is quite different from a CDROM, for example, since some platforms (and maybe users) dynamically add and remove media. In that case, we just don't know how long we need to wait.

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Can you expand on this? Do you mean automatically figuring out which platform we're running on?

The use case here is one where a customer is using an image without an Ignition config, with the intent of discovering the hardware (e.g. the name of the network interfaces, disks, etc.).

But again, note that this change isn't making the Ignition config required. It's only making the medium on which to check for a config required. I know this seems like a silly distinction, but I think it's important.

Not silly at all. In fact, I'm not sure I fully understand the distinction. Are you saying that we would potentially require the media to be plugged, even if it doesn't contain a config (so that we may be able to confidently say "there is no config now and there will never be one")? If so, I think this is a convenient model, as long as the UX is reasonable. If I have to go out of my way to provide empty media to a machine, that's not much easier than providing a no-op config.

@jlebon
Member Author

jlebon commented Mar 5, 2020

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time.

And how does one know when the kernel has finished iterating over the devices once? :) udevadm settle, for example, will just wait until queued events are handled. That's AFAIK totally separate from whether e.g. a kernel driver is just being slow to probe a device.

The use case here is one where a customer is using an image without an Ignition config, with the intent of discovering the hardware (e.g. the name of the network interfaces, disks, etc.).

Ahh OK. Yeah, this issue doesn't change that for bare metal machines. It's only scoped to QEMU machines (though even on bare metal, now that we don't have coreos.autologin, how do we expect this workflow to work without providing an Ignition config?).

Are you saying that we would potentially require the media to be plugged, even if it doesn't contain a config (so that we may be able to confidently say "there is no config now and there will never be one")? If so, I think this is a convenient model, as long as the UX is reasonable.

Yup, exactly!

If I have to go out of my way to provide empty media to a machine, that's not much easier than providing a no-op config.

Right, that's a tricky part. With a CD-ROM for example, there's ENOMEDIUM. For a block device, I think the closest is just using /dev/null as the source. (But yes, you'd still have to type e.g. --disk path=/dev/null,format=raw,readonly=on,serial=ignition).

But again, note that on QEMU at least, there is no other way to provision machines anyway. So the use case for not providing an Ignition config is pretty small.
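One way the "empty media means no config" convention could be sketched (hypothetical; it assumes a zero-length or all-NUL device, such as /dev/null-backed media, should be treated as "no config provided"):

```shell
#!/bin/sh
# Hypothetical: treat a zero-length or all-NUL config device as "no config
# provided", so /dev/null-backed media cleanly means "nothing to do".
has_config() {
  [ -n "$(tr -d '\0' < "$1")" ]
}
```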

@jlebon
Member Author

jlebon commented Mar 5, 2020

@jlebon Did you have a chance to look at PCI devices at all?

I looked a bit, but didn't turn up much. So you're looking for a generic PCI device with e.g. an Ignition-specific device ID which somehow adds some I/O interface through which to read the Ignition config?

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

The key difference though is that the timeout we're trying to avoid is about "no config provided vs config provided". It could still be worth having a "something is obviously wrong" timeout with a much longer value than would be reasonable to expect from how long it'd take a QEMU bootup to find devices. This gives the chance for sysadmins to probe around and troubleshoot things. Is 5m unreasonable?
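Such a "something is obviously wrong" guard could be sketched like this (hypothetical shell illustration; the 300-second default is just the 5m figure floated above, and the real implementation would more likely live in a systemd unit timeout):

```shell
#!/bin/sh
# Hypothetical guard: wait for the config device, but give up after a
# generous timeout (default 300s) so an admin can reach the emergency
# shell instead of the boot hanging forever.
wait_for_device() {
  dev=$1; timeout=${2:-300}; elapsed=0
  until [ -e "$dev" ]; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```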

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

Yeah, I agree it's not the most elegant thing. For FCOS, this shows up in dmesg:

[    2.709843] Dev vdb: unable to read RDB block 1
[    2.710525]  vdb: unable to read partition table
[    2.711467] vdb: partition table beyond EOD, truncated

But otherwise, it works just fine. Overall, feels like it's still better than shipping a cross-platform tool that users have to run to create a config ISO?

@berrange

berrange commented Mar 6, 2020

@jlebon Did you have a chance to look at PCI devices at all?

I looked a bit, but didn't turn up much. So you're looking for a generic PCI device with e.g. an Ignition-specific device ID which somehow adds some I/O interface through which to read the Ignition config?

If by "I/O interface" you mean something RPC-like, then virtio-serial or virtio-vsock fit, but they require a daemon on the host to actually provide the data which is a bit tedious if you're just providing a single static data file.

A hard disk is appealing because it trivially exposes a raw data file.

Or a type of PCI device that exposes a memory region - possibly virtio-pmem, though I'm not sure how you'd identify the memory region from userspace.

A further idea that's been suggested is to abuse the PCI option ROM feature, essentially provide a PCI device that implements no functionality, but use the option ROM to expose a static data blob. It is not clear if this works on s390x, but it would work on x86, ppc & arm with PCI.

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time.

And how does one know if the kernel finished iterating over the devices once? :) udevadm --settle for example will just wait until events are handled. That's AFAIK totally separate from whether e.g. a kernel driver is just being slow to probe a device.

There are two different waiting stages: the PCI probing, and the userspace device setup. I think you can reasonably assume the kernel has probed all cold-booted PCI devices by the time the initrd is executing. What takes longer is then setting up the logical devices associated with them in userspace, such as the /dev/sd* disk nodes, the network interfaces, or the virtio-serial channel devices.

If Ignition can look at the PCI devices present and determine from that alone whether the right device exists, then it will know whether it is OK to wait indefinitely for the disk nodes or network interfaces to appear. This would require some unique identifier for the raw PCI device; for example, the PCI subsystem product ID could possibly be (ab)used for this purpose.

@jlebon
Member Author

jlebon commented Mar 6, 2020

A further idea that's been suggested is to abuse the PCI option ROM feature, essentially provide a PCI device that implements no functionality, but use the option ROM to expose a static data blob.
...
This would require some unique identifier for the raw PCI device. For example the PCI subsystem product ID could possibly be (ab)used for this purpose

I think I'm missing something fundamental here: how does one provide a custom PCI device? Doesn't that require rebuilding QEMU? Or is there some sort of drop-in thing it supports?

A custom QEMU device sounds like a cool idea, but if it requires shipping additional blobs that users have to download and install just to run FCOS, I don't think it's worth it.

There's two different waiting stages - the PCI probing, and the userspace device setup. I think you can reasonably assume the kernel has probed all cold booted PCI devices by the time the initrd is executing. What takes longer is then setting up the logical devices associated with them in userspace, such as the /dev/sd* disk nodes, or the network interfaces, of the virtio-serial channels devices.

Thanks, that's useful information. When you say "userspace device setup", are you referring strictly to udev, or is there also a delay for things to show up in sysfs?

jlebon added a commit to jlebon/ignition that referenced this issue Mar 6, 2020
Add experimental support for fetching Ignition configs via a virtio
block device with serial/ID `ignition`.

The main advantage of this is that it is cross-platform. But for now, we
only use it on platforms which don't support the QEMU firmware config
approach, which are (of those we care about) s390x and ppc64le. We may
end up using it across the remaining platforms.

See related discussions in:
coreos#928
@jlebon
Member Author

jlebon commented Mar 6, 2020

So here's a suggestion: we're currently in dire need of getting something to work for better OCP CI coverage on s390x. It doesn't have to be stabilized and we don't expect production use of whatever mechanism we choose.

Thus, I think this is a good opportunity to try out this approach to see how it looks and get feedback from it. I've updated #905 for this. I've tested it successfully on amd64 (by tweaking the conditional build flags).

@berrange

berrange commented Mar 6, 2020

I think I'm missing something fundamental here: how does one provide a custom PCI device? Doesn't that require rebuilding QEMU? Or is there some sort of drop-in thing it supports?

A custom QEMU device sounds like a cool idea, but if it requires shipping additional blobs that users have to download and install just to run FCOS, I don't think it's worth it.

Yes, this would be brand new code that would have to be written for QEMU, so it wouldn't work for any existing deployments of QEMU. I can understand if that is likely to be a practical roadblock.

Thanks, that's useful information. When you say "userspace device setup", are you referring strictly to udev, or is there also a delay for things to show up in sysfs?

Mostly I think it'll be udev-related, but possibly kernel-related. E.g. the kernel probes PCI devices, and they get attached to kernel drivers. Those kernel drivers then have to probe and initialize the devices. E.g. a virtio-blk PCI device needs to get the virtio-blk block driver attached, which will then create the /sys/block/NNN entry, which then emits events for udev. So it's actually three stages, really.

@jlebon
Member Author

jlebon commented Mar 6, 2020

Mostly I think it'll be udev-related, but possibly kernel-related. E.g. the kernel probes PCI devices, and they get attached to kernel drivers. Those kernel drivers then have to probe and initialize the devices. E.g. a virtio-blk PCI device needs to get the virtio-blk block driver attached, which will then create the /sys/block/NNN entry, which then emits events for udev. So it's actually three stages, really.

Thanks, I read some more driver code and that's my understanding as well. So IOW, even if we assume that the kernel has finished registering all the PCI devices by the time we run: (1) the drivers might not be loaded yet, (2) the drivers might not have been assigned to their PCI devices yet (though reading the code for driver_register, it seems like this happens synchronously with (1)), or (3) the drivers might not have finished probing the devices yet.

Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this issue Mar 30, 2020
…ignition for s390x/ppc64le

Similar to dmacvicar/terraform-provider-libvirt#718

The method of mimicking what Openstack does for injecting ignition config works for
images which have the provider as Openstack because ignition recognizes the platform
and knows it has to get the ignition config from the config drive. For QEMU images, ignition
supports getting the config from the firmware config device which is not supported by ppc64
and s390x.

The workaround we have used thus far is to use the Openstack image on the QEMU platform, but have this provider create the ISO containing the Ignition config. There was a discussion in Ignition
(coreos/ignition#928) about a more QEMU-native method of injecting the Ignition config, and it was decided to use a virtio-blk device with a serial of ignition, which Ignition can recognize. This was mainly because with external devices it is hard to tell whether there is an issue with the device, or whether the kernel simply has not detected it yet due to a long discovery phase.

This PR replaces the method of Ignition injection: from a config drive disk to a virtio-blk device specified through the QEMU command-line options.

Reference PR which supports ignition fetching through virtio-blk for QEMU: coreos/ignition#936
@jlebon
Member Author

jlebon commented Jun 8, 2020

So the new Ignition block device approach is now being used with success in RHCOS.

I'm thinking we should take the next step here towards having it used across all arches and eventually deprecating fw_cfg. Here's what I have in mind:

  1. We change Ignition on fw_cfg arches to first look for fw_cfg, but allow ENOENT and automatically fall back to the block device.
  2. Change all our tools/CI processes we control to leverage the block device approach. This further validates the approach.
  3. Change all documentation to mention the block device only, and drop mention of fw_cfg.
  4. After some time, mark fw_cfg as deprecated, emitting a warning (and maybe e.g. sleep(10)) when it is detected.
  5. After some more time, we drop fw_cfg support entirely.
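Step 1 above could be sketched as follows (a hypothetical shell illustration of the fallback order only; the real implementation is in Ignition's Go fetch logic, and both paths here are stand-ins):

```shell
#!/bin/sh
# Hypothetical sketch of step 1: prefer the fw_cfg entry, and if it's
# absent (ENOENT), fall back to the virtio-blk config device.
fetch_config() {
  fw_cfg=$1; blockdev=$2
  if [ -e "$fw_cfg" ]; then
    cat "$fw_cfg"
  else
    tr -d '\0' < "$blockdev"  # strip raw-device NUL padding
  fi
}
```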

jlebon added a commit to jlebon/ignition that referenced this issue Jun 11, 2020
We've had some good success with the new block device approach on s390x
and ppc64le. Let's now make it available on all arches. This is the
first step towards eventually deprecating `fw_cfg`:

coreos#928 (comment)

Early adopters will benefit from simplified code which works across all
architectures, as well as better compatibility with other virtualization
tooling like libvirt, which today doesn't expose `fw_cfg` easily (though
there's work in flight for that).
@jlebon
Member Author

jlebon commented Jun 11, 2020

  1. We change Ignition on fw_cfg architectures to look for fw_cfg first, but allow ENOENT and automatically fall back to the block device.

#999.

@cgwalters
Member

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Isn't that addressed by us providing a Live ISO which explicitly does not require a config and even automatically falls open to an auto-login shell?

We're shipping that today for OCP/RHCOS for this exact use case of hardware discovery.

That said, in practice these hardware discovery flows actually do want to be automated and so the assisted flow injects Ignition into the ISO.

But the point remains that the ISO serves this "no config" use case.

@bgilbert
Contributor

bgilbert commented May 5, 2022

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Isn't that addressed by us providing a Live ISO which explicitly does not require a config and even automatically falls open to an auto-login shell?

No, because that flow is specific to FCOS/RHCOS. Other distros that use Ignition may have a different usage model.

@cgwalters
Member

FCOS does not have an exclusive patent on creating ISOs 😄 Anyone can do it, so why wouldn't we ("we" in the Ignition upstream sense) recommend doing that type of thing? Or really, the generalization of this is a separate image (of any form, but since this is about bare metal, an ISO makes the most sense).

@bgilbert
Contributor

bgilbert commented May 5, 2022

Ignition is useful for a wider range of use cases than just a traditional server distro where a live image makes sense. E.g. embedded cases will have different constraints. (And as a practical matter, Ignition is already waaaay too difficult to integrate into a distro, and "you really should have a live ISO" is a pretty large ask. It took an enormous amount of work to build our live implementation.)

We should be careful not to view Ignition too parochially through the lens of CoreOS's needs. Ignition has always allowed users to boot without an Ignition config on every platform, it's a useful feature, and changing it shouldn't be done lightly.

@cgwalters
Member

Fair. Though, what about supporting a file like /etc/ignition-block-on-config (could be JSON/YAML/keyfile/whatever config file) that can be injected into the initramfs, defaults to no, but can be toggled to e.g. timeout: 1m or race? Then anyone making a Linux system that uses this and wants to have one image (AMI/qcow2/ISO/whatever) that blocks on config and one that doesn't can just drop a config file into the initramfs for one of the images.

Another approach: support an easily editable, outside-the-disk mechanism to inject config beyond just ISOs. Maybe something like a GPT flag or so - something that can be done completely unprivileged using e.g. sfdisk on a raw disk image, without actually spawning something like libguestfs. Much like coreos-installer iso ignition. Or even not the config itself, but a single bit flag for whether or not to fetch a config via virtio-blk, defaulting to on or off per distro. Such a mechanism would make it easy to ship a single .qcow2 and flip off the Ignition requirement.

(Though to make this actually useful, the OS/distro does need to detect this at a higher level and e.g. do autologin on the tty beyond what Ignition does, or I guess Ignition could auto-include a distro config to do this when the flag is detected.)

@cgwalters
Member

OK, hopefully everyone agrees that having qemu semantics be architecture-specific is a problem. This issue came up again because a test started failing only on s390x. We caught it in the development stream, but still: as of now it would block shipping OS updates on the affected platforms but not x86_64, which is obviously a problem. We should strive to minimize unnecessary architecture specifics.

So there are two paths: support something like fw_cfg on all platforms which was discussed, or support making qemu Ignition mandatory on all platforms by default.

Now, I will strive to keep upstream Ignition conceptually separate from CoreOS here... but I think most other users of Ignition are basically using it in very similar ways anyway.

I will admit that I have in the past provided no Ignition to e.g. a FCOS AWS instance, and relied on afterburn ssh key to get a shell.

But outside of cases that have some other cloud-specific OOB mechanism to configure some sort of basic login, or special images like the ISO that define a useful default Ignition... I can't think of use cases for "no Ignition" - specifically for the qemu image we ship in FCOS and derivatives.

So short term, I'd advocate for us changing qemu across all platforms to require Ignition (and try virtio first, falling back to fw_cfg on x86_64).

Ultimately I think qemu is a weird special case. Don't get me wrong, I am of two minds on this 😄 I think what we've done CoreOS side in heavily investing in having "basic sanity checking" using qemu makes a lot of sense. But OTOH - most real "qemu" uses should actually be using a real hypervisor wrapping it (whether libvirt or openstack, etc.). This brings a lot more power and capability. And OpenStack specifically then has the metadata service. It's actually doable to run a metadata service in libvirt even.

In OpenShift development there is some support for the CoreOS qemu model, but in practice it has bitrotted, and most people doing virtualized work are trying to emulate bare metal installation, so basically everyone doing real work like that has switched to using libvirt for other reasons (much more control over networking, can simulate real bare metal ISOs, etc.)

Long term, I think it would be really nice still to somehow support "race free optional metadata" for qemu. It's quite surprising it's so hard. I think the most viable path is making it easy to inject metadata into a qcow2, and drive that support into e.g. qemu-img, much like what we have with coreos-installer iso ignition embed.

@cgwalters
Member

One thing that came up today is that systemd grew code to read the fw_cfg too: https://github.com/systemd/systemd/blob/fff1edc9f9069ff2ef58a714355b61002e30f305/docs/CREDENTIALS.md?plain=1#L53

So there's potentially a stronger argument to generalize support for fw_cfg or equivalent in qemu across architectures as there are multiple readers.
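For reference, systemd's CREDENTIALS.md documents passing credentials over fw_cfg under the `opt/io.systemd.credential.*` namespace. A sketch of constructing such an option (the credential name and value here are placeholders):

```shell
# Sketch: build the -fw_cfg option systemd documents for injecting a
# credential into the guest. "mycred"/"secret" are placeholder values.
cred_name=mycred
cred_value=secret
fw_cfg_opt="name=opt/io.systemd.credential.${cred_name},string=${cred_value}"
echo "qemu-system-x86_64 ... -fw_cfg ${fw_cfg_opt}"
```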

@berrange

One thing that came up today is that systemd grew code to read the fw_cfg too: https://github.com/systemd/systemd/blob/fff1edc9f9069ff2ef58a714355b61002e30f305/docs/CREDENTIALS.md?plain=1#L53

So there's potentially a stronger argument to generalize support for fw_cfg or equivalent in qemu across architectures as there are multiple readers.

On the QEMU side, the view remains that apps should be using OEM strings for user data injection into the OS, not fw_cfg. systemd supports both, but neither is a portable solution across architectures. I thought Ignition had support for using a virtual disk as a portable solution across architectures?

With the introduction of confidential virtualization, this likely gets more complex still, because neither fw_cfg/OEM strings are trustworthy sources of information. They can also only be populated when the QEMU process is first started, and it is undesirable to release potentially sensitive data upfront. It is possible that initial boot customization data is going to need to be fetched early in boot from a remote attestation service, in response to submitting an attestation report of the VM to prove its confidentiality. This might be a direct network connection from the guest to the attestation service, or it could be a connection that is mediated by an opaque proxy in the host, as yet undetermined.

@bgilbert
Contributor

Neither OEM strings nor fw_cfg are supported on every relevant CPU architecture. Either one could work, but we'd need it to be generalized to work everywhere.

There is experimental support for using a virtual disk, but we can't stabilize it as-is. As previously discussed, there's an inherent race condition between storage device probing (which can be arbitrarily slow on a large/heavily loaded system) and the timeout that would be needed to successfully boot if the user doesn't provide a config.

What makes those two information sources untrustworthy? I agree that it's undesirable to include sensitive information in them, but they would be useful for configuring the URL to an attestation service.

@berrange

Neither OEM strings nor fw_cfg are supported on every relevant CPU architecture. Either one could work, but we'd need it to be generalized to work everywhere.

Never say never, but my feeling is it is pretty unlikely either will become supported on every arch.

What makes those two information sources untrustworthy? I agree that it's undesirable to include sensitive information in them, but they would be useful for configuring the URL to an attestation service.

This is quite a complex topic, but I'll try to give an overview.

With confidential virtualization, the host/hypervisor administrator/OS is considered untrusted. Only the physical hardware and its vendor are trusted, and you can prove that via attestation of the VM execution environment. The implication is that any data mediated by the host must also be considered untrusted: i.e. you can ask the hypervisor to expose a given block of data, but it can substitute its own data instead, so as to try to compromise the VM.

This extends to essentially every piece of virtualized hardware in the VM. Each device or data channel needs a defined way to apply encryption before it can be trusted for use. For networking, of course, TLS is already widely used. For disks, LUKS can be used, though many cipher modes are significantly degraded if the attacker can observe repeated I/O to the same sector. In the case of LUKS, we need a secure way to get the keyslot passphrase. The admin can't ever interactively type it on the virtual keyboard, as key presses can be logged by the hypervisor. The approach Azure (and likely KVM) takes is to provide a virtual TPM running inside guest context, against which the LUKS passphrase was previously sealed; the TPM state is unlocked by talking to an attestation server early in boot.

So let's say we did want to keep using fw_cfg/SMBIOS in a confidential VM. The data provided that way would need to be encrypted. Then there would need to be another mechanism to acquire the decryption key, either by directly talking to an attestation server, or perhaps by leveraging a vTPM like we'll probably do for LUKS. Alternatively, the Ignition (or cloud-init or systemd-creds) data can be handled by the attestation service directly, avoiding the use of fw_cfg/SMBIOS entirely.
