
Provide cross-architecture Ignition QEMU support #928

Open
jlebon opened this issue Feb 28, 2020 · 37 comments
Labels
jira for syncing to jira

Comments

@jlebon
Member

jlebon commented Feb 28, 2020

This is split out from discussions that started in #656.

To summarize the problem statement:

  1. The root issue is that Ignition today does not require a config. So to do its job, it needs to be able to query whether there is a config. In the case of user-data over the network for example, it can simply wait until networking comes online and GET the URL. If it's a 404, it can move on.
  2. There is no surefire way to know when the kernel has finished discovering all attached storage devices. This intuitively makes sense because many types of storage device are designed to be hotpluggable, and the architecture reflects this. Additionally, storage device hierarchy may be complex in some environments, resulting in a long discovery phase as more nested devices come online. The fw_cfg device works around this because it's memory-mapped, but it's only available on a subset of architectures.
  3. Therefore, in the case of an Ignition config stored on a storage device, Ignition cannot tell the difference between "config was not provided" and "config was provided but the device hasn't yet come online".
  4. It's also worth noting that a lot of this pain comes from wanting to avoid extra supporting tools for launching VMs.

One suggestion then is to change the semantics of Ignition slightly on QEMU: we don't require an Ignition config, but we require the device through which we can determine whether an Ignition config was provided. So e.g. we can require a CD drive, even if there is no "CD" inserted (not actually suggesting this though, see below). Then we can afford to just wait forever until whatever device we want shows up. In a way, this is consistent with waiting until networking comes up.

Note that unlike most cloud services that have a metadata service for SSH keys, on QEMU there is no other way to configure the guest anyway. Therefore, while there might be some use cases for booting a QEMU guest without a config, they're likely very rare. (And again, note that this change doesn't make Ignition technically require a config, but simply the device itself.)

So, what's the best way to implement this? Three primary criteria:

  1. it should be supported across architectures
  2. it should be easily identifiable as "the Ignition device" from the guest side
  3. it should be the minimal amount of setup/hassle for users

So for example, while virtio-serial is AFAIK supported across all arches, and can have custom names, it would require a shim on the host side to communicate with (unless something like what's suggested in #656 (comment) is implemented).

A CD drive would also work, but it requires users to create an ISO first, which is annoying.

What I'm playing with right now is a virtio-blk device with a serial of ignition:

host$ echo '{"foo": "bar"}' > config.ign
host$ cosa run -- -drive file=config.ign,if=none,format=raw,readonly=on,id=ignition -device virtio-blk,serial=ignition,drive=ignition
guest$ cat /dev/disk/by-id/virtio-ignition
{"foo": "bar"}

This more or less fits all three criteria by not requiring much preparation, being supported across architectures, and being easily identifiable from the guest side. The CLI is clearly less elegant than -fw_cfg, though ideally you'd be copy-pasting it anyway.

Of course, if you have other candidate devices that fit the criteria, definitely comment!
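A guest-side consumer of the device above could be sketched like this (a hypothetical illustration, not Ignition's actual implementation; it assumes QEMU rounds the raw backing file up to the device's sector-aligned size, so reads past the config are NUL-padded):

```shell
#!/bin/sh
# Hypothetical sketch: read the Ignition config off the virtio-blk device
# attached with serial "ignition", stripping the trailing NUL padding that
# comes from the raw backing file being rounded up to the block size.
read_ignition_config() {
  dev=$1
  tr -d '\0' < "$dev"
}

# Example (using the by-id symlink from the QEMU invocation above):
#   read_ignition_config /dev/disk/by-id/virtio-ignition
```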

@jlebon
Member Author

jlebon commented Feb 28, 2020

A follow-up question: even if we do something like this, clearly we have to keep supporting fw_cfg for a while to make transitioning easier, though should we eventually consider it deprecated? I personally lean towards "yes" because having a consistent mechanism across architectures would be really nice.

(And in fact, this could generalize to more hypervisors than just QEMU; basically anything that supports block devices with custom ids).

@cgwalters
Member

Right, there's no concerns with "did the device appear" with a qemu platform ID; we know to wait for it.

A virtio disk seems fine to me, though I would say we should e.g. error out fatally if the device is writable just to avoid mistakes in that?
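That "fail fatally if writable" guard could look roughly like this (a sketch only; the kernel exposes a per-device read-only flag in sysfs, which is 1 for drives attached with readonly=on, and the SYSFS variable is parameterized purely so the logic can be exercised without a real block device):

```shell
#!/bin/sh
# Hypothetical sketch of the writable-device guard. The sysfs root is
# parameterized for testing; in a real guest it would just be /sys.
SYSFS=${SYSFS:-/sys}

device_is_readonly() {
  # $1 is a kernel block device name, e.g. "vdb"
  [ "$(cat "$SYSFS/block/$1/ro" 2>/dev/null)" = "1" ]
}

# e.g.:
#   device_is_readonly vdb || { echo "ignition device is writable" >&2; exit 1; }
```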

@jlebon
Member Author

jlebon commented Feb 28, 2020

I'm going to reply to this comment here:

Additionally, there has been literally no effort in investigating how to do this in a non-hackish way with platform-specific capabilities of qemu-system-s390x (I guess because, to the best of my knowledge, #825 hasn't been properly prioritized & planned on the roadmap of openshift-s390x development team).

Right, I think we should exercise due diligence here and discuss with s390x SMEs before going with this approach. (Though note there is also ppc64le.)

@jlebon
Member Author

jlebon commented Feb 28, 2020

A virtio disk seems fine to me, though I would say we should e.g. error out fatally if the device is writable just to avoid mistakes in that?

Yup, seems reasonable to me.

@Prashanth684
Contributor

This sounds like a very reasonable approach. I did talk to some QEMU s390x SMEs this morning, and they said they would have to investigate the possibility of a fw_cfg equivalent, so it doesn't look like there will be a solution for s390x/ppc64le anytime soon. I will also experiment with this on s390x and ppc64le. Thanks!

@jlebon
Member Author

jlebon commented Feb 28, 2020

On libvirt:

virt-install ... --disk path=$PWD/config.ign,format=raw,readonly=on,serial=ignition

Oh hey! That's shorter than --qemu-commandline="-fw_cfg name=opt/com.coreos/config,file=$PWD/config.ign". 🎉

@Prashanth684
Contributor

Prashanth684 commented Feb 28, 2020

Hmm, I was using the coreos-installer with the kernel command line coreos.inst.install_dev=vda along with this, and it gave me:

[   10.759067] coreos-installer[812]: Mounting tmpfs
[   10.762663] coreos-installer[812]: Downloading install image
[   11.801143] coreos-installer[812]: 15%
[   12.840823] coreos-installer[812]: 30%
[   13.891982] coreos-installer[812]: 46%
[   14.951579] coreos-installer[812]: 61%
[   16.003197] coreos-installer[812]: 77%
[   17.067637] coreos-installer[812]: 93%
[   18.081391] coreos-installer[812]: Wiping /dev/vda
[   18.085491] coreos-installer[812]: Writing disk image
[   18.085717] coreos-installer[812]: Extracting disk image
[   37.943034] coreos-installer[812]: Mounting tmpfs
[   77.120206] coreos-installer[812]: Checking that no-one is using this disk right now ... FAILED
[   77.123679] coreos-installer[812]: This disk is currently in use - repartitioning is probably a bad idea.
[   77.124350] coreos-installer[812]: Umount all file systems, and swapoff all swap partitions on this disk.
[   77.133128] coreos-installer[812]: Use the --no-reread flag to suppress this check.
[   77.135027] coreos-installer[812]: sfdisk: Use the --force flag to overrule all checks.
[   77.153159] coreos-installer[812]: dd: /dev/vda: cannot seek: Invalid argument
[   77.158735] coreos-installer[812]: failed to write image to zFCP SCSI disk
Usage: /usr/libexec/coreos-installer [options]

should we have a separate disk for this?

@bgilbert
Contributor

@Prashanth684 Maybe I'm missing something; is that relevant to this issue?

I agree that if we hard-require a userdata device we can avoid the race condition. It's a tempting idea, but I'm not 100% sold yet that it's the right approach.

@jlebon Did you have a chance to look at PCI devices at all?

@Prashanth684
Contributor

Prashanth684 commented Feb 28, 2020

@Prashanth684 Maybe I'm missing something; is that relevant to this issue?

Ah, correct, this use case doesn't matter here. My bad, thanks for clarifying.

@darkmuggle
Contributor

From my vantage point -- I like the idea of a disk, simply because the model is easy to implement. It's:

  • cross platform
  • cross distro

For some distributions, and even some versions of QEMU, the firmware interface (or kernel support for it) might be missing. IMHO, the disk approach generalizes the requirements better.

@lucab
Contributor

lucab commented Feb 29, 2020

@jlebon thanks for summing up the whole topic!

Bunch of random comments:

  • the virtio-serial path is IMHO not nice to port/automate due to the shim requirement.
  • the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.
  • this is not very different from the current virtualbox provider, which however uses a GPT disk. See #629 (providers/virtualbox: investigate using GuestProperties) for more.
  • I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

@bgilbert
Contributor

bgilbert commented Mar 2, 2020

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

@berrange

berrange commented Mar 2, 2020

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

Perhaps you can rely on the timeout for the general, fully cross-arch portable case, but have a side channel to let you optimize specific cases: e.g. a kernel command-line arg can tell you it definitely exists, and thus should be waited for with no timeout, and/or on x86 & aarch64 an SMBIOS field can tell you it definitely exists. This is essentially the approach that cloud-init takes: it looks at some well-known defaults, but a kernel command-line arg can give it an explicit place to look.

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

To identify which disk to use, you'll need some identifier, which comes down to a choice between disk serial string, or filesystem uuid or filesystem label. If you want raw JSON on disk, then disk serial is your required identifier.

@crawford
Contributor

crawford commented Mar 5, 2020

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

@Prashanth684
Contributor

Prashanth684 commented Mar 5, 2020

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

I did have a chat with some QEMU folks in IBM and it looks like they have to refactor certain drivers and define a new transport to achieve this. I have captured this in a BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1810678

@berrange

berrange commented Mar 5, 2020

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

QEMU provides multiple host-guest bridges for passing data back and forth. There is virtio-serial and virtio-vsock. The vsock device actually comes from VMware; QEMU merely added a virtio transport for VMware's existing device type here. The problem with any device like this is that it requires waiting for the guest OS to probe the hardware, initialize the device, and expose it to userspace. Any other device QEMU might implement is going to have similar issues in the guest OS with device initialization.

If ignition isn't able to wait for these device init actions to take place, then the only option left is to rely on well defined memory regions initialized by the firmware such as fw_cfg and SMBIOS, none of which have 100% platform portability.

@jlebon
Member Author

jlebon commented Mar 5, 2020

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Can you expand on this? Do you mean automatically figuring out which platform we're running on?

But again, note that this change isn't making the Ignition config required. It's only making the medium on which to check for a config required. I know this seems like a silly distinction, but I think it's important.

Have we ruled out "fixing" QEMU? Most commercial hypervisors provide a host-guest bridge for passing data back and forth.

There are multiple bridges available. However, the subset of those which aren't a pain to use is much smaller. (See some of the criteria in #928 (comment)).

@crawford
Contributor

crawford commented Mar 5, 2020

The problem with any type of device like this is that it will require waiting for the guest OS to probe and hardware, initialize the device and expose it to userspace.

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time. This is quite different from a CDROM, for example, since some platforms (and maybe users) dynamically add and remove media. In that case, we just don't know how long we need to wait.

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Can you expand on this? Do you mean automatically figuring out which platform we're running on?

The use case here is one where a customer is using an image without an Ignition config, with the intent of discovering the hardware (e.g. the name of the network interfaces, disks, etc.).

But again, note that this change isn't making the Ignition config required. It's only making the medium on which to check for a config required. I know this seems like a silly distinction, but I think it's important.

Not silly at all. In fact, I'm not sure I fully understand the distinction. Are you saying that we would potentially require the media to be plugged, even if it doesn't contain a config (so that we may be able to confidently say "there is no config now and there will never be one")? If so, I think this is a convenient model, as long as the UX is reasonable. If I have to go out of my way to provide empty media to a machine, that's not much easier than providing a no-op config.

@jlebon
Member Author

jlebon commented Mar 5, 2020

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time.

And how does one know when the kernel has finished iterating over the devices once? :) udevadm settle, for example, will just wait until queued events are handled. That's AFAIK totally separate from whether e.g. a kernel driver is just being slow to probe a device.

The use case here is one where a customer is using an image without an Ignition config, with the intent of discovering the hardware (e.g. the name of the network interfaces, disks, etc.).

Ahh OK. Yeah, this issue doesn't change that for bare metal machines. It's only scoped to QEMU machines (though even on bare metal, now that we don't have coreos.autologin, how do we expect this workflow to work without providing an Ignition config?).

Are you saying that we would potentially require the media to be plugged, even if it doesn't contain a config (so that we may be able to confidently say "there is no config now and there will never be one")? If so, I think this is a convenient model, as long as the UX is reasonable.

Yup, exactly!

If I have to go out of my way to provide empty media to a machine, that's not much easier than providing a no-op config.

Right, that's a tricky part. With a CD-ROM for example, there's ENOMEDIUM. For a block device, I think the closest is just using /dev/null as the source. (But yes, you'd still have to type e.g. --disk path=/dev/null,format=raw,readonly=on,serial=ignition).

But again, note that on QEMU at least, there is no other way to provision machines anyway. So the use case for not providing an Ignition config is pretty small.
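One way the "empty media means no config" convention could be sketched (hypothetical; it assumes a zero-length or all-NUL device, such as /dev/null-backed media, should be treated as "no config provided"):

```shell
#!/bin/sh
# Hypothetical: treat a zero-length or all-NUL config device as "no config
# provided", so /dev/null-backed media cleanly means "nothing to do".
has_config() {
  [ -n "$(tr -d '\0' < "$1")" ]
}
```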

@jlebon
Member Author

jlebon commented Mar 5, 2020

@jlebon Did you have a chance to look at PCI devices at all?

I looked a bit, but didn't turn up much. So you're looking for a generic PCI device with e.g. an Ignition-specific device ID which somehow adds some I/O interface through which to read the Ignition config?

the "always wait for disk" doesn't sound like a bad idea. Though, we still need to cap and time-out in order to fail the service-unit and proceed to the emergency target.

The timeout is what we're trying to avoid, though. In principle, probing all disks in the system can take an arbitrary amount of time and no useful timeout value is safe.

The key difference though is that the timeout we're trying to avoid is about "no config provided vs config provided". It could still be worth having a "something is obviously wrong" timeout with a much longer value than would be reasonable to expect from how long it'd take a QEMU bootup to find devices. This gives the chance for sysadmins to probe around and troubleshoot things. Is 5m unreasonable?
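Such a "something is obviously wrong" guard could be sketched like this (hypothetical shell illustration; the 300-second default is just the 5m figure floated above, and the real implementation would more likely live in a systemd unit timeout):

```shell
#!/bin/sh
# Hypothetical guard: wait for the config device, but give up after a
# generous timeout (default 300s) so an admin can reach the emergency
# shell instead of the boot hanging forever.
wait_for_device() {
  dev=$1; timeout=${2:-300}; elapsed=0
  until [ -e "$dev" ]; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}
```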

I'm unsure if the "raw JSON on disk" is going to confuse other things in the guest when scanning for disks/partitions

I like the GPT approach of the VirtualBox provider for that reason, but it does require additional host-side scripting to implement.

Yeah, I agree it's not the most elegant thing. For FCOS, this shows up in dmesg:

[    2.709843] Dev vdb: unable to read RDB block 1
[    2.710525]  vdb: unable to read partition table
[    2.711467] vdb: partition table beyond EOD, truncated

But otherwise, it works just fine. Overall, feels like it's still better than shipping a cross-platform tool that users have to run to create a config ISO?

@berrange

berrange commented Mar 6, 2020

@jlebon Did you have a chance to look at PCI devices at all?

I looked a bit, but didn't turn up much. So you're looking for a generic PCI device with e.g. an Ignition-specific device ID which somehow adds some I/O interface through which to read the Ignition config?

If by "I/O interface" you mean something RPC-like, then virtio-serial or virtio-vsock fit, but they require a daemon on the host to actually provide the data which is a bit tedious if you're just providing a single static data file.

A hard disk is appealing because it trivially exposes a raw data file.

Or a type of PCI device that exposes a memory region - possibly virtio-pmem, though I'm not sure how you'd identify the memory region from userspace.

A further idea that's been suggested is to abuse the PCI option ROM feature, essentially provide a PCI device that implements no functionality, but use the option ROM to expose a static data blob. It is not clear if this works on s390x, but it would work on x86, ppc & arm with PCI.

But these devices aren't generally hot plugged, right? We just need to wait for the kernel and udev to iterate over the devices at least one time.

And how does one know if the kernel finished iterating over the devices once? :) udevadm --settle for example will just wait until events are handled. That's AFAIK totally separate from whether e.g. a kernel driver is just being slow to probe a device.

There are two different waiting stages: the PCI probing, and the userspace device setup. I think you can reasonably assume the kernel has probed all cold-booted PCI devices by the time the initrd is executing. What takes longer is then setting up the logical devices associated with them in userspace, such as the /dev/sd* disk nodes, the network interfaces, or the virtio-serial channel devices.

If Ignition can look at the PCI devices present and determine from that alone whether the right device exists, then it will know whether it is OK to wait indefinitely for the disk nodes or network interfaces to appear. This would require some unique identifier for the raw PCI device; for example, the PCI subsystem product ID could possibly be (ab)used for this purpose.

@jlebon
Member Author

jlebon commented Mar 6, 2020

A further idea that's been suggested is to abuse the PCI option ROM feature, essentially provide a PCI device that implements no functionality, but use the option ROM to expose a static data blob.
...
This would require some unique identifier for the raw PCI device. For example the PCI subsystem product ID could possibly be (ab)used for this purpose

I think I'm missing something fundamental here: how does one provide a custom PCI device? Doesn't that require rebuilding QEMU? Or is there some sort of drop-in thing it supports?

A custom QEMU device sounds like a cool idea, but if it requires shipping additional blobs that users have to download and install just to run FCOS, I don't think it's worth it.

There's two different waiting stages - the PCI probing, and the userspace device setup. I think you can reasonably assume the kernel has probed all cold booted PCI devices by the time the initrd is executing. What takes longer is then setting up the logical devices associated with them in userspace, such as the /dev/sd* disk nodes, or the network interfaces, of the virtio-serial channels devices.

Thanks, that's useful information. When you say "userspace device setup", are you referring strictly to udev, or is there also a delay for things to show up in sysfs?

jlebon added a commit to jlebon/ignition that referenced this issue Mar 6, 2020
Add experimental support for fetching Ignition configs via a virtio
block device with serial/ID `ignition`.

The main advantage of this is that it is cross-platform. But for now, we
only use it on platforms which don't support the QEMU firmware config
approach, which are (of those we care about) s390x and ppc64le. We may
end up using it across the remaining platforms.

See related discussions in:
coreos#928
@jlebon
Member Author

jlebon commented Mar 6, 2020

So here's a suggestion: we're currently in dire need of getting something to work for better OCP CI coverage on s390x. It doesn't have to be stabilized and we don't expect production use of whatever mechanism we choose.

Thus, I think this is a good opportunity to try out this approach to see how it looks and get feedback from it. I've updated #905 for this. I've tested it successfully on amd64 (by tweaking the conditional build flags).

@berrange

berrange commented Mar 6, 2020

I think I'm missing something fundamental here: how does one provide a custom PCI device? Doesn't that require rebuilding QEMU? Or is there some sort of drop-in thing it supports?

A custom QEMU device sounds like a cool idea, but if it requires shipping additional blobs that users have to download and install just to run FCOS, I don't think it's worth it.

Yes, this would be brand new code that would have to be written for QEMU, so it wouldn't work for any existing deployments of QEMU. I can understand if that is likely to be a practical roadblock.

Thanks, that's useful information. When you say "userspace device setup", are you referring strictly to udev, or is there also a delay for things to show up in sysfs?

Mostly I think it'll be udev-related, but possibly kernel-related. E.g. the kernel probes PCI devices, and they get attached to kernel drivers. Those kernel drivers then have to probe and initialize the devices. E.g. a virtio-blk PCI device needs to get the virtio-blk block driver attached, which will then create the /sys/block/NNN entry, which then emits events for udev. So it's actually three stages, really.

@jlebon
Member Author

jlebon commented Mar 6, 2020

Mostly I think it'll be udev-related, but possibly kernel-related. E.g. the kernel probes PCI devices, and they get attached to kernel drivers. Those kernel drivers then have to probe and initialize the devices. E.g. a virtio-blk PCI device needs to get the virtio-blk block driver attached, which will then create the /sys/block/NNN entry, which then emits events for udev. So it's actually three stages, really.

Thanks, I read some more driver code and that's my understanding as well. So IOW, even if we assume that the kernel has finished registering all the PCI devices by the time we run: (1) the drivers might not be loaded yet, (2) the drivers might not have been assigned to their PCI devices yet (though reading the code for driver_register, it seems like this happens synchronously with (1)), or (3) the drivers might not have finished probing the devices yet.

Prashanth684 added a commit to Prashanth684/cluster-api-provider-libvirt that referenced this issue Mar 30, 2020
…ignition for s390x/ppc64le

Similar to dmacvicar/terraform-provider-libvirt#718

The method of mimicking what Openstack does for injecting ignition config works for
images which have the provider as Openstack because ignition recognizes the platform
and knows it has to get the ignition config from the config drive. For QEMU images, ignition
supports getting the config from the firmware config device which is not supported by ppc64
and s390x.

The workaround we have used thus far is to use the Openstack image on the QEMU platform, but have this provider create the ISO containing the Ignition config. There was a discussion in Ignition
(coreos/ignition#928) about a more QEMU-native method of injecting the Ignition config, and it was decided to use a virtio-blk device with a serial of ignition, which Ignition can recognize. This was mainly because with external devices it is hard to tell whether there is an issue with the device, or whether the kernel simply has not detected it yet due to a long discovery phase.

This PR replaces the method of Ignition injection: from a config drive disk to a virtio-blk device specified through the QEMU command-line options.

Reference PR which supports ignition fetching through virtio-blk for QEMU: coreos/ignition#936
@jlebon
Member Author

jlebon commented Jun 8, 2020

So the new Ignition block device approach is now being used with success in RHCOS.

I'm thinking we should take the next step here towards having it used across all arches and eventually deprecating fw_cfg. Here's what I have in mind:

  1. We change Ignition on fw_cfg arches to first look for fw_cfg, but allow ENOENT and automatically fall back to the block device.
  2. Change all our tools/CI processes we control to leverage the block device approach. This further validates the approach.
  3. Change all documentation to mention the block device only, and drop mention of fw_cfg.
  4. After some time, mark fw_cfg as deprecated, emitting a warning (and maybe e.g. sleep(10)) when it is detected.
  5. After some more time, we drop fw_cfg support entirely.
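Step 1 above could be sketched as follows (a hypothetical shell illustration of the fallback order only; the real implementation is in Ignition's Go fetch logic, and both paths here are stand-ins):

```shell
#!/bin/sh
# Hypothetical sketch of step 1: prefer the fw_cfg entry, and if it's
# absent (ENOENT), fall back to the virtio-blk config device.
fetch_config() {
  fw_cfg=$1; blockdev=$2
  if [ -e "$fw_cfg" ]; then
    cat "$fw_cfg"
  else
    tr -d '\0' < "$blockdev"  # strip raw-device NUL padding
  fi
}
```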

jlebon added a commit to jlebon/ignition that referenced this issue Jun 11, 2020
We've had some good success with the new block device approach on s390x
and ppc64le. Let's now make it available on all arches. This is the
first step towards eventually deprecating `fw_cfg`:

coreos#928 (comment)

Early adopters will benefit from simplified code which works across all
architectures, as well as better compatibility with other virtualization
tooling like libvirt, which today doesn't expose `fw_cfg` easily (though
there's work in flight for that).
@jlebon
Member Author

jlebon commented Jun 11, 2020

  1. We change Ignition on fw_cfg architectures to look for fw_cfg first, but allow ENOENT and automatically fall back to the block device.

#999.

@cgwalters
Member

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Isn't that addressed by us providing a Live ISO which explicitly does not require a config and even automatically falls open to an auto-login shell?

We're shipping that today for OCP/RHCOS for this exact use case of hardware discovery.

That said, in practice these hardware discovery flows actually do want to be automated and so the assisted flow injects Ignition into the ISO.

But the point remains that the ISO serves this "no config" use case.

@bgilbert
Contributor

bgilbert commented May 5, 2022

I don't like the idea of always requiring an Ignition config because that immediately rules out using an OS with Ignition for hardware discovery (which is a very important aspect of bare metal provisioning).

Isn't that addressed by us providing a Live ISO which explicitly does not require a config and even automatically falls open to an auto-login shell?

No, because that flow is specific to FCOS/RHCOS. Other distros that use Ignition may have a different usage model.

@cgwalters
Member

FCOS does not have an exclusive patent on creating ISOs 😄 Anyone can do it, so why wouldn't we ("we" in the Ignition upstream sense) recommend doing that type of thing? Or really, the generalization of this is a separate image (of any form, but since this is about bare metal, an ISO makes the most sense).

@bgilbert
Contributor

bgilbert commented May 5, 2022

Ignition is useful for a wider range of use cases than just a traditional server distro where a live image makes sense. E.g. embedded cases will have different constraints. (And as a practical matter, Ignition is already waaaay too difficult to integrate into a distro, and "you really should have a live ISO" is a pretty large ask. It took an enormous amount of work to build our live implementation.)

We should be careful not to view Ignition too parochially through the lens of CoreOS's needs. Ignition has always allowed users to boot without an Ignition config on every platform, it's a useful feature, and changing it shouldn't be done lightly.

@cgwalters
Member

Fair. Though, what about supporting a file like /etc/ignition-block-on-config (could be JSON/YAML/keyfile/whatever config file) that can be injected into the initramfs, defaults to no, but can be toggled to e.g. timeout: 1m or race? Then anyone making a Linux system that uses this and wants to have one image (AMI/qcow2/ISO/whatever) that blocks on config and one that doesn't can just drop a config file into the initramfs for one of the images.

Another approach: support an easily editable, outside-the-disk mechanism to inject config beyond just ISOs. Maybe something like a GPT flag or so - something that can be done completely unprivileged using e.g. sfdisk on a raw disk image, without actually spawning something like libguestfs. Much like coreos-installer iso ignition. Or even not the config itself, but a single bit flag for whether or not to fetch a config via virtio-blk, defaulting to on or off per distro. Such a mechanism would make it easy to ship a single .qcow2 and flip off the Ignition requirement.

(Though to make this actually useful, the OS/distro does need to detect this at a higher level and e.g. do autologin on the tty beyond what Ignition does, or I guess Ignition could auto-include a distro config to do this when the flag is detected.)

@cgwalters
Member

OK, hopefully everyone agrees that having qemu semantics be architecture-specific is a problem. This issue came up again because a test started failing only on s390x. We caught it in the development stream, but still: as of now it would block shipping OS updates on the affected platforms but not x86_64, which is obviously a problem. We should strive to minimize unnecessary architecture specifics.

So there are two paths: support something like fw_cfg on all platforms which was discussed, or support making qemu Ignition mandatory on all platforms by default.

Now, I will strive to keep upstream Ignition conceptually separate from CoreOS here... but I think most other users of Ignition are basically using it in very similar ways anyway.

I will admit that I have in the past provided no Ignition to e.g. a FCOS AWS instance, and relied on afterburn ssh key to get a shell.

But outside of cases that have some other cloud-specific OOB mechanism to configure some sort of basic login, or special images like the ISO that define a useful default Ignition... I can't think of use cases for "no Ignition" - specifically for the qemu image we ship in FCOS and derivatives.

So short term, I'd advocate for us changing qemu across all platforms to require Ignition (and try virtio first, falling back to fw_cfg on x86_64).

Ultimately I think qemu is a weird special case. Don't get me wrong, I am of two minds on this 😄 I think what we've done CoreOS side in heavily investing in having "basic sanity checking" using qemu makes a lot of sense. But OTOH - most real "qemu" uses should actually be using a real hypervisor wrapping it (whether libvirt or openstack, etc.). This brings a lot more power and capability. And OpenStack specifically then has the metadata service. It's actually doable to run a metadata service in libvirt even.

In OpenShift development there is some support for the CoreOS qemu model, but in practice it has bitrotted, and most people doing virtualized work are trying to emulate bare metal installation, so basically everyone doing real work like that has switched to using libvirt for other reasons (much more control over networking, can simulate real bare metal ISOs, etc.)

Long term, I think it would be really nice still to somehow support "race free optional metadata" for qemu. It's quite surprising it's so hard. I think the most viable path is making it easy to inject metadata into a qcow2, and drive that support into e.g. qemu-img, much like what we have with coreos-installer iso ignition embed.

@cgwalters
Member

One thing that came up today is that systemd grew code to read the fw_cfg too: https://github.com/systemd/systemd/blob/fff1edc9f9069ff2ef58a714355b61002e30f305/docs/CREDENTIALS.md?plain=1#L53

So there's potentially a stronger argument to generalize support for fw_cfg or equivalent in qemu across architectures as there are multiple readers.
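For reference, systemd's CREDENTIALS.md documents passing credentials over fw_cfg under the `opt/io.systemd.credential.*` namespace. A sketch of constructing such an option (the credential name and value here are placeholders):

```shell
# Sketch: build the -fw_cfg option systemd documents for injecting a
# credential into the guest. "mycred"/"secret" are placeholder values.
cred_name=mycred
cred_value=secret
fw_cfg_opt="name=opt/io.systemd.credential.${cred_name},string=${cred_value}"
echo "qemu-system-x86_64 ... -fw_cfg ${fw_cfg_opt}"
```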

@berrange

One thing that came up today is that systemd grew code to read the fw_cfg too: https://github.com/systemd/systemd/blob/fff1edc9f9069ff2ef58a714355b61002e30f305/docs/CREDENTIALS.md?plain=1#L53

So there's potentially a stronger argument to generalize support for fw_cfg or equivalent in qemu across architectures as there are multiple readers.

On the QEMU side, the view remains that apps should be using OEM strings for user data injection into the OS, not fw_cfg. systemd supports both, but neither is a portable solution across architectures. I thought Ignition had support for using a virtual disk as a portable solution across architectures?

With the introduction of confidential virtualization, this likely gets more complex still, because neither fw_cfg/OEM strings are trustworthy sources of information. They can also only be populated when the QEMU process is first started, and it is undesirable to release potentially sensitive data upfront. It is possible that initial boot customization data is going to need to be fetched early in boot from a remote attestation service, in response to submitting an attestation report of the VM to prove its confidentiality. This might be a direct network connection from the guest to the attestation service, or it could be a connection that is mediated by an opaque proxy in the host, as yet undetermined.

@bgilbert
Contributor

Neither OEM strings nor fw_cfg are supported on every relevant CPU architecture. Either one could work, but we'd need it to be generalized to work everywhere.

There is experimental support for using a virtual disk, but we can't stabilize it as-is. As previously discussed, there's an inherent race condition between storage device probing (which can be arbitrarily slow on a large/heavily loaded system) and the timeout that would be needed to successfully boot if the user doesn't provide a config.

What makes those two information sources untrustworthy? I agree that it's undesirable to include sensitive information in them, but they would be useful for configuring the URL to an attestation service.

@berrange

Neither OEM strings nor fw_cfg are supported on every relevant CPU architecture. Either one could work, but we'd need it to be generalized to work everywhere.

Never say never, but my feeling is it is pretty unlikely either will become supported on every arch.

What makes those two information sources untrustworthy? I agree that it's undesirable to include sensitive information in them, but they would be useful for configuring the URL to an attestation service.

This is quite a complex topic, but I'll try to give an overview.

With confidential virtualization, the host/hypervisor administrator/OS is considered untrusted. Only the physical hardware and its vendor are trusted, and you can prove that via attestation of the VM execution environment. The implication is that any data mediated by the host must also be considered untrusted: i.e. you can ask the hypervisor to expose a given block of data, but it can substitute its own data instead, so as to try to compromise the VM.

This extends to essentially every piece of virtualized hardware in the VM. Each device or data channel needs a defined way to apply encryption before it can be trusted for use. For networking, of course, TLS is already widely used. For disks, LUKS can be used, though many cipher modes are significantly degraded if the attacker can observe repeated I/O to the same sector. In the case of LUKS, we need a secure way to get the keyslot passphrase. The admin can't ever interactively type it on the virtual keyboard, as key presses can be logged by the hypervisor. The approach Azure (and likely KVM) takes is to provide a virtual TPM running inside guest context, against which the LUKS passphrase was previously sealed; the TPM state is unlocked by talking to an attestation server early in boot.

So let's say we did want to keep using fw_cfg/SMBIOS in a confidential VM. The data provided that way would need to be encrypted. Then there would need to be another mechanism to acquire the decryption key, either by directly talking to an attestation server, or perhaps by leveraging a vTPM like we'll probably do for LUKS. Alternatively, the Ignition (or cloud-init or systemd-creds) data can be handled by the attestation service directly, avoiding the use of fw_cfg/SMBIOS entirely.
