
Support mirroring /boot, ESP, BIOS bootloader on first boot #718

Merged: bgilbert merged 7 commits into coreos:testing-devel on Dec 4, 2020

Conversation

bgilbert (Contributor) commented Oct 30, 2020:

Supporting redundant bootable disks for coreos/fedora-coreos-tracker#581. This change supports the following:

  • Moving /boot and /boot/efi to RAID 1 volumes if the Ignition config has a filesystem with a boot or EFI-SYSTEM label and wipe_filesystem: true, similar to how we move the contents of the root filesystem. Because BIOS GRUB is configured to set prefix to the first disk, it must not have MD-RAID support preloaded (it currently does not). The MD-RAID superblocks must be at the end of the component partitions (superblock format 1.0) so that BIOS GRUB and the UEFI firmware can treat /boot and /boot/efi, respectively, as normal filesystems. (See the array-layout sketch after this list.)
  • Copying the BIOS-BOOT partition bits and corresponding boot sector if the Ignition config creates partitions with the requisite type GUIDs. We don't RAID these because they're not modified by the installed system. (bootupd thus needs to support multiple independent disks.)
  • Copying the ppc64le PReP partition bits if the Ignition config creates partitions with the requisite type GUIDs. We likewise don't RAID these.
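
For illustration only, a minimal sketch of the array layout described in the first bullet, with hypothetical partition names (/dev/vda3 and /dev/vdb3); in practice Ignition creates these from the config, nobody runs the commands by hand:

mdadm --create /dev/md/md-boot --level=1 --raid-devices=2 \
      --metadata=1.0 /dev/vda3 /dev/vdb3   # superblock at the END of each member
mkfs.ext4 -L boot /dev/md/md-boot          # each member still reads as a plain "boot" ext4 filesystem to firmware/GRUB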

Design document in coreos/enhancements#3. This functionality can actually be used to completely repartition the boot disk, since we copy everything into RAM before the Ignition disks stage. The only requirement is that BIOS-BOOT starts at the same offset (which we check for). The corresponding FCC sugar is in coreos/butane#162.
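
As a rough illustration of the offset requirement (device and partition numbers here are assumptions), the start sector of the BIOS-BOOT partition can be compared across disks with sgdisk; the MBR boot code embeds the location of GRUB's core image, so the new partition must begin at the same LBA as the old one:

sgdisk --info=1 /dev/vda | grep 'First sector'
sgdisk --info=1 /dev/vdb | grep 'First sector'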

Test with:

qemu-img create -f qcow2 second.qcow2 8G
qemu-system-x86_64 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -m 4096 -accel kvm \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -netdev user,id=eth0,hostfwd=tcp::2222-:22,hostname="fcos" \
    -device virtio-net-pci,netdev=eth0 \
    -fw_cfg name=opt/com.coreos/config,file="$(pwd)/example.ign" \
    -drive if=virtio,file=./fedora*.qcow2 \
    -drive if=virtio,file=./second.qcow2

Drop -bios /usr/share/edk2/ovmf/OVMF_CODE.fd to test in BIOS mode. Drop -drive if=virtio,file=./fedora*.qcow2 to test a failure of the first drive.

Use the following FCC (assumes coreos/butane#162):

variant: fcos
version: 1.3.0-experimental
boot_device:
  mirror:
    devices: [/dev/vda, /dev/vdb]
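
Once the VM comes up, a quick sanity check of the resulting layout might look like this (device and array names will vary):

cat /proc/mdstat
lsblk -o NAME,TYPE,FSTYPE,LABEL,MOUNTPOINT /dev/vda /dev/vdb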

jlebon (Member) commented Oct 30, 2020:

Do you have any concerns about the complexity of the config needed for this? It looks like a lot of things that could be mistyped. :) And even if we have FCC sugar for it, in the end we're still on the hook for maintaining that interface at the Ignition level.

I wonder if we should instead just key off of something simpler in the config. For example, we can check if the config is trying to mirror the rootfs on RAID1. E.g. the logic could be: "if the user wants to make the root partition from the primary boot disk part of a RAID1, then assume that they want full disk RAID1". Because the use case for putting just the rootfs on RAID1 while leaving out the other partitions seems dubious.

Simplifying the interface also means we get more flexibility over how it's done, and we can ensure better consistency of state when thinking about upgrades.

bgilbert (Contributor, Author) commented:

No, not at all. Ignition is designed as a pretty low-level interface, with no magic in the spec and hopefully very little magic in the surrounding glue. All of the inferences made by the glue logic are a small logical leap: if you make a boot filesystem you probably want it copied over; if you make an ESP or BIOS-BOOT partition you probably want it copied. The largest inference is that we should copy the boot sector whenever we're copying the BIOS-BOOT partition, but I think that follows pretty naturally from how BIOS booting works.

We'll sugar this down to a couple lines of FCC for the primary use case, and OCP can declare unsupported anything that doesn't use the official sugar. But a benefit of this approach is that there's no narrowly-scoped magic "peephole optimization", as it were. An FCOS user who wants to do something clever (boot RAID 1 + root RAID 5 or whatever) retains the full ability to do that, since the config just specifies the desired disk layout in the natural way.

bgilbert (Contributor, Author) commented Nov 3, 2020:

Proposal in coreos/enhancements#3.

cgwalters (Member) left a comment:

Skimmed; seems sane.

Maybe in the future we reimplement this in rdcore but seems OK for now.

bgilbert changed the title from "WIP: Support mirroring /boot, ESP, BIOS bootloader on first boot" to "Support mirroring /boot, ESP, BIOS bootloader on first boot" on Nov 25, 2020
bgilbert marked this pull request as ready for review on November 25, 2020 18:23
bgilbert (Contributor, Author) commented:

Ready for review!

jlebon (Member) left a comment:

Some minor comments, but LGTM overall! Did you successfully test this in a 4Kn RAID setup as well?

We'll be generalizing the rootfs save/restore code to support saving and restoring other partitions. Generalize the name to "transposefs" and move the saved rootfs data to /run/ignition-ostree-transposefs/root.

Add function to generate a partial jq query string to find wiped filesystems.

If the Ignition config creates any BIOS Boot partitions, save the existing BIOS-BOOT contents and the corresponding boot sector before ignition-disks (in case the partition is overwritten) and copy them to the new partitions (and corresponding disks) afterward. Also verify that the offset of the new BIOS-BOOT partitions matches the old one, since otherwise GRUB will fail when it tries to use them. We don't require the config to create a BIOS-BOOT RAID array because the OS doesn't use or modify the BIOS-BOOT partition at runtime.

If the Ignition config creates any PowerPC PReP partitions, save the existing PReP contents before ignition-disks (in case the partition is overwritten) and copy them to the new partitions afterward. We don't require the config to create a PReP RAID array because the OS doesn't use or modify the PReP partition at runtime.
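
A loose sketch of the BIOS-BOOT save/restore flow described in the commit messages above; the device names, the 440-byte boot-code size, and the file names under /run/ignition-ostree-transposefs are assumptions, and the real units also validate the partition offset:

# Before ignition-disks: stash the BIOS-BOOT partition and the MBR boot code in RAM
mkdir -p /run/ignition-ostree-transposefs
dd if=/dev/vda1 of=/run/ignition-ostree-transposefs/bios-boot bs=1M
dd if=/dev/vda of=/run/ignition-ostree-transposefs/boot-sector bs=440 count=1

# After ignition-disks: copy both onto every disk that got a new BIOS-BOOT partition
for disk in /dev/vda /dev/vdb; do
    dd if=/run/ignition-ostree-transposefs/bios-boot of="${disk}1" bs=1M
    dd if=/run/ignition-ostree-transposefs/boot-sector of="$disk" bs=440 count=1
done
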
bgilbert (Contributor, Author) commented Dec 4, 2020:

Updated, and tested on 4Kn.

jlebon (Member) left a comment:

🎉

bgilbert merged commit 3a2d52c into coreos:testing-devel on Dec 4, 2020
bgilbert deleted the raid branch on December 4, 2020 18:37
jlebon mentioned this pull request on Dec 4, 2020
cmurf commented Dec 5, 2020:

> The MD-RAID superblocks must be at the end of the component partitions (superblock format 1.0) so that BIOS GRUB and the UEFI firmware can treat /boot and /boot/efi, respectively, as normal filesystems.

This has come up a few times on linux-raid@, and upstream developers have consistently been critical of using mdadm, with any metadata version, for the EFI System Partition. There's no guarantee the firmware itself won't write to the ESP, including via some other EFI program. When that happens, the RAID becomes inconsistent and there's no reliable way to repair it. I think keeping ESPs synchronized should be the responsibility of something like bootupd. An alternative might be firmware RAID, which mdadm also supports.

As for the $BOOT volume, upstream GRUB puts grubenv here. In Fedora it's here on BIOS systems, while UEFI systems put it on the ESP. The grubenv is used in Fedora for the GRUB hidden menu feature. If GRUB knows grubenv is on md RAID (or Btrfs) it will refuse to write to it, to avoid causing an inconsistent state. But if GRUB doesn't know it's md RAID, it'll permit writes to grubenv. That inconsistency probably isn't too bad, because GRUB writes grubenv by overwriting only the two 512-byte blocks making it up, with no filesystem metadata update at all. The bigger concern with metadata 1.0 has always been that it invites inadvertently mounting the member device rather than the array device. Once that happens, again the RAID is broken and it's not reversible or repairable.

> Because BIOS GRUB is configured to set prefix to the first disk, it must not have MD-RAID support preloaded

If GRUB doesn't know $BOOT is an mdadm device, how does fallback work when there's a read error or the device is missing? I'd expect that the point of going to the trouble of making $BOOT RAID 1 is that if there's a problem with a member device, the bootloader can automatically use the other one and still boot the system. That's built into GRUB's mdraid1x.mod.

I think it's better to make the prefix the md device and have GRUB know the true nature of the stack, which is that $BOOT is an array device, and then use mdadm metadata version 1.2, which is both the recommended and the default version, because it prevents inadvertent use of the md member device.
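
For concreteness, reinstalling GRUB with its RAID support available would look roughly like this; a hedged sketch of the suggestion above, not what this PR does. Device names are assumptions, and grub2-install will normally detect an md-backed /boot and point the prefix at the array on its own when the RAID modules are available:

# Re-run grub2-install on each disk with mdraid support preloaded, so core.img
# can assemble the (possibly degraded) array and read /boot via the md device
# rather than via a raw member partition.
for disk in /dev/vda /dev/vdb; do
    grub2-install --target=i386-pc --modules="mdraid1x diskfilter" "$disk"
done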

bgilbert (Contributor, Author) commented Dec 5, 2020:

Thanks for the comments!

> There's no guarantee the firmware itself won't write to the ESP, including via some other EFI program. When that happens, the RAID becomes inconsistent and there's no reliable way to repair it.

An earlier draft proposed to maintain independent ESPs on each disk, but that would make it infeasible to mount "the ESP" inside the OS. Periodic RAID resync would still fix any breakage, no?

> As for the $BOOT volume, upstream GRUB puts grubenv here. In Fedora it's here on BIOS systems, while UEFI systems put it on the ESP. The grubenv is used in Fedora for the GRUB hidden menu feature.

Fedora CoreOS (and RHEL CoreOS) doesn't use that feature. We're reading the grubenv but nothing in our configs writes to it.

> If GRUB doesn't know $BOOT is an mdadm device, how does fallback work when there's a read error or the device is missing?

We always boot from the first disk. If the first disk is missing, the second disk becomes the first disk.

bgilbert (Contributor, Author) commented Dec 5, 2020:

> An alternative might be firmware RAID, which mdadm also supports.

Right, but the firmware might not.

cmurf commented Dec 5, 2020:

> An earlier draft proposed to maintain independent ESPs on each disk, but that would make it infeasible to mount "the ESP" inside the OS.

Yep, I understand. This immediately exposes the unfortunate paradigm of having the ESP persistently mounted. It was always a bad idea, but we did it because we didn't have a smarter way of doing it. On Windows and Mac OS, the ESP is never persistently mounted and is never exposed to the user at all. The thing that "owns" the ESP, for modifications and updates, is responsible for mounting the filesystem, making changes, then unmounting it. So again, I'd say this is the realm of bootupd, and/or maybe fwupd. And we should stop putting the ESP in fstab.

In the fwupd case, it could do some tests: prefer the ESP listed first in the NVRAM boot order, mount that ESP and check whether it has enough free space, and if not, try the other one and set a "boot next" NVRAM entry. Maybe clean-up/garbage collection needs to go in there somewhere too. For bootupd it might be simpler: keep an ESP file list on sysroot and use that as the source of authority, so the ESPs are just clones of it and should always be identical (at least their directory on the ESP should be identical; there may also be a question about the BLS directory... which I'll ignore for now).
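
A very rough sketch of that bootupd-style cloning, with made-up paths and device names purely for illustration (nothing here is an actual bootupd interface):

src=/usr/lib/bootupd/esp-content           # hypothetical source-of-truth tree on sysroot
for esp in /dev/vda2 /dev/vdb2; do         # hypothetical ESP partitions
    mnt=$(mktemp -d /run/esp.XXXXXX)       # mount briefly under /run instead of via fstab
    mount "$esp" "$mnt"
    rsync -a --delete "$src/" "$mnt/EFI/"  # make this ESP an exact clone of the list
    umount "$mnt"
    rmdir "$mnt"
done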

> Periodic RAID resync would still fix any breakage, no?

Nope. It's ambiguous which drive is correct; there are no checksums. In the degraded-array case (a legitimate degraded assembly, still mounting the md array device), the event count in the mdadm superblock of the active member device is updated, so it's determinable how to scrub and "catch up" the device with the lower event count. But in this example, where writes happen outside the mdadm infrastructure, the mdadm superblocks are still identical. A scrub repair will make things worse; it can actually break both of the ESPs because, without checksums or versions, it just picks a block that's assumed to be correct and overwrites the mismatching block, and it's not consistent about how it does this.
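
For reference, this is the scrub mechanism in question; assuming the array shows up as /dev/md127, a check or repair is driven through sysfs, and for RAID 1 there is no checksum to say which member's copy is the correct one:

echo check > /sys/block/md127/md/sync_action    # count mismatched blocks, change nothing
cat /sys/block/md127/md/mismatch_cnt            # read after the check completes
echo repair > /sys/block/md127/md/sync_action   # rewrite mismatching blocks from one member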

> Fedora CoreOS (and RHEL CoreOS) doesn't use that feature. We're reading the grubenv but nothing in our configs writes to it.

OK, it may not be an issue for CoreOS now. But there's a pre-proposal for Silverblue to rebase on CoreOS, and right now Silverblue does use this feature. There are any number of reasons why the grubenv design is suboptimal, so really the solution is to fix that, but ... resources.

> We always boot from the first disk. If the first disk is missing, the second disk becomes the first disk.

The fallback is built into GRUB's mdraid support. It handles degraded operations.

I'm not sure what "second disk becomes the first" looks like in grub.cfg. But even if it's possible to script the fallback, I'm still skeptical about the handling of a failed drive that isn't missing but instead spews zeros or garbage, a failure mode pretty common with SSDs.

bgilbert (Contributor, Author) commented:

Okay, thanks for pursuing this. I'm seeing three issues:

  1. Firmware might desynchronize a RAIDed ESP. My reading of drivers/md/raid1.c is that a resync will always copy from the first drive to the others, but any reads that happen before the resync might come from any drive, and of course any writes based on those reads are suspect. We could initiate a resync on every boot but I think it'd be hard to demonstrate that no suspect reads could slip in.

    I agree we should switch back to the independent-ESP model. #794 ("40ignition-ostree: copy ESP contents as independent filesystems") implements the OS half, and coreos/butane#178 ("config/fcos/v1_[34]: un-RAID ESP") implements the FCCT half.

  2. GRUB doesn't know that /boot is a RAID. The design assumed this was okay because any failed drive would be ignored by the firmware, but that's not necessarily the case; the firmware will boot from the first readable drive, but may still enumerate the bad drive and fail any I/O to it. In this case it's useful for GRUB to retry with another replica.

    At present, we can fix this fully on UEFI. We can fix it partially on BIOS, but a complete fix would require rerunning grub2-install, which we're not equipped to do yet (coreos/bootupd#53, "implement BIOS (grub)"). In turn, that means we can't yet switch to an MD 1.2 superblock. coreos/coreos-assembler#1979 ("grub: read from md/md-boot if it exists") switches to GRUB RAID support as much as possible, and coreos/fedora-coreos-tracker#702 ("Reinstall BIOS GRUB during first boot if /boot RAID is enabled") proposes reinstalling BIOS GRUB using bootupd.

  3. grubenv writes will desynchronize /boot (before the changes in point 2) or will fail (after the changes). We have a consensus not to use grubenv for FCOS and RHCOS, and Silverblue is out of scope at present, so it appears there's nothing to be done here for now.

cmurf commented Dec 21, 2020:

> a resync will always copy from the first drive to the others

That's less bad than I thought, but I agree ambiguity and risk remain.

This Workstation WG issue starts out being about grubenv and /boot on Btrfs, but eventually comes around to drawing on bootupd as a possible way to decouple /boot and /boot/efi from RPM, i.e. a single source of truth on sysroot for "how to boot the system", and likewise to simplify boot and make it more reliable. My pipe dream is that the ESP, BIOS Boot, and boot partitions become the sort of plumbing that isn't user-facing at all, whether at install time or at repair/replacement time.
https://pagure.io/fedora-workstation/issue/206#comment-706695

cmurf commented Dec 25, 2020:

Pipe dream cont'd ...
If /etc/fstab is not populated, I guess systemd's gpt-auto-generator can kick in and mount the ESP to /boot or /efi - so I wonder how to deconflict between gpt-auto-generator, bootupd, and fwupd?

In the simple single-disk case, only bootupd/fwupd touching either /boot or /efi should cause the ESP to get mounted. Anything else should probably get an SELinux denial.

But in the dual ESP case, systemd ignores extra ESPs. It only automounts the ESP on the drive that the bootloader says it used for booting. So now what? (a) teach systemd about multiple ESPs? (b) inhibit systemd ESP automount, and have bootupd/fwupd deal with all of it? (e.g. mount each of them in turn to some location in /run, do what needs to be done, then umount them).

bgilbert (Contributor, Author) commented:

We don't ship the gpt-auto-generator. The plan is to pursue option (b); see for example coreos/bootupd#127.

c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this pull request Mar 27, 2023
ci: use the RHEL 8.5 repos on the mirror