
Support mirroring /boot, ESP, BIOS bootloader on first boot #718

Merged: bgilbert merged 7 commits into coreos:testing-devel on Dec 4, 2020

Conversation

bgilbert (Contributor) commented Oct 30, 2020:

Supporting redundant bootable disks for coreos/fedora-coreos-tracker#581. This change supports the following:

  • Moving /boot and /boot/efi to RAID 1 volumes if the Ignition config has a filesystem with a boot or EFI-SYSTEM label and wipe_filesystem: true, similar to how we move the contents of the root filesystem. Because BIOS GRUB is configured to set prefix to the first disk, it must not have MD-RAID support preloaded (it currently does not). The MD-RAID superblocks must be at the end of the component partitions (superblock format 1.0) so that BIOS GRUB and the UEFI firmware can treat /boot and /boot/efi, respectively, as normal filesystems. (See the array-layout sketch after this list.)
  • Copying the BIOS-BOOT partition bits and corresponding boot sector if the Ignition config creates partitions with the requisite type GUIDs. We don't RAID these because they're not modified by the installed system. (bootupd thus needs to support multiple independent disks.)
  • Copying the ppc64le PReP partition bits if the Ignition config creates partitions with the requisite type GUIDs. We likewise don't RAID these.
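
For illustration only, a minimal sketch of the array layout described in the first bullet, with hypothetical partition names (/dev/vda3 and /dev/vdb3); in practice Ignition creates these from the config, nobody runs the commands by hand:

mdadm --create /dev/md/md-boot --level=1 --raid-devices=2 \
      --metadata=1.0 /dev/vda3 /dev/vdb3   # superblock at the END of each member
mkfs.ext4 -L boot /dev/md/md-boot          # each member still reads as a plain "boot" ext4 filesystem to firmware/GRUB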

Design document in coreos/enhancements#3. This functionality can actually be used to completely repartition the boot disk, since we copy everything into RAM before the Ignition disks stage. The only requirement is that BIOS-BOOT starts at the same offset (which we check for). The corresponding FCC sugar is in coreos/butane#162.
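
As a rough illustration of the offset requirement (device and partition numbers here are assumptions), the start sector of the BIOS-BOOT partition can be compared across disks with sgdisk; the MBR boot code embeds the location of GRUB's core image, so the new partition must begin at the same LBA as the old one:

sgdisk --info=1 /dev/vda | grep 'First sector'
sgdisk --info=1 /dev/vdb | grep 'First sector'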

Test with:

qemu-img create -f qcow2 second.qcow2 8G
qemu-system-x86_64 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -m 4096 -accel kvm \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -netdev user,id=eth0,hostfwd=tcp::2222-:22,hostname="fcos" \
    -device virtio-net-pci,netdev=eth0 \
    -fw_cfg name=opt/com.coreos/config,file="$(pwd)/example.ign" \
    -drive if=virtio,file=./fedora*.qcow2 \
    -drive if=virtio,file=./second.qcow2

Drop -bios /usr/share/edk2/ovmf/OVMF_CODE.fd to test in BIOS mode. Drop -drive if=virtio,file=./fedora*.qcow2 to test a failure of the first drive.

Use the following FCC (assumes coreos/butane#162):

variant: fcos
version: 1.3.0-experimental
boot_device:
  mirror:
    devices: [/dev/vda, /dev/vdb]
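
Once the VM comes up, a quick sanity check of the resulting layout might look like this (device and array names will vary):

cat /proc/mdstat
lsblk -o NAME,TYPE,FSTYPE,LABEL,MOUNTPOINT /dev/vda /dev/vdb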

jlebon (Member) commented Oct 30, 2020:

Do you have any concerns about the complexity of the config needed for this? It looks like a lot of things that could be mistyped. :) And even if we have FCC sugar for it, in the end we're still on the hook for maintaining that interface at the Ignition level.

I wonder if we should instead just key off of something simpler in the config. For example, we can check if the config is trying to mirror the rootfs on RAID1. E.g. the logic could be: "if the user wants to make the root partition from the primary boot disk part of a RAID1, then assume that they want full disk RAID1". Because the use case for putting just the rootfs on RAID1 while leaving out the other partitions seems dubious.

Simplifying the interface also means we get more flexibility over how it's done, and we can ensure better consistency of state when thinking about upgrades.

bgilbert (Contributor, Author) commented:

No, not at all. Ignition is designed as a pretty low-level interface, with no magic in the spec and hopefully very little magic in the surrounding glue. All of the inferences made by the glue logic are a small logical leap: if you make a boot filesystem you probably want it copied over; if you make an ESP or BIOS-BOOT partition you probably want it copied. The largest inference is that we should copy the boot sector whenever we're copying the BIOS-BOOT partition, but I think that follows pretty naturally from how BIOS booting works.

We'll sugar this down to a couple lines of FCC for the primary use case, and OCP can declare unsupported anything that doesn't use the official sugar. But a benefit of this approach is that there's no narrowly-scoped magic "peephole optimization", as it were. An FCOS user who wants to do something clever (boot RAID 1 + root RAID 5 or whatever) retains the full ability to do that, since the config just specifies the desired disk layout in the natural way.

bgilbert (Contributor, Author) commented Nov 3, 2020:

Proposal in coreos/enhancements#3.

cgwalters (Member) left a comment:

Skimmed; seems sane.

Maybe in the future we reimplement this in rdcore but seems OK for now.

bgilbert changed the title from "WIP: Support mirroring /boot, ESP, BIOS bootloader on first boot" to "Support mirroring /boot, ESP, BIOS bootloader on first boot" on Nov 25, 2020
bgilbert marked this pull request as ready for review on November 25, 2020 18:23
bgilbert (Contributor, Author) commented:

Ready for review!

jlebon (Member) left a comment:

Some minor comments, but LGTM overall! Did you successfully test this in a 4Kn RAID setup as well?

We'll be generalizing the rootfs save/restore code to support saving and restoring other partitions. Generalize the name to "transposefs" and move the saved rootfs data to /run/ignition-ostree-transposefs/root.

Add function to generate a partial jq query string to find wiped filesystems.

If the Ignition config creates any BIOS Boot partitions, save the existing BIOS-BOOT contents and the corresponding boot sector before ignition-disks (in case the partition is overwritten) and copy them to the new partitions (and corresponding disks) afterward. Also verify that the offset of the new BIOS-BOOT partitions matches the old one, since otherwise GRUB will fail when it tries to use them. We don't require the config to create a BIOS-BOOT RAID array because the OS doesn't use or modify the BIOS-BOOT partition at runtime.

If the Ignition config creates any PowerPC PReP partitions, save the existing PReP contents before ignition-disks (in case the partition is overwritten) and copy them to the new partitions afterward. We don't require the config to create a PReP RAID array because the OS doesn't use or modify the PReP partition at runtime.
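
A loose sketch of the BIOS-BOOT save/restore flow described in the commit messages above; the device names, the 440-byte boot-code size, and the file names under /run/ignition-ostree-transposefs are assumptions, and the real units also validate the partition offset:

# Before ignition-disks: stash the BIOS-BOOT partition and the MBR boot code in RAM
mkdir -p /run/ignition-ostree-transposefs
dd if=/dev/vda1 of=/run/ignition-ostree-transposefs/bios-boot bs=1M
dd if=/dev/vda of=/run/ignition-ostree-transposefs/boot-sector bs=440 count=1

# After ignition-disks: copy both onto every disk that got a new BIOS-BOOT partition
for disk in /dev/vda /dev/vdb; do
    dd if=/run/ignition-ostree-transposefs/bios-boot of="${disk}1" bs=1M
    dd if=/run/ignition-ostree-transposefs/boot-sector of="$disk" bs=440 count=1
done
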
bgilbert (Contributor, Author) commented Dec 4, 2020:

Updated, and tested on 4Kn.

jlebon (Member) left a comment:

🎉

bgilbert merged commit 3a2d52c into coreos:testing-devel on Dec 4, 2020
bgilbert deleted the raid branch on December 4, 2020 18:37
jlebon mentioned this pull request on Dec 4, 2020
cmurf commented Dec 5, 2020:

> The MD-RAID superblocks must be at the end of the component partitions (superblock format 1.0) so that BIOS GRUB and the UEFI firmware can treat /boot and /boot/efi, respectively, as normal filesystems.

This has come up a few times on linux-raid@, and upstream developers have consistently been critical of using mdadm, with any metadata version, for the EFI System Partition. There's no guarantee the firmware itself won't write to the ESP, including via some other EFI program. When that happens, the RAID becomes inconsistent and there's no reliable way to repair it. I think keeping ESPs synchronized should be the responsibility of something like bootupd. An alternative might be firmware RAID, which mdadm also supports.

As for the $BOOT volume, upstream GRUB puts grubenv here. In Fedora it's here on BIOS systems, while UEFI systems put it on the ESP. The grubenv is used in Fedora for the GRUB hidden menu feature. If GRUB knows grubenv is on md RAID (or Btrfs) it will refuse to write to it, to avoid causing an inconsistent state. But if GRUB doesn't know it's md RAID, it'll permit writes to grubenv. That inconsistency probably isn't too bad, because GRUB writes grubenv by overwriting only the two 512-byte blocks making it up, with no filesystem metadata update at all. The bigger concern with metadata 1.0 has always been that it invites inadvertently mounting the member device rather than the array device. Once that happens, again the RAID is broken and it's not reversible or repairable.

> Because BIOS GRUB is configured to set prefix to the first disk, it must not have MD-RAID support preloaded

If GRUB doesn't know $BOOT is an mdadm device, how does fallback work when there's a read error or the device is missing? I'd expect that the point of going to the trouble of making $BOOT RAID 1 is that if there's a problem with a member device, the bootloader can automatically use the other one and still boot the system. That's built into GRUB's mdraid1x.mod.

I think it's better to make the prefix the md device and have GRUB know the true nature of the stack, which is that $BOOT is an array device, and then use mdadm metadata version 1.2, which is both the recommended and the default version, because it prevents inadvertent use of the md member device.
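
For concreteness, reinstalling GRUB with its RAID support available would look roughly like this; a hedged sketch of the suggestion above, not what this PR does. Device names are assumptions, and grub2-install will normally detect an md-backed /boot and point the prefix at the array on its own when the RAID modules are available:

# Re-run grub2-install on each disk with mdraid support preloaded, so core.img
# can assemble the (possibly degraded) array and read /boot via the md device
# rather than via a raw member partition.
for disk in /dev/vda /dev/vdb; do
    grub2-install --target=i386-pc --modules="mdraid1x diskfilter" "$disk"
done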

bgilbert (Contributor, Author) commented Dec 5, 2020:

Thanks for the comments!

> There's no guarantee the firmware itself won't write to the ESP, including via some other EFI program. When that happens, the RAID becomes inconsistent and there's no reliable way to repair it.

An earlier draft proposed to maintain independent ESPs on each disk, but that would make it infeasible to mount "the ESP" inside the OS. Periodic RAID resync would still fix any breakage, no?

> As for the $BOOT volume, upstream GRUB puts grubenv here. In Fedora it's here on BIOS systems, while UEFI systems put it on the ESP. The grubenv is used in Fedora for the GRUB hidden menu feature.

Fedora CoreOS (and RHEL CoreOS) doesn't use that feature. We're reading the grubenv but nothing in our configs writes to it.

> If GRUB doesn't know $BOOT is an mdadm device, how does fallback work when there's a read error or the device is missing?

We always boot from the first disk. If the first disk is missing, the second disk becomes the first disk.

bgilbert (Contributor, Author) commented Dec 5, 2020:

> An alternative might be firmware RAID, which mdadm also supports.

Right, but the firmware might not.

cmurf commented Dec 5, 2020:

> An earlier draft proposed to maintain independent ESPs on each disk, but that would make it infeasible to mount "the ESP" inside the OS.

Yep, I understand. This immediately exposes the unfortunate paradigm of having the ESP persistently mounted. It was always a bad idea, but we did it because we didn't have a smarter way of doing it. On Windows and Mac OS, the ESP is never persistently mounted and is never exposed to the user at all. The thing that "owns" the ESP, for modifications and updates, is responsible for mounting the filesystem, making changes, then unmounting it. So again, I'd say this is the realm of bootupd, and/or maybe fwupd. And we should stop putting the ESP in fstab.

In the fwupd case, it could do some tests: prefer the ESP listed first in the NVRAM boot order, mount that ESP and check whether it has enough free space, and if not, try the other one and set a "boot next" NVRAM entry. Maybe clean-up/garbage collection needs to go in there somewhere too. For bootupd it might be simpler: keep an ESP file list on sysroot and use that as the source of authority, so the ESPs are just clones of it and should always be identical (at least their directory on the ESP should be identical; there may also be a question about the BLS directory... which I'll ignore for now).
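
A very rough sketch of that bootupd-style cloning, with made-up paths and device names purely for illustration (nothing here is an actual bootupd interface):

src=/usr/lib/bootupd/esp-content           # hypothetical source-of-truth tree on sysroot
for esp in /dev/vda2 /dev/vdb2; do         # hypothetical ESP partitions
    mnt=$(mktemp -d /run/esp.XXXXXX)       # mount briefly under /run instead of via fstab
    mount "$esp" "$mnt"
    rsync -a --delete "$src/" "$mnt/EFI/"  # make this ESP an exact clone of the list
    umount "$mnt"
    rmdir "$mnt"
done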

> Periodic RAID resync would still fix any breakage, no?

Nope. It's ambiguous which drive is correct; there are no checksums. In the degraded-array case (a legitimate degraded assembly, still mounting the md array device), the event count in the mdadm superblock of the active member device is updated, so it's determinable how to scrub and "catch up" the device with the lower event count. But in this example, where writes happen outside the mdadm infrastructure, the mdadm superblocks are still identical. A scrub repair will make things worse; it can actually break both of the ESPs because, without checksums or versions, it just picks a block that's assumed to be correct and overwrites the mismatching block, and it's not consistent about how it does this.
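
For reference, this is the scrub mechanism in question; assuming the array shows up as /dev/md127, a check or repair is driven through sysfs, and for RAID 1 there is no checksum to say which member's copy is the correct one:

echo check > /sys/block/md127/md/sync_action    # count mismatched blocks, change nothing
cat /sys/block/md127/md/mismatch_cnt            # read after the check completes
echo repair > /sys/block/md127/md/sync_action   # rewrite mismatching blocks from one member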

> Fedora CoreOS (and RHEL CoreOS) doesn't use that feature. We're reading the grubenv but nothing in our configs writes to it.

OK, it may not be an issue for CoreOS now. But there's a pre-proposal for Silverblue to rebase on CoreOS, and right now Silverblue does use this feature. There are any number of reasons why the grubenv design is suboptimal, so really the solution is to fix that, but ... resources.

> We always boot from the first disk. If the first disk is missing, the second disk becomes the first disk.

The fallback is built into GRUB's mdraid support. It handles degraded operations.

I'm not sure what "second disk becomes the first" looks like in grub.cfg. But even if it's possible to script the fallback, I'm still skeptical about the handling of a failed drive that isn't missing but instead spews zeros or garbage, a failure mode pretty common with SSDs.

bgilbert (Contributor, Author) commented:

Okay, thanks for pursuing this. I'm seeing three issues:

  1. Firmware might desynchronize a RAIDed ESP. My reading of drivers/md/raid1.c is that a resync will always copy from the first drive to the others, but any reads that happen before the resync might come from any drive, and of course any writes based on those reads are suspect. We could initiate a resync on every boot but I think it'd be hard to demonstrate that no suspect reads could slip in.

    I agree we should switch back to the independent-ESP model. #794 ("40ignition-ostree: copy ESP contents as independent filesystems") implements the OS half, and coreos/butane#178 ("config/fcos/v1_[34]: un-RAID ESP") implements the FCCT half.

  2. GRUB doesn't know that /boot is a RAID. The design assumed this was okay because any failed drive would be ignored by the firmware, but that's not necessarily the case; the firmware will boot from the first readable drive, but may still enumerate the bad drive and fail any I/O to it. In this case it's useful for GRUB to retry with another replica.

    At present, we can fix this fully on UEFI. We can fix it partially on BIOS, but a complete fix would require rerunning grub2-install, which we're not equipped to do yet (coreos/bootupd#53, "implement BIOS (grub)"). In turn, that means we can't yet switch to an MD 1.2 superblock. coreos/coreos-assembler#1979 ("grub: read from md/md-boot if it exists") switches to GRUB RAID support as much as possible, and coreos/fedora-coreos-tracker#702 ("Reinstall BIOS GRUB during first boot if /boot RAID is enabled") proposes reinstalling BIOS GRUB using bootupd.

  3. grubenv writes will desynchronize /boot (before the changes in point 2) or will fail (after the changes). We have a consensus not to use grubenv for FCOS and RHCOS, and Silverblue is out of scope at present, so it appears there's nothing to be done here for now.

cmurf commented Dec 21, 2020:

> a resync will always copy from the first drive to the others

That's less bad than I thought, but I agree ambiguity and risk remain.

This Workstation WG issue starts out being about grubenv and /boot on Btrfs, but eventually comes around to drawing on bootupd as a possible way to decouple /boot and /boot/efi from RPM, i.e. a single source of truth on sysroot for "how to boot the system", and likewise to simplify boot and make it more reliable. My pipe dream is that the ESP, BIOS Boot, and boot partitions become the sort of plumbing that isn't user-facing at all, whether at install time or at repair/replacement time.
https://pagure.io/fedora-workstation/issue/206#comment-706695

cmurf commented Dec 25, 2020:

Pipe dream cont'd ...
If /etc/fstab is not populated, I guess systemd's gpt-auto-generator can kick in and mount the ESP to /boot or /efi - so I wonder how to deconflict between gpt-auto-generator, bootupd, and fwupd?

In the simple single-disk case, only bootupd/fwupd touching either /boot or /efi should cause the ESP to get mounted. Anything else should probably get an SELinux denial.

But in the dual ESP case, systemd ignores extra ESPs. It only automounts the ESP on the drive that the bootloader says it used for booting. So now what? (a) teach systemd about multiple ESPs? (b) inhibit systemd ESP automount, and have bootupd/fwupd deal with all of it? (e.g. mount each of them in turn to some location in /run, do what needs to be done, then umount them).

bgilbert (Contributor, Author) commented:

We don't ship the gpt-auto-generator. The plan is to pursue option (b); see for example coreos/bootupd#127.

c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this pull request Mar 27, 2023
ci: use the RHEL 8.5 repos on the mirror