bootupd fails on mirrored boot disks #1485

Open
jbpratt opened this issue May 2, 2023 · 8 comments

@jbpratt

jbpratt commented May 2, 2023

Describe the bug

Follow-up issue to https://discussion.fedoraproject.org/t/bootctl-update-fails-with-failed-to-update-efi-failed-to-find-esp-device/81663/3

coreos/bootupd#132

My CoreOS machine was off for a few weeks and upon reboot I was met with:

error: ../../grub-core/loader/arm64/linux.c:60:invalid magic number.
error: ../../grub-core/loader/arm64/linux.c:279:you need to load the kernel
first.

Press any key to continue...

This was while running 37.20230401.3.0. I’m able to boot using the previous snapshot, 37.20230322.3.0.

I’m seeing that coreos-bootupctl-update-aarch64.service is failing, which I missed initially:

[core@majora ~]$ sudo systemctl status coreos-bootupctl-update-aarch64.service
× coreos-bootupctl-update-aarch64.service - Update aarch64 Bootloader
     Loaded: loaded (/usr/lib/systemd/system/coreos-bootupctl-update-aarch64.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Wed 2023-04-26 19:26:43 CDT; 19s ago
    Process: 1184 ExecStart=/usr/bin/bootupctl update (code=exited, status=1/FAILURE)
   Main PID: 1184 (code=exited, status=1/FAILURE)
        CPU: 11ms

Apr 26 19:26:42 majora systemd[1]: Starting coreos-bootupctl-update-aarch64.service - Update aarch64 Bootloader...
Apr 26 19:26:43 majora bootupctl[1184]: error: internal error: Failed to update EFI: Failed to find ESP device
Apr 26 19:26:43 majora systemd[1]: coreos-bootupctl-update-aarch64.service: Main process exited, code=exited, status=1/FAILURE
Apr 26 19:26:43 majora systemd[1]: coreos-bootupctl-update-aarch64.service: Failed with result 'exit-code'.
Apr 26 19:26:43 majora systemd[1]: Failed to start coreos-bootupctl-update-aarch64.service - Update aarch64 Bootloader.

Here is some additional information about the system:

[core@majora ~]$ sudo bootupctl status
Component EFI
  Installed: grub2-efi-aa64-1:2.06-10.fc35.aarch64,shim-aa64-15.4-5.aarch64
  Update: Available: grub2-efi-aa64-1:2.06-88.fc37.aarch64,shim-aa64-15.6-2.aarch64
No components are adoptable.
CoreOS aleph image ID: fedora-coreos-35.20220424.3.0-metal.aarch64.raw
Boot method: EFI

[core@majora ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Wed 2023-04-26 19:41:31 UTC)
Deployments:
  fedora:fedora/aarch64/coreos/stable
                  Version: 37.20230401.3.0 (2023-04-17T16:29:37Z)
               BaseCommit: e9287e4ec341ea061ce9763e5fea574c1d8e2e61b0a4bb2638960d6d24ae90a4
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
            SecAdvisories: 1 unknown severity, 1 low, 2 moderate
                     Diff: 36 upgraded, 2 added
          LayeredPackages: ...
           EnabledModules: cri-o:1.25

● fedora:fedora/aarch64/coreos/stable
                  Version: 37.20230322.3.0 (2023-04-03T20:34:30Z)
               BaseCommit: 7ca8df047f700f912dea94c3cb997d833cad064e0cd8490cfa3fe3eaf2c64f1c
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: ...
           EnabledModules: cri-o:1.25
                   Pinned: yes

The machine is a [Honeycomb LX2 1](https://www.solid-run.com/wp-content/uploads/2021/01/HoneyComb-LX2-Datasheet.pdf)

Reproduction steps

Install 37.20230322.3.0 with mirrored boot disks and try to upgrade

Expected behavior

Successfully upgrading!

Actual behavior

Fails to boot with

error: ../../grub-core/loader/arm64/linux.c:60:invalid magic number.
error: ../../grub-core/loader/arm64/linux.c:279:you need to load the kernel
first.

Press any key to continue...

System details

[core@majora ~]$ rpm-ostree status -b
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Tue 2023-05-02 20:32:51 UTC)
BootedDeployment:
● fedora:fedora/aarch64/coreos/stable
                  Version: 37.20230322.3.0 (2023-04-03T20:34:30Z)
               BaseCommit: 7ca8df047f700f912dea94c3cb997d833cad064e0cd8490cfa3fe3eaf2c64f1c
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A
          LayeredPackages: cockpit-networkmanager cockpit-ostree cockpit-podman cockpit-selinux cockpit-storaged cockpit-system cri-o cri-tools kubeadm kubectl
                           kubelet
           EnabledModules: cri-o:1.25
                   Pinned: yes

Bare metal, arm64

Butane or Ignition config

No response

Additional information

No response

@cgwalters
Member

(One confusion I had here is that the relevant systemd unit was removed shortly after shipping it in coreos/fedora-coreos-config@ff39c7d
...if we didn't have barriers per the discussion in #1263 then we would have needed to keep that unit around)

Anyways right...one thing that probably would have helped here is to have that unit actively block upgrades if it failed instead of just being entirely disconnected from the chain of zincati/rpm-ostree.

Fixing that probably wants better rpm-ostree/bootupd integration, and then for zincati to drive that.

But that's just a mechanism to detect and block upgrades in this scenario. Obviously we should fix this case too, which is tracked in the bootupd issue you've already linked: coreos/bootupd#132
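
As a rough sketch of the detection half only (this is not how zincati or rpm-ostree integrate with bootupd today), a check along these lines could flag a machine whose bootloader is lagging behind. The function name `bootloader_update_pending` is made up for illustration, and the match assumes the `Update: Available` line format seen in the `bootupctl status` output earlier in this issue.

```shell
# Sketch only. Reads `bootupctl status` text on stdin and returns 0 when
# any component reports a pending update (an "Update: Available" line).
# The function name is hypothetical; the line format is taken from the
# status output pasted in this issue.
bootloader_update_pending() {
    grep -q '^[[:space:]]*Update: Available'
}

# Possible use on a live system (not executed here):
#   if systemctl is-failed --quiet coreos-bootupctl-update-aarch64.service \
#       || sudo bootupctl status | bootloader_update_pending; then
#       echo "bootloader update failed or is still pending" >&2
#   fi
```

Something like this only reports the condition; blocking the upgrade itself would still need the rpm-ostree/bootupd integration mentioned above.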

@bgilbert
Contributor

bgilbert commented May 3, 2023

We don't have a good way to signal that a machine is no longer updating, unless someone happens to SSH in and look, so it's a tough call. There's an argument that it's better to break a machine than to quietly stop applying security updates.

@dustymabe
Member

> (One confusion I had here is that the relevant systemd unit was removed shortly after shipping it in coreos/fedora-coreos-config@ff39c7d
> ...if we didn't have barriers per the discussion in #1263 then we would have needed to keep that unit around)

NO

If we didn't have barriers then every aarch64 FCOS node out there would have no doubt failed to upgrade at some point unless we decided to pin on a <6.2 kernel forever. Barriers allowed us to force systems through a point in the graph that had a <6.2 kernel AND had coreos-bootupctl-update-aarch64.service so we could ensure they would successfully update. It was not an option for us to just "keep that unit around"; i.e. barriers are good for more than just dropping migration code. We were up against timelines we didn't fully control (6.2 kernel landing).

More context on this in #1441

@dustymabe
Member

At this point do we have any workarounds we can document so we can help @jbpratt and others recover their systems?

@bgilbert
Contributor

bgilbert commented May 3, 2023

It should be possible to fix the bootloader by booting the old kernel, mounting the EFI partition on each disk, and manually copying in the new shim and GRUB. I'm not sure whether that would confuse bootupd, though.

@jlebon
Member

jlebon commented May 3, 2023

> It should be possible to fix the bootloader by booting the old kernel, mounting the EFI partition on each disk, and manually copying in the new shim and GRUB. I'm not sure whether that would confuse bootupd, though.

Indeed, that should work. Rather than copying the files by hand, you can do an invocation similar to what bootupd does. Here's a sample script of what all this would look like:

# First ESP replica: mount it, copy in the updated shim and GRUB, unmount.
sudo mount /dev/disk/by-label/esp-1 /boot/efi
sudo cp -rp /usr/lib/bootupd/updates/EFI /boot/efi
sudo umount /boot/efi
# Second ESP replica: same steps.
sudo mount /dev/disk/by-label/esp-2 /boot/efi
sudo cp -rp /usr/lib/bootupd/updates/EFI /boot/efi
sudo umount /boot/efi

bootupd should normally know how to reuse existing mounts, but I can't get it to work in a quick test (and anyway, the bootupd-state.json file is shared, so you'd need additional hacks for the second one to force a bootupd update). Once bootupd gets RAID support, it will think the EFI blobs are older than they really are, but that shouldn't prevent it from updating further. (At least testing locally, manually modified EFI files don't prevent bootupctl update from proceeding.)
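
For setups with more than two replicas, or to preview the commands before running them, the same steps can be wrapped in a loop. This is only a sketch: the `run` and `sync_esps` helpers and the `DRY_RUN` variable are names invented here, while the paths are the ones from the script above.

```shell
# Sketch only: repeat the mount/copy/unmount steps for each ESP device
# passed as an argument. `run`, `sync_esps` and DRY_RUN are hypothetical
# names, not part of bootupd.

# Print the command instead of executing it when DRY_RUN is non-empty.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

sync_esps() {
    src=/usr/lib/bootupd/updates/EFI   # updated shim and GRUB shipped in the OS
    mnt=/boot/efi                      # mount point used in the script above
    for dev in "$@"; do
        run sudo mount "$dev" "$mnt" || return 1
        # If the copy fails, still try to unmount before giving up.
        run sudo cp -rp "$src" "$mnt" || { run sudo umount "$mnt"; return 1; }
        run sudo umount "$mnt" || return 1
    done
}
```

Calling `DRY_RUN=1 sync_esps /dev/disk/by-label/esp-1 /dev/disk/by-label/esp-2` prints the six commands without touching the disks; dropping `DRY_RUN=1` runs them.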

@dustymabe
Member

@jbpratt - does that workaround work for you?

@jbpratt
Author

jbpratt commented May 5, 2023

hey @dustymabe @jlebon that did seem to work! Thank you for the help 🐱

[core@majora ~]$ rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; periodically polling for updates (last checked Fri 2023-05-05 13:30:40 UTC)
Deployments:
● fedora:fedora/aarch64/coreos/stable
                  Version: 38.20230414.3.0 (2023-05-01T22:13:51Z)
...
