Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initrd: failed to deploy ostree after reboot #905

Open
MrSmiths opened this issue Jul 20, 2021 · 9 comments
Open

initrd: failed to deploy ostree after reboot #905

MrSmiths opened this issue Jul 20, 2021 · 9 comments

Comments

@MrSmiths
Copy link

MrSmiths commented Jul 20, 2021

Bug Report

Environment

Intel(R) Xeon(R) CPU E5-2690 3xvCPU 3.5GB RAM/racknerd/KVM

Expected Behavior

reboot after upgrade to usable OS...

Actual Behavior

boot process stopped on error - "ostree-prepare-root[510]: ostree-prepare-root: Couldn't find specified OSTree root /sysroot//ostree/boot.0/fedora-coreos/095ceed8d893fa.../0 : no such file or directory"

Reproduction Steps

  1. download/mount/boot iso fedora-coreos-34.20210611.3.0-live.x86_64.iso
  2. coreos-installer install /dev/vda --copy-network -i /tmp/bla.ign
  3. poweroff
  4. boot from HD.. EDIT: zincati isn't culprit seems ignition-ostree-uuid-boot.service needs to be in failed state after 1st reboot to simulate issue "wait.. zincati will find 34.20210626.3.1 and reboot it to broken ostree.. screenshot in attached archive (3 of 6 servers did broke this way..)"

Other Information

During boot after zincati "upgrade" tang server was not contacted
EDIT: possible? workaround is to:

  1. stop/disable zincati.service immediately after 1st boot
  2. rpm-ostree cleanup -p
  3. test reboot - OK
  4. rpm-ostree upgrade
  5. test reboot - OK

FCOS nicely communicating with tang server..

attached file contains screenshot + yaml/ign file
bla.zip

@lucab lucab transferred this issue from coreos/zincati Jul 21, 2021
@lucab
Copy link
Contributor

lucab commented Jul 21, 2021

Thanks for the report, I've moved it to FCOS tracker as this isn't a Zincati bug.

During boot after zincati "upgrade" tang server was not contacted
workaround is to:
[...]
4. rpm-ostree upgrade
5. test reboot - OK

Auto-upgrades go through the same rpm-ostree flow, so there is possibly something more subtle going on here. It could be a network race or an unit ordering mixup in the initrd. It would be helpful to get the full journal from the failed boot.

@lucab lucab changed the title zincati breaking ostree initrd: failed to deploy ostree after reboot Jul 21, 2021
@dustymabe
Copy link
Member

Also: which version of FCOS was this node originally installed on?

@cgwalters
Copy link
Member

You can find that via cat /sysroot/.coreos-aleph-version.json

@MrSmiths
Copy link
Author

reimaged another server.. journals in archive attached..

  • 1st_boot.journal - after coreos-installer nice boot with tang communication, rootfs unlock, login prompt..

after 1st boot in console:

Fedora CoreOS 34.20210611.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/

[systemd]
Failed Units: 2
  emergency.service
  ignition-ostree-uuid-boot.service
[super_cow@bla_serv ~]$

[super_cow@bla_serv ~]$
Broadcast message from Zincati at Wed 2021-07-21 11:24:06 UTC:
New update 34.20210626.3.1 is available and has been deployed.
If permitted by the update strategy, Zincati will reboot into this update when all
interactive users have logged out, or in 10 minutes, whichever comes earlier.
Please log out of all active sessions in order to let the auto-update process continue.

[root@bla_serv super_cow]# journalctl -u ignition-ostree-uuid-boot.service
-- Journal begins at Wed 2021-07-21 11:18:43 UTC, ends at Wed 2021-07-21 11:24:55 UTC. --
Jul 21 11:18:47 localhost systemd[1]: Starting Ignition OSTree: Regenerate Filesystem UUID (boot)...
Jul 21 11:18:47 localhost ignition-ostree-firstboot-uuid[459]: e2fsck 1.45.6 (20-Mar-2020)
Jul 21 11:18:47 localhost ignition-ostree-firstboot-uuid[459]: Pass 1: Checking inodes, blocks, and sizes
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[459]: Pass 2: Checking directory structure
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[459]: Pass 3: Checking directory connectivity
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[459]: Pass 4: Checking reference counts
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[459]: Pass 5: Checking group summary information
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[459]: boot: 324/98304 files (0.3% non-contiguous), 111961/393216 blocks
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[460]: tune2fs 1.45.6 (20-Mar-2020)
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[460]: This operation requires a freshly checked filesystem.
Jul 21 11:18:48 localhost ignition-ostree-firstboot-uuid[460]: Please run e2fsck -f on the filesystem.
Jul 21 11:18:48 localhost systemd[1]: ignition-ostree-uuid-boot.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 11:18:48 localhost systemd[1]: ignition-ostree-uuid-boot.service: Failed with result 'exit-code'.
Jul 21 11:18:48 localhost systemd[1]: Failed to start Ignition OSTree: Regenerate Filesystem UUID (boot).
[root@bla_serv super_cow]#

[root@bla_serv super_cow]#
[root@bla_serv super_cow]#
[root@bla_serv super_cow]# mount |grep boot
/dev/vda3 on /boot type ext4 (ro,nosuid,nodev,relatime,seclabel)
[root@bla_serv super_cow]# umount /dev/vda3
[root@bla_serv super_cow]# e2fsck -f /dev/vda3
e2fsck 1.45.6 (20-Mar-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
boot: 323/98304 files (0.3% non-contiguous), 111961/393216 blocks
[root@bla_serv super_cow]# mount /dev/vda3 /boot/
[root@bla_serv super_cow]#
[root@bla_serv super_cow]# rpm-ostree status
State: idle
AutomaticUpdatesDriver: Zincati
  DriverState: active; update staged: 34.20210626.3.1; reboot delayed due to active user sessions
Deployments:
  ostree://fedora:fedora/x86_64/coreos/stable
                   Version: 34.20210626.3.1 (2021-07-14T14:49:01Z)
                    Commit: 252fffde6f56d183a3c51c05a0c602b61011f6cb4de23a58313ba3b0023dc360
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
                      Diff: 48 upgraded

● ostree://fedora:fedora/x86_64/coreos/stable
                   Version: 34.20210611.3.0 (2021-06-28T15:43:17Z)
                    Commit: 0bdf0aee2585cafb224be19eec3b77501cb2044f028cf43a78f4de7ebd7c1a47
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
[root@bla_serv super_cow]#

every boot after automatic upgrade:

  • after_upgrade_broken_OS_boot-rdsosreport.txt
  • after_upgrade_broken_OS_boot.journal
  • during_broken_boot.png

@MrSmiths
Copy link
Author

reimaged again with zincati.service masked..
again 2 services failed after ignite..

  • emergency.service
  • ignition-ostree-uuid-boot.service
    rpm-ostree showing only original ostree.. zincati is down..
    reboot.. -> server - KO..

@MrSmiths
Copy link
Author

another reimage.. the same VM, ign file..

  • clean boot
  • /dev/vda3 mounted on /boot clean
  • reboot -> server OK..
Fedora CoreOS 34.20210611.3.0
Tracker: https://github.com/coreos/fedora-coreos-tracker
Discuss: https://discussion.fedoraproject.org/c/server/coreos/

Last login: Wed Jul 21 21:52:15 2021 from 212.102.48.76
[super_cow@bla_serv ~]$ sudo su
[sudo] password for super_cow:
[root@bla_serv super_cow]#
[root@bla_serv super_cow]#
[root@bla_serv super_cow]# mount |grep vda3
/dev/vda3 on /boot type ext4 (ro,nosuid,nodev,relatime,seclabel)
[root@bla_serv super_cow]#
[root@bla_serv super_cow]# rpm-ostree status
State: idle
Deployments:
● ostree://fedora:fedora/x86_64/coreos/stable
                   Version: 34.20210611.3.0 (2021-06-28T15:43:17Z)
                    Commit: 0bdf0aee2585cafb224be19eec3b77501cb2044f028cf43a78f4de7ebd7c1a47
              GPGSignature: Valid signature by 8C5BA6990BDB26E19F2A1A801161AE6945719A39
[root@bla_serv super_cow]#
[root@bla_serv super_cow]#
[root@bla_serv super_cow]# journalctl -u ignition-ostree-uuid-boot.service
-- Journal begins at Wed 2021-07-21 21:30:25 UTC, ends at Wed 2021-07-21 21:54:14 UTC. --
Jul 21 21:30:29 localhost systemd[1]: Starting Ignition OSTree: Regenerate Filesystem UUID (boot)...
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: e2fsck 1.45.6 (20-Mar-2020)
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: Pass 1: Checking inodes, blocks, and sizes
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: Pass 2: Checking directory structure
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: Pass 3: Checking directory connectivity
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: Pass 4: Checking reference counts
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: Pass 5: Checking group summary information
Jul 21 21:30:29 localhost ignition-ostree-firstboot-uuid[459]: boot: 324/98304 files (0.3% non-contiguous), 111961/393216 blocks
Jul 21 21:30:30 localhost ignition-ostree-firstboot-uuid[460]: tune2fs 1.45.6 (20-Mar-2020)
Jul 21 21:30:30 localhost ignition-ostree-firstboot-uuid[456]: Regenerated UUID for /dev/disk/by-label/boot
Jul 21 21:30:30 localhost systemd[1]: Finished Ignition OSTree: Regenerate Filesystem UUID (boot).
Jul 21 21:33:32 localhost systemd[1]: ignition-ostree-uuid-boot.service: Deactivated successfully.
Jul 21 21:33:32 localhost systemd[1]: Stopped Ignition OSTree: Regenerate Filesystem UUID (boot).
[root@bla_serv super_cow]#

@MrSmiths
Copy link
Author

2x clean reignates with reboot OK
3rd:

  • again 2 services failed
  • vda3 mounted in /boot
  • reboot -> server KO
sudo su -
Last login: Wed Jul 21 22:43:21 UTC 2021 on pts/0
[systemd]
Failed Units: 2
  emergency.service
  ignition-ostree-uuid-boot.service
[root@bla_serv ~]# systemctl status ignition-ostree-uuid-boot.service
× ignition-ostree-uuid-boot.service
     Loaded: not-found (Reason: Unit ignition-ostree-uuid-boot.service not found.)
     Active: failed (Result: exit-code) since Wed 2021-07-21 22:17:34 UTC; 28min ago
   Main PID: 457 (code=exited, status=1/FAILURE)
        CPU: 25ms

Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[460]: Pass 3: Checking directory connectivity
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[460]: Pass 4: Checking reference counts
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[460]: Pass 5: Checking group summary information
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[460]: boot: 324/98304 files (0.3% non-contiguous), 111961/393216 blocks
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[461]: tune2fs 1.45.6 (20-Mar-2020)
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[461]: This operation requires a freshly checked filesystem.
Jul 21 22:17:34 localhost ignition-ostree-firstboot-uuid[461]: Please run e2fsck -f on the filesystem.
Jul 21 22:17:34 localhost systemd[1]: ignition-ostree-uuid-boot.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 22:17:34 localhost systemd[1]: ignition-ostree-uuid-boot.service: Failed with result 'exit-code'.
Jul 21 22:17:34 localhost systemd[1]: Failed to start Ignition OSTree: Regenerate Filesystem UUID (boot).
[root@bla_serv ~]#
[root@bla_serv ~]# blkid
/dev/vda3: LABEL="boot" UUID="96d15588-3596-4b3c-adca-a2ff7279ea63" BLOCK_SIZE="1024" TYPE="ext4" PARTLABEL="boot" PARTUUID="7c6b848f-d990-4031-ae10-28b085eb226b"
/dev/mapper/root: LABEL="root" UUID="0b14989c-f563-4ca2-810b-d6648798c503" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sr0: BLOCK_SIZE="2048" UUID="2021-06-28-16-21-47-00" LABEL="fedora-coreos-34.20210611.3.0" TYPE="iso9660" PTUUID="04d83a37" PTTYPE="dos"
/dev/vda1: PARTLABEL="BIOS-BOOT" PARTUUID="20b4dcab-8632-4f5c-94a6-8410fb33fca6"
/dev/vda2: SEC_TYPE="msdos" LABEL_FATBOOT="EFI-SYSTEM" LABEL="EFI-SYSTEM" UUID="4A06-02B4" BLOCK_SIZE="512" TYPE="vfat" PARTLABEL="EFI-SYSTEM" PARTUUID="ed5a9f87-41cb-4b50-8761-ed8a77f58caa"
/dev/vda4: UUID="ef8dc3fa-1edc-49b3-9f89-69ae63e3a9f6" LABEL="luks-root" TYPE="crypto_LUKS" PARTLABEL="root" PARTUUID="33f25636-b24d-4f28-a805-5ec1351670df"

@dustymabe
Copy link
Member

dustymabe commented Jul 22, 2021

hmm. OK. I actually have some experience with ignition-ostree-uuid-boot.service failing because of #735

At least for me, I was hitting the issue because my system didn't have a hwclock and thus the system on every boot started many days in the past until NTP set it correctly. This caused tune2fs to balk because one date was older than the other in the ext4 metadata for the filesystem.

It appears a clock issue is not what is causing the failure for you (at least the logs seem to show a time not in the far distant past). So not sure exactly why you are hitting the problem, but.. the fix that we did for #735 had two parts:

  1. works around the time issue by artificially modifying some time values
  2. works around the need for e2fsck at all if the filesystem was created with the metadata_csum_seed feature flag.

So fix 1. should be active in 34.20210626.3.1 and newer. For fix 2. we switched over the filesystem to be created with the metadata_csum_seed feature flag in coreos/fedora-coreos-config#1063 , but that has not yet made it to stable, but should next week. For now you could try with the latest testing (34.20210711.2.1).

@dustymabe
Copy link
Member

could you now try the latest stable released today? 34.20210711.3.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants