testing: failed to start rpm-ostree daemon after auto-update to 31.20191127.1 #322

Closed
lucab opened this issue Dec 5, 2019 · 9 comments

@lucab
Contributor

lucab commented Dec 5, 2019

My canary node updated from 30.20191014.1 to 31.20191127.1 and properly rebooted into it; however it does not seem to be very healthy.

rpm-ostree daemon is not available anymore, as it fails to start with:

Dec 04 18:25:59 localhost systemd[1]: Starting RPM-OSTree System Management Daemon...
Dec 04 18:25:59 localhost rpm-ostree[748]: Reading config file '/etc/rpm-ostreed.conf'
Dec 04 18:25:59 localhost rpm-ostree[748]: error: Couldn't start daemon: Error setting up sysroot: Unexpected state: /run/ostree-booted found, but no /boot/loader directory
Dec 04 18:25:59 localhost systemd[1]: rpm-ostreed.service: Main process exited, code=exited, status=1/FAILURE

Indeed, /boot is empty:

$ ls -la /boot/
total 0
drwxr-xr-x.  2 root root   6 Jan  1  1970 .
drwxr-xr-x. 12 root root 253 Dec  4 18:25 ..

For some unknown reason, it looks like this machine ended up without a boot.mount unit:

$ systemctl cat boot.mount

No files found for boot.mount

Additionally, any node that ends up in this situation will need manual intervention. Zincati cannot auto-update out of it, as rpm-ostreed is not available.

This is a node on platform aws which was born on version 30.20191002.0, and up to this version it was updating without issues.
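
For anyone wanting to check whether a node shows the same symptoms, here is a minimal sketch of the check implied by the daemon error above (/run/ostree-booted present, but no /boot/loader). The helper name `check_boot_state` and its optional root-prefix argument are hypothetical, added only so the logic can be tried against a scratch directory:

```shell
#!/bin/sh
# Sketch only: classify a root tree by the two paths named in the
# rpm-ostreed error message. check_boot_state is a hypothetical helper.
check_boot_state() {
    root="${1:-}"   # optional prefix; empty means the live system
    if [ ! -e "$root/run/ostree-booted" ]; then
        echo "not-ostree"      # not booted via ostree at all
    elif [ -e "$root/boot/loader" ]; then
        echo "ok"              # /boot is populated as rpm-ostreed expects
    else
        echo "boot-missing"    # the state described in this issue
    fi
}
```

Running `check_boot_state` with no argument inspects the live system; a result of `boot-missing` would match the failure described here.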

@jlebon
Member

jlebon commented Dec 5, 2019

Ouch.

Hmm, OK, I think I see what's going on here. In coreos/fedora-coreos-config#155, we switched from static boot mount units to creating them from a generator. That is fine, except that starting with coreos/fedora-coreos-config#219, we made those boot mounts conditional on /sysroot/.coreos-aleph-version.json. And of course, nodes booted from an image created with an old enough cosa (without coreos/coreos-assembler#768) won't have that file, so we don't generate a boot mount anymore. (This didn't break RHCOS because it did have entries in /etc/fstab from Anaconda.)

So this probably bricked all the nodes that fall in that bucket. 😢 This is why #228 is crucial.

That said, I'm not sure if there's much we can do here. We'll probably just have to send a notice that updates broke and users that provisioned from versions older than X (I think 30.20191014.0) will have to reprovision?
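
To see which bucket a node falls in, the condition on /sysroot/.coreos-aleph-version.json described above can be approximated as follows. This is a rough illustration and not the actual generator code; `would_generate_boot_mount` and its sysroot argument are hypothetical names:

```shell
#!/bin/sh
# Sketch only: mirrors the "aleph file must exist" condition described above.
# would_generate_boot_mount is a hypothetical helper, not the real generator.
would_generate_boot_mount() {
    sysroot="${1:-/sysroot}"
    # Succeeds only when the aleph version file is present.
    [ -e "$sysroot/.coreos-aleph-version.json" ]
}
```

On a live node, `would_generate_boot_mount || echo "no aleph file"` printing the message would suggest the node was provisioned from an image old enough to be affected.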

lucab added a commit to lucab/fedora-coreos-streams that referenced this issue Dec 5, 2019
We fully rolled out `31.20191127.1` on `testing`, but it looks like
auto-updating from some old releases might not be safe.
Temporarily hiding this rollout while we investigate
coreos/fedora-coreos-tracker#322.
jlebon pushed a commit to coreos/fedora-coreos-streams that referenced this issue Dec 5, 2019
@dustymabe
Member

> We'll probably just have to send a notice that updates broke and users that provisioned from versions older than X (I think 30.20191014.0) will have to reprovision?

Great... I tested 30.20191014.0 -> 30.20191014.1 -> 31.20191127.1, but nothing older than 30.20191014.0.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue Dec 5, 2019
Just look at `/etc/fstab` for `/boot` mounts to determine whether to
generate them ourselves or let the `systemd-fstab-generator` do it.

The `grep` stuff here is a bit gory but it should work for our purposes.
The next step up would be actually parsing it using the appropriate APIs
in a compiled language.

See also: coreos/fedora-coreos-tracker#322
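
The fstab check this commit message describes could look roughly like the sketch below. It is not the code from the commit: it uses `awk` rather than `grep`, and `fstab_has_boot` is a hypothetical helper name. The only assumption about the file is the standard fstab layout, where the second whitespace-separated field is the mount point:

```shell
#!/bin/sh
# Sketch only: succeed if the given fstab has a non-comment entry for /boot.
# fstab_has_boot is a hypothetical helper; the real commit may differ.
fstab_has_boot() {
    fstab="${1:-/etc/fstab}"
    [ -r "$fstab" ] || return 1
    # Ignore comment lines; field 2 of an fstab entry is the mount point.
    awk '$1 !~ /^#/ && $2 == "/boot" { found = 1 } END { exit !found }' "$fstab"
}
```

A generator following this idea would emit its own boot mount only when `fstab_has_boot` fails, and otherwise leave `/boot` to `systemd-fstab-generator`.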
@cgwalters
Member

If anyone hits this, the workaround is just:

mount /dev/disk/by-label/boot /boot

Then upgrade and reboot.

@cgwalters
Member

> So this probably bricked all the nodes that fall in that bucket

"bricked" is a strong word here - yes, this didn't meet the quality bar I set for myself and our team, but there's a one liner workaround. I'd say this broke the ability to receive new updates by default - "bricked" I think is more when the system cannot be recovered even manually.

@jlebon
Member

jlebon commented Dec 6, 2019

> "bricked" is a strong word here - yes, this didn't meet the quality bar I set for myself and our team, but there's a one liner workaround

Agreed.

To be clear, I think it's inevitable that we're going to hit these kinds of really subtle issues. There's a lot of "undeclared" dependencies between cosa, FCOS, and the pipeline. The exercise in having to "rewind the clock" to do another f30 release also demonstrated this.

In this particular instance, I don't think the problem was we didn't think hard enough. It was (and still is) that we have no coverage for upgrade testing.

jlebon added a commit to coreos/fedora-coreos-config that referenced this issue Dec 6, 2019
Just look at `/etc/fstab` for `/boot` mounts to determine whether to
generate them ourselves or let the `systemd-fstab-generator` do it.

See also: coreos/fedora-coreos-tracker#322
@jlebon
Member

jlebon commented Dec 6, 2019

coreos/fedora-coreos-config#247 is merged now.

Was chatting with @lucab; instead of doing another async release, let's just pull forward the next release a bit and roll it in -- so e.g. early next week?

@mrguitar

mrguitar commented Dec 6, 2019

Thanks for posting this and the workaround. It fixed my box. Cheers.

@dustymabe
Member

This is obvious because /boot isn't mounted and rpm-ostreed isn't running, but it's worth noting that this is one case where rpm-ostree rollback doesn't work. In the vast majority of other failures I've seen with any of our ostree systems, rpm-ostree rollback was the golden ticket.

@lucab
Contributor Author

lucab commented Dec 12, 2019

We released 31.20191211.1 which fixed this.

Nodes that are stuck require manual intervention to proceed with upgrades.
The workaround is to manually mount /boot, restart rpm-ostreed and zincati, and let the
auto-upgrades proceed:

sudo mount /dev/disk/by-label/boot /boot
sudo systemctl reset-failed rpm-ostreed.service
sudo systemctl restart rpm-ostreed.service zincati.service
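
If the steps need to be applied to several nodes, the same commands can be wrapped with a guard so that re-running them is harmless. This is only a sketch: `recover_boot`, `run`, and the `DRY_RUN` variable are illustrative names and not part of any tool, and the functions must be run as root outside of dry-run mode:

```shell
#!/bin/sh
# Sketch only: the workaround above, skipping the mount if /boot is
# already a mount point. recover_boot, run and DRY_RUN are hypothetical.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"        # print the command instead of executing it
    else
        "$@"
    fi
}

recover_boot() {
    bootdir="${1:-/boot}"
    # Only mount when nothing is mounted there yet.
    if ! mountpoint -q "$bootdir"; then
        run mount /dev/disk/by-label/boot "$bootdir"
    fi
    run systemctl reset-failed rpm-ostreed.service
    run systemctl restart rpm-ostreed.service zincati.service
}
```

Calling `recover_boot` with `DRY_RUN=1` set prints the commands that would be executed, which can be a useful check before touching a production node.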
