rpm-ostree/deploy: downgrade logic mistakenly blocks some upgrades #481

Closed · lucab opened this issue May 14, 2020 · 19 comments

@lucab (Contributor) commented May 14, 2020

Deployments:
* ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 31.20200420.2.0 (2020-04-21T16:48:43Z)
                    Commit: 0098d20fe3cea0fbee2ecd5f6e9e7d3b1c45846b61b389bce8f37ddcba690bf5
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4

Currently, on the testing channel there are two rollouts:

  • 31.20200505.2.0 at 100%
  • 31.20200505.2.1 at 0% (which will start in a few minutes)

Right now, creating a new testing machine from an older OS image (31.20200420.2.0 in the example above) results in the following error in the Zincati journal:

[INFO ] starting update agent (zincati 0.0.9)
...
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] new release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed:
 error: Upgrade target
 revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b'
 with timestamp 'Wed May  6 16:42:59 2020'
 is chronologically older than current
 revision 'a047a3c97511d242bab1a9c136debcf1cf47981bea9deb7affa38fb202dae0b3'
 with timestamp 'Wed May 13 23:33:50 2020'; use --allow-downgrade to permit

Cross-referencing releases with their revisions (the timestamps involved can be inspected as sketched below), it seems that:

  • rpm-ostree status and Zincati agree that 31.20200420.2.0 is the booted deployment
  • Zincati is asking to update 31.20200420.2.0 -> 31.20200505.2.0
  • rpm-ostreed is detecting and blocking a downgrade 31.20200505.2.1 -> 31.20200505.2.0
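
For reference, a minimal sketch of how one could inspect the two timestamps that the deploy check is comparing on an affected node, assuming both commits named in the error have already been pulled into the local repo (only stock rpm-ostree/ostree commands are used):

# Booted deployment, as reported by rpm-ostree:
$ rpm-ostree status --booted

# Commit metadata (including the timestamp) for the two revisions named in the error above:
$ ostree show --repo=/ostree/repo a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b
$ ostree show --repo=/ostree/repo a047a3c97511d242bab1a9c136debcf1cf47981bea9deb7affa38fb202dae0b3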
lucab changed the title from "rpm-ostree: deploy downgrade logic mistakenly blocks some upgrades" to "rpm-ostree/deploy: downgrade logic mistakenly blocks some upgrades" on May 14, 2020
@jlebon (Member) commented May 14, 2020

I'm investigating this. I think it's an issue in libostree's handling of downgrade protection in combination with how rpm-ostree handles `deploy revision=...`.

So this might have broken automatic updates. An easy manual workaround which still respects barriers is something like:

$ mkdir -p /run/zincati/config.d
$ echo 'updates.allow_downgrade = true' > /run/zincati/config.d/allow-downgrade.toml
$ systemctl restart zincati

jlebon self-assigned this May 14, 2020
@jlebon (Member) commented May 14, 2020

We also need to expand upgrade testing to include older revisions than just the previous release. That would've caught this.

@lucab (Contributor, Author) commented May 14, 2020

For reference, the downgrade-blocking logic has been active since Zincati 0.0.7: https://github.com/coreos/zincati/releases/tag/0.0.7.

We introduced a barrier on testing in the past for 30.20191014.1, but at that point Zincati was still at 0.0.5.

@bgilbert (Contributor) commented:

> So this might have broken automatic updates.

This should fix itself, though, right? Once a node is permitted to update to 31.20200505.2.1, won't it do so successfully?

@jlebon (Member) commented May 14, 2020

> So this might have broken automatic updates.
>
> This should fix itself, though, right? Once a node is permitted to update to 31.20200505.2.1, won't it do so successfully?

Yeah, I should be more precise. I think this breaks automatic updates for nodes sitting before an update barrier when there are releases past the barrier. Since, as Luca mentioned, we've only had one barrier so far, and downgrade protection wasn't on back then, this is OK for now. But we need to resolve this before creating any new barriers (which we're contemplating doing for the f32 update: #480).

@lucab (Contributor, Author) commented May 14, 2020

In some cases it may not fix itself. As per the Zincati state machine, once an update is detected the client keeps trying to stage and finalize it.

The machine in the original report is stuck in that state, even though the newest rollout is now available:

$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock
zincati_rpm_ostree_deploy_attempts_total 6
zincati_rpm_ostree_deploy_failures_total 6

@dustymabe (Member) commented:

IIUC the way a client would get into a "stuck forever" state is by attempting an upgrade to N-1 when N is staged but has not yet started to roll out. Practically that should be:

  • new nodes that were created with N-2 or older in the period of time when N is staged but not yet started to roll out.

Any auto-updating nodes should have already been on N-1, so it should be mostly limited to the above-mentioned scenario, correct?

@jlebon (Member) commented May 14, 2020

> In some cases it may not fix itself. As per the Zincati state machine, once an update is detected the client keeps trying to stage and finalize it.

Hmm, I wonder if it should go back to ReportedSteady instead when staging fails? Say, for example, rpm-ostree is hitting up against libostree's ENOSPC protection for a few days before I notice and clean up or expand storage; I'd expect my node to reboot into the latest update, not into whatever the latest update was when the alert went off (and possibly double-reboot).

@dustymabe (Member) commented:

> • new nodes that were created with N-2 or older in the period of time when N is staged but not yet started to roll out.

clarification/update:

  • new nodes that were created with N-2 or older in the period of time when N is staged but not yet 100% rolled out, i.e. the period of time in which the newly installed node could attempt to upgrade to N-1 instead of N.

jlebon added a commit to jlebon/rpm-ostree that referenced this issue May 14, 2020
Both libostree and rpm-ostree support downgrade protection. But what
that means is different between the two. For libostree, downgrade
protection means not fetching any commit which is older than what the
current ref is pointing at. For rpm-ostree, it means not *deploying*
any commit which is older than what the current *deployment* is on.

These two are mostly the same most of the time, but can differ. For
example, on a remote ref which has commits A -> B -> C, where the client
is sitting on a deployment from A, downgrade protection should not
prevent the client from upgrading to B even if there is a newer commit
C.

Since there is no hard relation enforced between what the state of the
OSTree ref is locally and deployments (e.g. we fully support users
manually doing `ostree pull`), it doesn't make sense to compare against
the tip of the ref. Instead, use the new `timestamp-check-from-rev`
to tell libostree to compare against our base revision, which is what we
care about.

Closes: coreos/fedora-coreos-tracker#481
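
For illustration, a minimal sketch of the two comparison points this commit message contrasts, as seen on a Fedora CoreOS node. It assumes a non-layered deployment (so the booted deployment's `checksum` field is its base commit), that jq is installed, and it reuses the refspec from the report above:

# What rpm-ostree's downgrade protection should compare against: the booted deployment's base commit.
$ booted=$(rpm-ostree status --json | jq -r '.deployments[] | select(.booted) | .checksum')
$ ostree show --repo=/ostree/repo "$booted" | grep '^Date:'

# What libostree's plain timestamp-check compares against: the tip of the local ref.
$ tip=$(ostree rev-parse --repo=/ostree/repo fedora:fedora/x86_64/coreos/testing)
$ ostree show --repo=/ostree/repo "$tip" | grep '^Date:'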
jlebon added a commit to jlebon/ostree that referenced this issue May 14, 2020
The way `timestamp-check` works might be too restrictive in some
situations. Essentially, we need to support the case where users want to
pull an older commit than the current tip, but while still guaranteeing
that it is newer than some even older commit.

This will be used in Fedora CoreOS. For more information see:
coreos/rpm-ostree#2094
coreos/fedora-coreos-tracker#481
@jlebon (Member) commented May 14, 2020

This is fixed with ostreedev/ostree#2099 and coreos/rpm-ostree#2094!

@lucab (Contributor, Author) commented May 14, 2020

A tentative enhancement on the Zincati side is at coreos/zincati#286.

@dustymabe (Member) commented:

Thanks @jlebon and @lucab for jumping on this problem and root-causing it so quickly this morning (EDT) from the limited information I had available. Now that we have a bead on fixing this problem for future deployments, I'd like to brainstorm whether there is any possible way we can fix it without human intervention for existing systems that are currently "stuck" in the "attempting to upgrade" state.

One idea I have is to:

  • temporarily change Cincinnati to not provide any update edges
    • systems in a non-"stuck" state won't be given any update paths
  • methodically modify our OSTree server repo to make the ref for each stream point at each of {N-N}...N in turn for a brief period of time (30 minutes each, maybe?)
    • foreach ver in {N-N}..N; set ref to $commitforver; sleep 30m; done
  • change Cincinnati back to serving upgrade edges.

This would mean that for a period of time we would no longer fail the "downgrade protection" check. It looks like Zincati retries every 5 minutes, so 30 minutes should be plenty, I would think.

In order not to have a long period where we aren't serving update edges via Cincinnati, we could spread this out across several days or weeks (i.e. just do a few per day or something).

jlebon added the jira label (for syncing to Jira) May 14, 2020
@jlebon (Member) commented May 14, 2020

> This would mean that for a period of time we would no longer fail the "downgrade protection" check. It looks like Zincati retries every 5 minutes, so 30 minutes should be plenty, I would think.

We discussed this on IRC, but essentially: this won't work around the downgrade protection check because those nodes would've already fetched the latest ref and that's what downgrade protection checks against.

So looking at the last 4 stable releases here and their Zincati versions, we have:

  • 31.20200420.3.0: 0.0.9
  • 31.20200407.3.0: 0.0.9
  • 31.20200323.3.2: 0.0.9
  • 31.20200310.3.0: 0.0.6

So there have been two rollouts so far where Zincati 0.0.9 (which has downgrade protection) was used. But since nodes need to be at N-2 or older to hit this, I think the only nodes that could enter the Zincati dead-end state are those started at 31.20200323.3.2 that witnessed the 31.20200420.3.0 rollout and were given an edge to 31.20200407.3.0. That rollout lasted 48h, during which it became progressively less likely that you'd be handed that edge.

Looking at testing, the problem is worse. The first release with 0.0.9 was 31.20200323.2.0, and there have been 5 rollouts since (which means 4 rollout windows during which, if you brought up a node older than N-1 but at or newer than 31.20200323.2.0, you could enter the dead-end state depending on the edge Cincinnati gave you).

It'd be good to have some way to measure how many nodes are stuck there. The good news is that for stable it seems like a pretty limited exposure.

That said, we shouldn't add any more rollouts without ostreedev/ostree#2099 and coreos/rpm-ostree#2094 (and/or coreos/zincati#286) in place, so we don't further increase the number of starting nodes that are susceptible to this.
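
As a per-node check for the "stuck" signature, one can reuse the Zincati metrics socket shown earlier in this issue; a minimal sketch, the assumption being that on a wedged node the attempt and failure counters climb together:

# If attempts and failures are equal and both keep increasing across scrapes,
# the agent is repeatedly retrying (and failing) the same blocked deploy.
$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock \
    | grep -E 'zincati_rpm_ostree_deploy_(attempts|failures)_total'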

jlebon added a commit to coreos/fedora-coreos-streams that referenced this issue May 15, 2020
As a mild mitigation for
coreos/fedora-coreos-tracker#481, let's remove
the previous rollout from the graph so that no more nodes get into that
state.
openshift-merge-robot pushed a commit to coreos/rpm-ostree that referenced this issue May 15, 2020
This is a short-term hack until we can depend on the new
`timestamp-check-from-rev` from ostree:

ostreedev/ostree#2099

That way, we still get downgrade protection, but wrt the checked out
deployment, not the local ref.

For more information, see
#2094
coreos/fedora-coreos-tracker#481
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
@jlebon (Member) commented May 15, 2020

New rpm-ostree with the fix: coreos/fedora-coreos-config#401.
Though we should keep this open until we expand our tests.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue May 15, 2020
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue May 15, 2020
@dustymabe (Member) commented:

> We discussed this on IRC, but essentially: this won't work around the downgrade protection check because those nodes would've already fetched the latest ref and that's what downgrade protection checks against.

@jlebon correct me if I'm wrong here, but during our video meeting on Friday I think we looked at the code, and it turns out we do think #481 (comment) would work, because when fetching the ref we don't consider downgrade protection.

I don't know if it's something we'll do but it's an option.

@dustymabe (Member) commented:

@lucab - @jlebon and I chatted more towards the end of the day on Friday. We had the option of backporting just the rpm-ostree fix into a testing release today (Monday) and then doing a stable release later in the week based on that testing release. We hit a complication because rpm-ostree in dist-git had moved past the version that was already in FCOS stable/testing (we did a snapshot build recently for COSA). This means we'd have to make a backport branch in order to backport just the rpm-ostree fix, which we could easily do, but...

Rather than doing all of that, since it was late on Friday, we decided to just move the newly released version of rpm-ostree (including the fix and new features) into testing-devel and then forge ahead with this week's round of releases as planned. This means this week's testing release will contain the new release of rpm-ostree, but the stable release will not, because we want to soak-test the new rpm-ostree release like we normally do.

We think this is OK because we know how to mitigate this problem server-side by removing the "to N-1" upgrade edge in the graph once we've started to release N. So we'll just continue to do that for a period of time, until we think enough people have moved on from the affected FCOS starting points.

@dustymabe (Member) commented:

This landed upstream in rpm-ostree and a new rpm-ostree release was done.

The fix for this went into testing stream release 31.20200517.2.0.

@dustymabe (Member) commented:

The fix for this went into stable stream release 31.20200517.3.0.

dustymabe removed the status/pending-stable-release label ("Fixed upstream and in testing. Waiting on stable release.") Jun 4, 2020
@lucab (Contributor, Author) commented Jul 2, 2020

For anybody who has a machine blocked in this state, it is possible to manually allow downgrades for the current boot in order to get back into the normal auto-updates flow:

echo 'updates.allow_downgrade = true' | sudo tee /run/zincati/config.d/99-issues-481.toml
sudo systemctl restart zincati

After the first successful upgrade, the configuration fragment will be automatically cleaned up (together with the rest of /run).
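
To confirm that a node picked up the override and resumed the normal flow, one can watch the Zincati journal and re-check the metrics socket shown earlier in this issue; a minimal sketch:

# Follow the agent's progress; a successful staging should show up here after the restart.
$ sudo journalctl -f -u zincati.service

# The deploy failure counter should stop increasing once staging succeeds.
$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock \
    | grep zincati_rpm_ostree_deploy_failures_total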

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Apr 17, 2023
Older FCOS will complain about an upgrade target being 'chronologically older than current'.
This is documented in coreos/fedora-coreos-tracker#481.
We'll work around the problem here with a config dropin.
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Apr 17, 2023
c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this issue May 17, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023