rpm-ostree/deploy: downgrade logic mistakenly blocks some upgrades #481

Closed · lucab opened this issue May 14, 2020 · 19 comments

@lucab (Contributor) commented May 14, 2020

Deployments:
* ostree://fedora:fedora/x86_64/coreos/testing
                   Version: 31.20200420.2.0 (2020-04-21T16:48:43Z)
                    Commit: 0098d20fe3cea0fbee2ecd5f6e9e7d3b1c45846b61b389bce8f37ddcba690bf5
              GPGSignature: Valid signature by 7D22D5867F2A4236474BF7B850CB390B3C3359C4

Currently, on the testing channel there are two rollouts:

  • 31.20200505.2.0 at 100%
  • 31.20200505.2.1 at 0% (which will start in a few minutes)

Right now, creating a new testing machine from an older OS image (31.20200420.2.0 in the example above) results in the following error in the Zincati journal:

[INFO ] starting update agent (zincati 0.0.9)
...
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] new release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed:
 error: Upgrade target
 revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b'
 with timestamp 'Wed May  6 16:42:59 2020'
 is chronologically older than current
 revision 'a047a3c97511d242bab1a9c136debcf1cf47981bea9deb7affa38fb202dae0b3'
 with timestamp 'Wed May 13 23:33:50 2020'; use --allow-downgrade to permit

Cross-referencing releases with their revisions (the timestamps involved can be inspected as sketched below), it seems that:

  • rpm-ostree status and Zincati agree that 31.20200420.2.0 is the booted deployment
  • Zincati is asking to update 31.20200420.2.0 -> 31.20200505.2.0
  • rpm-ostreed is detecting and blocking a downgrade 31.20200505.2.1 -> 31.20200505.2.0
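
For reference, a minimal sketch of how one could inspect the two timestamps that the deploy check is comparing on an affected node, assuming both commits named in the error have already been pulled into the local repo (only stock rpm-ostree/ostree commands are used):

# Booted deployment, as reported by rpm-ostree:
$ rpm-ostree status --booted

# Commit metadata (including the timestamp) for the two revisions named in the error above:
$ ostree show --repo=/ostree/repo a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b
$ ostree show --repo=/ostree/repo a047a3c97511d242bab1a9c136debcf1cf47981bea9deb7affa38fb202dae0b3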
lucab changed the title from "rpm-ostree: deploy downgrade logic mistakenly blocks some upgrades" to "rpm-ostree/deploy: downgrade logic mistakenly blocks some upgrades" on May 14, 2020
@jlebon (Member) commented May 14, 2020

I'm investigating this. I think it's an issue in libostree's handling of downgrade protection in combination with how rpm-ostree handles `deploy revision=...`.

So this might have broken automatic updates. An easy manual workaround which still respects barriers is something like:

$ mkdir -p /run/zincati/config.d
$ echo 'updates.allow_downgrade = true' > /run/zincati/config.d/allow-downgrade.toml
$ systemctl restart zincati

jlebon self-assigned this May 14, 2020
@jlebon (Member) commented May 14, 2020

We also need to expand upgrade testing to include older revisions than just the previous release. That would've caught this.

@lucab (Contributor, Author) commented May 14, 2020

For reference, the downgrade-blocking logic has been active since Zincati 0.0.7: https://github.com/coreos/zincati/releases/tag/0.0.7.

We introduced a barrier on testing in the past for 30.20191014.1, but at that point Zincati was still at 0.0.5.

@bgilbert (Contributor) commented:

> So this might have broken automatic updates.

This should fix itself, though, right? Once a node is permitted to update to 31.20200505.2.1, won't it do so successfully?

@jlebon (Member) commented May 14, 2020

> So this might have broken automatic updates.
>
> This should fix itself, though, right? Once a node is permitted to update to 31.20200505.2.1, won't it do so successfully?

Yeah, I should be more precise. I think this breaks automatic updates for nodes sitting before an update barrier when there are releases past the barrier. Since, as Luca mentioned, we've only had one barrier so far, and downgrade protection wasn't on back then, this is OK for now. But we need to resolve this before creating any new barriers (which we're contemplating doing for the f32 update: #480).

@lucab (Contributor, Author) commented May 14, 2020

In some cases it may not fix itself. As per the Zincati state machine, once an update is detected the client keeps trying to stage and finalize it.

The machine in the original report is stuck in that state, even though the newest rollout is now available:

$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock
zincati_rpm_ostree_deploy_attempts_total 6
zincati_rpm_ostree_deploy_failures_total 6

@dustymabe (Member) commented:

IIUC the way a client would get into a "stuck forever" state is by attempting an upgrade to N-1 when N is staged but has not yet started to roll out. Practically that should be:

  • new nodes that were created with N-2 or older in the period of time when N is staged but not yet started to roll out.

Any auto-updating nodes should have already been on N-1, so it should be mostly limited to the above-mentioned scenario, correct?

@jlebon (Member) commented May 14, 2020

> In some cases it may not fix itself. As per the Zincati state machine, once an update is detected the client keeps trying to stage and finalize it.

Hmm, I wonder if it should go back to ReportedSteady instead when staging fails? Say, for example, rpm-ostree is hitting up against libostree's ENOSPC protection for a few days before I notice and clean up or expand storage; I'd expect my node to reboot into the latest update, not into whatever the latest update was when the alert went off (and possibly double-reboot).

@dustymabe (Member) commented:

> • new nodes that were created with N-2 or older in the period of time when N is staged but not yet started to roll out.

clarification/update:

  • new nodes that were created with N-2 or older in the period of time when N is staged but not yet 100% rolled out, i.e. the period of time in which the newly installed node could attempt to upgrade to N-1 instead of N.

jlebon added a commit to jlebon/rpm-ostree that referenced this issue May 14, 2020
Both libostree and rpm-ostree support downgrade protection. But what
that means is different between the two. For libostree, downgrade
protection means not fetching any commit which is older than what the
current ref is pointing at. For rpm-ostree, it means not *deploying*
any commit which is older than what the current *deployment* is on.

These two are mostly the same most of the time, but can differ. For
example, on a remote ref which has commits A -> B -> C, where the client
is sitting on a deployment from A, downgrade protection should not
prevent the client from upgrading to B even if there is a newer commit
C.

Since there is no hard relation enforced between what the state of the
OSTree ref is locally and deployments (e.g. we fully support users
manually doing `ostree pull`), it doesn't make sense to compare against
the tip of the ref. Instead, use the new `timestamp-check-from-rev`
to tell libostree to compare against our base revision, which is what we
care about.

Closes: coreos/fedora-coreos-tracker#481
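
For illustration, a minimal sketch of the two comparison points this commit message contrasts, as seen on a Fedora CoreOS node. It assumes a non-layered deployment (so the booted deployment's `checksum` field is its base commit), that jq is installed, and it reuses the refspec from the report above:

# What rpm-ostree's downgrade protection should compare against: the booted deployment's base commit.
$ booted=$(rpm-ostree status --json | jq -r '.deployments[] | select(.booted) | .checksum')
$ ostree show --repo=/ostree/repo "$booted" | grep '^Date:'

# What libostree's plain timestamp-check compares against: the tip of the local ref.
$ tip=$(ostree rev-parse --repo=/ostree/repo fedora:fedora/x86_64/coreos/testing)
$ ostree show --repo=/ostree/repo "$tip" | grep '^Date:'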
jlebon added a commit to jlebon/ostree that referenced this issue May 14, 2020
The way `timestamp-check` works might be too restrictive in some
situations. Essentially, we need to support the case where users want to
pull an older commit than the current tip, but while still guaranteeing
that it is newer than some even older commit.

This will be used in Fedora CoreOS. For more information see:
coreos/rpm-ostree#2094
coreos/fedora-coreos-tracker#481
@jlebon (Member) commented May 14, 2020

This is fixed with ostreedev/ostree#2099 and coreos/rpm-ostree#2094!

@lucab (Contributor, Author) commented May 14, 2020

A tentative enhancement on the Zincati side is at coreos/zincati#286.

@dustymabe (Member) commented:

Thanks @jlebon and @lucab for jumping on this problem and root-causing it so quickly this morning (EDT) from the limited information I had available. Now that we have a bead on fixing this problem for future deployments, I'd like to brainstorm whether there is any possible way we can fix it without human intervention for existing systems that are currently "stuck" in the "attempting to upgrade" state.

One idea I have is to:

  • temporarily change Cincinnati to not provide any update edges
    • systems in a non-"stuck" state won't be given any update paths
  • methodically modify our OSTree server repo to make the ref for each stream point at each of {N-N}...N in turn for a brief period of time (30 minutes each, maybe?)
    • foreach ver in {N-N}..N; set ref to $commitforver; sleep 30m; done
  • change Cincinnati back to serving upgrade edges.

This would mean that for a period of time we would no longer fail the "downgrade protection" check. It looks like Zincati retries every 5 minutes, so 30 minutes should be plenty, I would think.

In order not to have a long period where we aren't serving update edges via Cincinnati, we could spread this out across several days or weeks (i.e. just do a few per day or something).

jlebon added the jira label (for syncing to Jira) May 14, 2020
@jlebon (Member) commented May 14, 2020

> This would mean that for a period of time we would no longer fail the "downgrade protection" check. It looks like Zincati retries every 5 minutes, so 30 minutes should be plenty, I would think.

We discussed this on IRC, but essentially: this won't work around the downgrade protection check because those nodes would've already fetched the latest ref and that's what downgrade protection checks against.

So looking at the last 4 stable releases here and their Zincati versions, we have:

  • 31.20200420.3.0: 0.0.9
  • 31.20200407.3.0: 0.0.9
  • 31.20200323.3.2: 0.0.9
  • 31.20200310.3.0: 0.0.6

So there have been two rollouts so far where Zincati 0.0.9 (which has downgrade protection) was used. But since nodes need to be at N-2 or older to hit this, I think the only nodes that could enter the Zincati dead-end state are those started at 31.20200323.3.2 that witnessed the 31.20200420.3.0 rollout and were given an edge to 31.20200407.3.0. That rollout lasted 48h, during which it became progressively less likely that you'd be handed that edge.

Looking at testing, the problem is worse. The first release with 0.0.9 was 31.20200323.2.0, and there have been 5 rollouts since (which means 4 rollout windows during which, if you brought up a node older than N-1 but at or newer than 31.20200323.2.0, you could enter the dead-end state depending on the edge Cincinnati gave you).

It'd be good to have some way to measure how many nodes are stuck there. The good news is that for stable it seems like a pretty limited exposure.

That said, we shouldn't add any more rollouts without ostreedev/ostree#2099 and coreos/rpm-ostree#2094 (and/or coreos/zincati#286) in place, so we don't further increase the number of starting nodes that are susceptible to this.
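
As a per-node check for the "stuck" signature, one can reuse the Zincati metrics socket shown earlier in this issue; a minimal sketch, the assumption being that on a wedged node the attempt and failure counters climb together:

# If attempts and failures are equal and both keep increasing across scrapes,
# the agent is repeatedly retrying (and failing) the same blocked deploy.
$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock \
    | grep -E 'zincati_rpm_ostree_deploy_(attempts|failures)_total'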

jlebon added a commit to coreos/fedora-coreos-streams that referenced this issue May 15, 2020
As a mild mitigation for
coreos/fedora-coreos-tracker#481, let's remove
the previous rollout from the graph so that no more nodes get into that
state.
openshift-merge-robot pushed a commit to coreos/rpm-ostree that referenced this issue May 15, 2020
This is a short-term hack until we can depend on the new
`timestamp-check-from-rev` from ostree:

ostreedev/ostree#2099

That way, we still get downgrade protection, but wrt the checked out
deployment, not the local ref.

For more information, see
#2094
coreos/fedora-coreos-tracker#481
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
@jlebon (Member) commented May 15, 2020

New rpm-ostree with the fix: coreos/fedora-coreos-config#401.
Though we should keep this open until we expand our tests.

jlebon added a commit to jlebon/fedora-coreos-config that referenced this issue May 15, 2020
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue May 15, 2020
dustymabe pushed a commit to coreos/fedora-coreos-config that referenced this issue May 15, 2020
@dustymabe (Member) commented:

> We discussed this on IRC, but essentially: this won't work around the downgrade protection check because those nodes would've already fetched the latest ref and that's what downgrade protection checks against.

@jlebon correct me if I'm wrong here, but during our video meeting on Friday I think we looked at the code, and it turns out we do think #481 (comment) would work, because when fetching the ref we don't consider downgrade protection.

I don't know if it's something we'll do but it's an option.

@dustymabe (Member) commented:

@lucab - @jlebon and I chatted more towards the end of the day on Friday. We had the option of backporting just the rpm-ostree fix into a testing release today (Monday) and then doing a stable release later in the week based on that testing release. We hit a complication because rpm-ostree in dist-git had moved past the version that was already in FCOS stable/testing (we did a snapshot build recently for COSA). This means we'd have to make a backport branch in order to backport just the rpm-ostree fix, which we could easily do, but...

Rather than doing all of that, since it was late on Friday, we decided to just move the newly released version of rpm-ostree (including the fix and new features) into testing-devel and then forge ahead with this week's round of releases as planned. This means this week's testing release will contain the new release of rpm-ostree, but the stable release will not, because we want to soak-test the new rpm-ostree release like we normally do.

We think this is OK because we know how to mitigate this problem server-side by removing the "to N-1" upgrade edge in the graph once we've started to release N. So we'll just continue to do that for a period of time, until we think enough people have moved on from the affected FCOS starting points.

@dustymabe (Member) commented:

This landed upstream in rpm-ostree and a new rpm-ostree release was done.

The fix for this went into testing stream release 31.20200517.2.0.

@dustymabe (Member) commented:

The fix for this went into stable stream release 31.20200517.3.0.

dustymabe removed the status/pending-stable-release label ("Fixed upstream and in testing. Waiting on stable release.") Jun 4, 2020
@lucab (Contributor, Author) commented Jul 2, 2020

For anybody who has a machine blocked in this state, it is possible to manually allow downgrades for the current boot in order to get back into the normal auto-updates flow:

echo 'updates.allow_downgrade = true' | sudo tee /run/zincati/config.d/99-issues-481.toml
sudo systemctl restart zincati

After the first successful upgrade, the configuration fragment will be automatically cleaned up (together with the rest of /run).
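
To confirm that a node picked up the override and resumed the normal flow, one can watch the Zincati journal and re-check the metrics socket shown earlier in this issue; a minimal sketch:

# Follow the agent's progress; a successful staging should show up here after the restart.
$ sudo journalctl -f -u zincati.service

# The deploy failure counter should stop increasing once staging succeeds.
$ sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock \
    | grep zincati_rpm_ostree_deploy_failures_total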

dustymabe added a commit to dustymabe/fedora-coreos-config that referenced this issue Apr 17, 2023
Older FCOS will complain about an upgrade target being 'chronologically older than current'.
This is documented in coreos/fedora-coreos-tracker#481.
We'll work around the problem here with a config dropin.
dustymabe added a commit to coreos/fedora-coreos-config that referenced this issue Apr 17, 2023
c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this issue May 17, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023
HuijingHei pushed a commit to HuijingHei/fedora-coreos-config that referenced this issue Oct 10, 2023