update-agent: react to persistent deploy failure #286

lucab · 2020-05-14T17:44:21Z

This augments the update-agent state machine so that it cannot get stuck in a
faulty UpdateAvailable state.
In particular, rpm-ostree may refuse to deploy the currently
detected update target. After a finite number of failed attempts,
Zincati now abandons its update target and goes back to the
NoNewUpdate state, polling Cincinnati again for a new target.

Ref: coreos/fedora-coreos-tracker#481
Closes: #288
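For readers skimming the thread, the transition described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Release` struct, the `failures` field, the by-value `deploy_attempted` signature and the cap of 12 are assumptions made for the example, and only three states are modelled.

```rust
/// Hypothetical stand-in for the real release type (assumption).
#[derive(Clone, Debug, PartialEq)]
struct Release {
    version: String,
}

/// Simplified subset of the agent states touched by this change.
#[derive(Debug, PartialEq)]
enum State {
    /// No target known; the agent keeps polling Cincinnati.
    NoNewUpdate,
    /// A target was found; `failures` counts failed deploy attempts so far.
    UpdateAvailable { release: Release, failures: u8 },
    /// The target was deployed successfully.
    UpdateStaged(Release),
}

impl State {
    /// Illustrative cap on failed attempts (assumption).
    const MAX_DEPLOY_ATTEMPTS: u8 = 12;

    /// Record the outcome of one deploy attempt and return the next state.
    fn deploy_attempted(self, success: bool) -> State {
        match self {
            State::UpdateAvailable { release, .. } if success => State::UpdateStaged(release),
            State::UpdateAvailable { release, failures } => {
                let failures = failures.saturating_add(1);
                if failures >= Self::MAX_DEPLOY_ATTEMPTS {
                    // Persistent failure: abandon the target and poll again.
                    State::NoNewUpdate
                } else {
                    // Re-entering edge: same target, one more failure recorded.
                    State::UpdateAvailable { release, failures }
                }
            }
            // Other states are not affected by a deploy outcome.
            other => other,
        }
    }
}

fn main() {
    let target = Release { version: "31.20200505.2.0".to_string() };
    let mut state = State::UpdateAvailable { release: target, failures: 0 };
    let mut attempts = 0;
    // Keep failing until the agent gives up on the target.
    while let State::UpdateAvailable { .. } = state {
        state = state.deploy_attempted(false);
        attempts += 1;
    }
    println!("abandoned after {} failed attempts, now in {:?}", attempts, state);
}
```

The counter lives inside the `UpdateAvailable` variant in this sketch, so abandoning the target also discards the failure count.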

New state-machine diagram:

bgilbert · 2020-05-14T18:38:07Z

I haven't attempted to review the code in any detail, but the change SGTM.

cgwalters

OK traced through this code a bit and re-familiarized myself with it. This change overall makes sense to me and the code looks good.

cgwalters · 2020-05-14T20:28:59Z

src/update_agent/mod.rs

+    /// is being abandoned.
+    fn deploy_attempted(&mut self, success: bool) -> Option<Release> {
+        // Maximum failed deploy attempts before declaring a persistent error.
+        const MAX_DEPLOY_ATTEMPTS: u8 = 60;


That seems like a lot - why not 3 or even 1?

lucab · 2020-05-15

Good catch, I botched this. It was supposed to be 60 minutes. Retrying every 5 minutes, the watermark here should be at 12.
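The arithmetic behind the corrected value, spelled out as a small sketch. The constant and function names here are hypothetical; only the 60-minute window and the 5-minute retry period come from the comment above.

```rust
use std::time::Duration;

/// Retry period between deploy attempts (5 minutes, per the comment above).
const RETRY_PERIOD: Duration = Duration::from_secs(5 * 60);
/// Intended window before declaring a persistent failure (60 minutes).
const FAILURE_WINDOW: Duration = Duration::from_secs(60 * 60);

/// Number of whole retry periods that fit in the failure window.
const fn max_attempts(window: Duration, period: Duration) -> u64 {
    window.as_secs() / period.as_secs()
}

fn main() {
    let attempts = max_attempts(FAILURE_WINDOW, RETRY_PERIOD);
    println!("max deploy attempts: {}", attempts);
}
```

Deriving the cap from the two durations, rather than hard-coding it, would keep the value consistent if the retry period ever changes; whether that is worth the indirection is a judgment call.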

This cap is a bit arbitrary; anything strictly larger than 1 would indeed work.
A single attempt would instead remove the re-entering edge. With the current ticking logic, that could result in an infinite fast-spinning loop where the agent oscillates between NoNewUpdate and UpdateAvailable as fast as it can.
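To make the oscillation concern concrete, here is a toy model. It rests on assumptions that are not taken from the actual actor code: every deploy fails, Cincinnati keeps offering the same target, and the agent only pauses when it follows the re-entering `UpdateAvailable` edge, ticking again immediately on any other transition.

```rust
/// Two-state toy model of the agent (assumption: other states are ignored).
#[derive(Clone, Copy, Debug)]
enum State {
    NoNewUpdate,
    UpdateAvailable { failures: u8 },
}

/// One tick where the deploy always fails and the same target is re-offered.
fn tick(state: State, max_attempts: u8) -> State {
    match state {
        State::NoNewUpdate => State::UpdateAvailable { failures: 0 },
        State::UpdateAvailable { failures } => {
            let failures = failures.saturating_add(1);
            if failures >= max_attempts {
                State::NoNewUpdate
            } else {
                State::UpdateAvailable { failures }
            }
        }
    }
}

/// Count how many of `ticks` transitions take the re-entering edge, which is
/// the only place where this toy agent waits before ticking again.
fn paused_ticks(max_attempts: u8, ticks: u32) -> u32 {
    let mut state = State::NoNewUpdate;
    let mut paused = 0;
    for _ in 0..ticks {
        let next = tick(state, max_attempts);
        let re_entered = matches!(
            (state, next),
            (State::UpdateAvailable { .. }, State::UpdateAvailable { .. })
        );
        if re_entered {
            paused += 1;
        }
        state = next;
    }
    paused
}

fn main() {
    // With a cap of 1 the re-entering edge is never taken, so nothing pauses.
    println!("cap 1: {} paused ticks out of 100", paused_ticks(1, 100));
    // With a larger cap most ticks land on the re-entering edge.
    println!("cap 12: {} paused ticks out of 26", paused_ticks(12, 26));
}
```

In this model a cap of 1 never pauses at all, which is the fast-spinning loop; any cap above 1 reintroduces at least one paused tick per cycle.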

lucab · 2020-05-15T08:05:52Z

Ok, I manually tested this on top of the current testing rollout with a fine-tuned rollout wariness, and the agent is indeed able to recover and select the new edge once it appears:

[INFO ] starting update agent (zincati 0.0.11-alpha.0)
...
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] target release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed: 
    error: Upgrade target revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b' with timestamp 'Wed May  6 16:42:59 2020' is chronologically older than current ...
[INFO ] target release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed:
    error: Upgrade target revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b' with timestamp 'Wed May  6 16:42:59 2020' is chronologically older than current ...
...
[WARN ] persistent deploy failure detected, target release '31.20200505.2.0' abandoned
[INFO ] target release '31.20200505.2.1' selected, proceeding to stage it
...


lucab · 2020-05-20T09:59:23Z

Pushed a fixup commit to address the comments.

jlebon

LGTM!

lucab added area/updates kind/enhancement labels May 14, 2020

lucab requested review from bgilbert and jlebon May 14, 2020 17:44

lucab mentioned this pull request May 14, 2020

rpm-ostree/deploy: downgrade logic mistakenly blocks some upgrades coreos/fedora-coreos-tracker#481

Closed

cgwalters approved these changes May 14, 2020

lucab force-pushed the ups/abandon-update branch from fab9350 to c29aaca Compare May 15, 2020 07:31

lucab changed the title ~~[RFC] update-agent: react to persistent deploy failure~~ update-agent: react to persistent deploy failure May 15, 2020

lucab mentioned this pull request May 15, 2020

systemd: Activative via zincati.timer, not by default #251

Closed

lucab added and removed the jira label May 15, 2020

jlebon reviewed May 19, 2020

update-agent: compare state by variant

ee2419c

This updates actor logic to use variant discriminants when comparing states.
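A minimal sketch of what comparing by variant means in Rust, using `std::mem::discriminant` from the standard library. The `State` enum, its fields and the `same_variant` helper are hypothetical and much simpler than the real agent state.

```rust
use std::mem::discriminant;

/// Hypothetical, simplified state enum for illustration only.
#[derive(Debug, PartialEq)]
enum State {
    NoNewUpdate,
    UpdateAvailable { version: String, failures: u8 },
}

/// True when both values are the same enum variant, ignoring their payloads.
fn same_variant(a: &State, b: &State) -> bool {
    discriminant(a) == discriminant(b)
}

fn main() {
    let before = State::UpdateAvailable { version: "31.20200505.2.0".into(), failures: 1 };
    let after = State::UpdateAvailable { version: "31.20200505.2.0".into(), failures: 2 };

    // Full equality treats the bumped failure counter as a different state...
    assert_ne!(before, after);
    // ...while the discriminant only looks at which variant is active.
    assert!(same_variant(&before, &after));
    assert!(!same_variant(&after, &State::NoNewUpdate));

    println!("same variant: {}", same_variant(&before, &after));
}
```

With a counter stored inside a variant, plain `==` would report a state change on every failed attempt, so a check on the variant alone is presumably what keeps a retry from looking like a transition.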

lucab force-pushed the ups/abandon-update branch from c29aaca to 3c5beba Compare May 20, 2020 09:51

jlebon approved these changes May 20, 2020

lucab added 2 commits May 20, 2020 16:23

docs/images: update state-machine diagram

16598a0

lucab force-pushed the ups/abandon-update branch from 3c5beba to 16598a0 Compare May 20, 2020 16:27

lucab merged commit 10d32d9 into coreos:master May 20, 2020

lucab deleted the ups/abandon-update branch May 20, 2020 16:59

lucab modified the milestones: vNext, v0.0.11 May 22, 2020
