Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update-agent: react to persistent deploy failure #286

Merged
merged 3 commits into from
May 20, 2020

Conversation

lucab
Copy link
Contributor

@lucab lucab commented May 14, 2020

This augments update-agent state-machine to prevent locking into a faulty
UpdateAvailable state.
In particular, rpm-ostree may not agree to deploy the currently
detected update target. After a finite amount of failed attempts,
Zincati now abandons its update target and goes back to the
NoNewUpdate state, polling Cincinnati again for a new target.

Ref: coreos/fedora-coreos-tracker#481
Closes: #288


New state-machine diagram:
diagram

@bgilbert
Copy link
Contributor

I haven't attempted to review the code in any detail, but the change SGTM.

Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK traced through this code a bit and re-familiarized myself with it. This change overall makes sense to me and the code looks good.

/// is being abandoned.
fn deploy_attempted(&mut self, success: bool) -> Option<Release> {
// Maximum failed deploy attempts before declaring a persistent error.
const MAX_DEPLOY_ATTEMPTS: u8 = 60;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a lot - why not 3 or even 1?

Copy link
Contributor Author

@lucab lucab May 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I botched this. It was supposed to be 60 minutes. Retrying every 5 minutes, the watermark here should be at 12.

This cap is a bit arbitrary, anything strictly larger than 1 would indeed work.
A single attempt instead would remove the re-entering edge. With the current ticking logic, it could result in an infinite fast-spinning loop where the agent is oscillating between NoNewUpdate and UpdateAvailable as fast as it can.

@lucab lucab changed the title [RFC] update-agent: react to persistent deploy failure update-agent: react to persistent deploy failure May 15, 2020
@lucab
Copy link
Contributor Author

lucab commented May 15, 2020

Ok, I manually tested this on top of the current testing rollout with a fine-tuned rollout wariness, and the agent is indeed able to recover and select the new edge once it appears:

[INFO ] starting update agent (zincati 0.0.11-alpha.0)
...
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] target release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed: 
    error: Upgrade target revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b' with timestamp 'Wed May  6 16:42:59 2020' is chronologically older than current ...
[INFO ] target release '31.20200505.2.0' selected, proceeding to stage it
[ERROR] failed to stage deployment: rpm-ostree deploy failed:
    error: Upgrade target revision 'a4395eae3e1844d806a79b5cf51d44e60c96c7ab261a715f5d7fd89584c6963b' with timestamp 'Wed May  6 16:42:59 2020' is chronologically older than current ...
...
[WARN ] persistent deploy failure detected, target release '31.20200505.2.0' abandoned
[INFO ] target release '31.20200505.2.1' selected, proceeding to stage it
...

@lucab lucab added jira for syncing to jira and removed jira for syncing to jira labels May 15, 2020
src/update_agent/actor.rs Show resolved Hide resolved
src/update_agent/mod.rs Show resolved Hide resolved
src/update_agent/mod.rs Outdated Show resolved Hide resolved
src/update_agent/actor.rs Outdated Show resolved Hide resolved
src/update_agent/mod.rs Show resolved Hide resolved
This updates actor logic to use variant discriminants when comparing
states.
@lucab
Copy link
Contributor Author

lucab commented May 20, 2020

Pushed a fixup commit to address the comments.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

lucab added 2 commits May 20, 2020 16:23
This augments actor state-machine to prevent locking into a faulty
`UpdateAvailable` state.
In particular, rpm-ostree may not agree to deploy the currently
detected update target. After a finite amount of failed attempts,
Zincati now abandons its update target and goes back to the
`NoNewUpdate` state, polling Cincinnati again for a new target.
@lucab lucab merged commit 10d32d9 into coreos:master May 20, 2020
@lucab lucab deleted the ups/abandon-update branch May 20, 2020 16:59
@lucab lucab modified the milestones: vNext, v0.0.11 May 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

agent: actor can get stuck on a faulty target update
4 participants