Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

service: exit and prevent restarts if booted into a container image #896

Merged
merged 1 commit into from
Nov 21, 2022

Conversation

cgwalters
Copy link
Member

This is part of coreos/fedora-coreos-tracker#1263

We don't yet have an official stance on how zincati and custom container images interact. Today, zincati just crash loops. This changes things so that we gracefully exit if we detect the booted system is using a container image origin.

(The code here isn't quite as clean as it could be; calling
std::process::exit() in the middle of the call chain isn't
elegant but doing better would require plumbing through an
Option<T> through many layers)

@lucab
Copy link
Contributor

lucab commented Nov 21, 2022

Thanks for assembling this patch, I like this direction. It sounds like we are now generally trying to address the case of Zincati somehow running on unexpected OS/state, which is a good idea from the point of view of UX.

One of the goals here seem to be avoiding an infinite restart loop. I think systemd has a dedicated RestartPreventExitStatus= for that exact purpose, but I've personally never used it.
If it allows us to both prevent the restart-loop but also record a failed state and a non-zero exit code, I think it is worth to switch to that.

Aside, I personally think it would be a good time to try to plumb an "error object with optional exit code" till here:

zincati/src/main.rs

Lines 58 to 61 in ebb241c

Err(e) => {
log_error_chain(e);
libc::EXIT_FAILURE
}

If you don't want to waste too much time on that, I can give it a try using anyhow primitives/downcasting, then add a commit on top here for that.

@cgwalters
Copy link
Member Author

So, automatic updates on by default are core to the CoreOS philosophy. Is your rationale behind having zincati explicitly fail (and hence that having that failure at least minimally appear via CLHM and other automation tools) to represent this state in a very visible way?

It might be of course that admins have set up something to upgrade the system not via zincati; it's easy to have an external orchestration system effectively just run rpm-ostree upgrade targeting a tag or even an explicit digest.

In that scenario, the user can mask zincati and hence mask the spurious failure.

On the other hand, what I was thinking with this patch is that their choice to use an image also kind of signals such an intent.

This is part of coreos/fedora-coreos-tracker#1263

We don't yet have an official stance on how zincati and custom
container images interact.  Today, zincati just crash loops.
This changes things so that we exit (but still with an error) if we detect
the booted system is using a container image origin.

One nicer thing here is that the unit status is also updated, e.g.
`systemctl status zincati` will show:

`Status: "Automatic updates disabled; booted into container image ..."`
@cgwalters
Copy link
Member Author

OK, I reworked this a bit to use a custom error type and still exit with an error.

@lucab
Copy link
Contributor

lucab commented Nov 21, 2022

I think so. Overall I don't think that "booted from a containerized commit" and "user wants to opt-out auto-updates" are two things that we should conflate at this time. Instead I'd scope this PR to "Zincati inspected this booted commit and doesn't know how to auto-update it".

If Zincati manages to start it can surface different kind of failures (both transient and permanent) through metrics to the monitoring infra. But if it fails to complete startup for whatever reason, it needs some other kind of system to expose this failure state.

Auto-updates are critical enough that "keep restarting" is often a good solution to eventually bypass any transient error. But in this case we know we can't get out of this situation without manual intervention, so I think it's fair to stay in a perma-failed state.

@lucab lucab changed the title Exit if booted into a container image service: exit and prevent restarts if booted into a container image Nov 21, 2022
if let Some(img) = status.container_image_reference.as_ref() {
let msg = format!("Automatic updates disabled; booted into container image {img}");
crate::utils::update_unit_status(&msg);
return Err(anyhow::Error::new(FatalError(msg)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more expecting this additional step to be performed in the error case in fedora_coreos_stream_from_deployment() (so that both cases would prevent restarts), but I guess just here it could work as well.

@cgwalters
Copy link
Member Author

For sure this is not the final word on this...I am now realizing how much work it will be to plumb through all the data into Cincinnati etc. around the container image metadata.

There's an argument that we should still crash-loop - that would handle the case where a user rebases back to an ostree commit today live. (But OTOH the most likely scenario is to reboot there anyways which will then trigger zincati retries)

@cgwalters
Copy link
Member Author

Overall I don't think that "booted from a containerized commit" and "user wants to opt-out auto-updates" are two things that we should conflate at this time. Instead I'd scope this PR to "Zincati inspected this booted commit and doesn't know how to auto-update it".

Agree, this PR is just having a better error in the latter case.

@cgwalters cgwalters merged commit 0203478 into coreos:main Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants