updates: new strategy based on local filesystem #245

Open · lucab opened this issue Mar 10, 2020 · 16 comments

@lucab
Contributor

lucab commented Mar 10, 2020

One interesting idea that came out of #204 (comment) is:

I'd still try to come up with a flow which does not completely bypass Zincati finalization (for example, giving permission to reboot only if a specific filepath exists) and which does not require SSHing to each node.

The idea is to have some kind of logic on each node to touch a file when finalization is allowed, and remove it when it is not allowed.
The controller can be a containerized agent, or some central task manager able to manipulate files on machines, or even a human via SSH (not recommended).

We won't provide the file-creation logic, only the updates strategy in Zincati. Strategy name still to be decided.
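
As a rough illustration of the controller side, here is a minimal Rust sketch (not part of Zincati; the flag path and function names are hypothetical at this point in the discussion) of an agent opening and closing the finalization window by creating and removing the file:

```rust
use std::fs;
use std::io;

// Hypothetical flag path; the proposal further down in this thread settles on
// /var/lib/zincati/allowfinalize.
const FLAG_PATH: &str = "/var/lib/zincati/allowfinalize";

/// "Touch" the flag file: finalization (reboot) is allowed from now on.
fn open_window() -> io::Result<()> {
    fs::write(FLAG_PATH, b"")
}

/// Remove the flag file: finalization is no longer allowed.
/// An already-missing file is treated as success.
fn close_window() -> io::Result<()> {
    match fs::remove_file(FLAG_PATH) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}

fn main() -> io::Result<()> {
    open_window()?;
    // ... maintenance window during which Zincati may finalize and reboot ...
    close_window()
}
```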

@cgwalters
Member

This case feels like it overlaps at least somewhat with systemd inhibitors - any process that doesn't want the system to reboot can use those today.

@lucab
Contributor Author

lucab commented Apr 16, 2021

Bunch of self-notes.

Removing a file to signal the end of the allowed finalization window is a critical step which may not be feasible at all times (e.g. because network/SSH/whatever is temporarily down). For this reason, there should be a way to encode an optional "not-after" timestamp so that windows can safely auto-expire.

Other strategies like periodic allow multiple reboots to happen in a single window (e.g. in case of barriers), so this signaling file should be placed under a persistent path like /var.
However, some other strategies like fleet_lock are only valid for a single finalization, so maybe there should be a knob in the strategy configuration to clean the file right before finalization.

While the idea partially overlaps with https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html, it diverges enough in semantics that I think it's worth designing a separate flow. In that sense, it is possibly more similar to a "persisted, time-bound flock".

@kelvinfan001
Member

kelvinfan001 commented Apr 21, 2021

Rationale

After some discussion with @lucab, we agreed that there is demand for having more manual control over Zincati updates. This could be achieved by giving users more fine-grained control over when update finalizations (reboots) are performed, and would thus be a compromise between full manual control over updates (#498) and fully automatic updates. Giving users the ability to more directly control when finalizations (reboots) are allowed should ideally divert most of the demand for fully manual updates (in most scenarios, if there is an update available, it should already be staged pretty quickly by Zincati). The idea here is to discourage users from ever needing to SSH into individual nodes to perform upgrades.

A low-level approach, such as checking for a file on the filesystem, should be flexible enough to address the need for manually controlling reboot windows. The advantages are that files can be written even when Zincati is not running, and that files are generally easy to manage from different environments, since the only thing required is access to the specific filesystem directory (e.g. from a container bind-mount, scp, etc.); this also makes the approach easily scriptable.

This is similar in many ways to the fleet_lock strategy. Both strategies provide a large amount of flexibility to the user. The main difference is that this proposed new strategy uses a "server-push" architecture, where the server could e.g. be using Ansible and have SSH access to FCOS nodes (of course, this new strategy could also work in the case of a human manually SSHing into the node and placing a file); whereas fleet_lock uses a "client-pull" architecture, where the Zincati client uses HTTP to pull information from a lock manager on whether a reboot is allowed. With both, admins should have plenty of flexibility in coordinating FCOS reboots.

Note: this is not a replacement for a "proper" alternative to rpm-ostree upgrade (#498), i.e. this strategy should ideally discourage users from needing to do manual upgrades; it does not provide a mechanism for direct manual upgrades. If, after this strategy is implemented, there is still demand for direct manual upgrades (e.g. an admin using periodic who would like to upgrade immediately to the absolute newest release, instead of waiting until a reboot window and possibly receiving a less up-to-date release due to rollout wariness), then we would need to build a separate mechanism, possibly using the D-Bus server, to satisfy that need.

Proposal

Mechanism

  • During regular finalization attempts (tick_finalize_update()), Zincati will check the specified location and try to read the file at that location. If the file exists and its contents match the specified format (containing at least the expiry datetime of the file), then finalization is allowed and a reboot will follow.
  • Introduce an accompanying filesystem strategy; if this strategy is used, the only way to finalize updates would be to use this new file-based mechanism, i.e. this strategy defaults to no reboots allowed. This mechanism can also be used alongside e.g. fleet_lock or periodic to allow more manual control.
  • At least initially, the file will not be removed before a reboot (in the future, we can extend the fields in the JSON object and allow Zincati to remove files). This means that consecutive reboots can happen once a file is placed. As @lucab mentioned, this will differ slightly from the otherwise similar fleet_lock strategy where only a single reboot/finalization is allowed.

Location

  • Persistent location under /var, at /var/lib/zincati/allowfinalize: this is so reboots can happen consecutively to update across multiple barrier updates, if any.

Contents

  • JSON object with a notAfter key containing a Unix timestamp as its value (see the sketch after this list)
  • File ownership and permissions must be root:root, 0644
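
As a rough illustration only (this is not Zincati's actual implementation; struct and function names are made up), a check along the lines described above could look like the following Rust sketch, assuming the serde and serde_json crates:

```rust
use std::fs;
use std::os::unix::fs::MetadataExt;
use std::time::{SystemTime, UNIX_EPOCH};

use serde::Deserialize;

/// Contents of the proposed flag file, e.g. {"notAfter": 1735689600}.
#[derive(Deserialize)]
struct AllowFinalize {
    #[serde(rename = "notAfter")]
    not_after: u64,
}

/// Returns true if the flag file exists, is owned by root:root with mode 0644,
/// and its notAfter timestamp has not expired yet.
fn finalization_allowed(path: &str) -> bool {
    // A missing or unreadable file means "no reboot allowed".
    let meta = match fs::metadata(path) {
        Ok(m) => m,
        Err(_) => return false,
    };
    // Enforce the proposed ownership and permission constraints.
    if meta.uid() != 0 || meta.gid() != 0 || (meta.mode() & 0o7777) != 0o644 {
        return false;
    }
    // Parse the JSON contents; reject malformed files.
    let allow = match fs::read_to_string(path)
        .ok()
        .and_then(|s| serde_json::from_str::<AllowFinalize>(&s).ok())
    {
        Some(a) => a,
        None => return false,
    };
    // Honor the auto-expiring window; fail closed if the clock is unreadable.
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(u64::MAX);
    now <= allow.not_after
}

fn main() {
    println!("{}", finalization_allowed("/var/lib/zincati/allowfinalize"));
}
```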

Future considerations / Alternative

Introduce a slightly higher-level mechanism that wraps the above in a CLI or D-Bus method, e.g. busctl call zincati start-window-now notAfter=xyz, busctl call zincati stop-window-now. This way, it is more user-friendly and has a smaller chance of human error due to e.g. wrong file permissions, incorrect location, or wrong format. Essentially, this file-based mechanism could be used as the implementation detail of a possible higher-level strategy. This is, however, less flexible than the above.

@cgwalters
Member

While the idea partially overlaps with https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html, it diverges enough in semantics that I think it's worth designing a separate flow. In that sense, it is possibly more similar to a "persisted, time-bound flock".

Can you elaborate on this? We should try to enumerate some use cases, but I think many if not most of them will also want to inhibit reboots for other reasons.

@cgwalters
Member

We had a realtime chat on this and I think my core argument is: zincati should monitor systemd for "block" locks (not "delay") and not even try to finalize if one is active, because what will happen is we'll finalize but be blocked on reboot which is exactly what we don't want.
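
For illustration, here is a rough sketch of such a logind check, assuming the zbus crate's blocking API (4.x); the function name is made up. It lists active inhibitors via org.freedesktop.login1.Manager.ListInhibitors and reports whether any "block" inhibitor covers shutdown, in which case the finalization attempt could be skipped entirely:

```rust
use zbus::blocking::Connection;

/// Returns true if any process currently holds a "block" inhibitor covering
/// shutdown; "delay" inhibitors are deliberately ignored.
fn shutdown_blocked() -> zbus::Result<bool> {
    let conn = Connection::system()?;
    // org.freedesktop.login1.Manager.ListInhibitors() -> a(ssssuu):
    // (what, who, why, mode, uid, pid)
    let reply = conn.call_method(
        Some("org.freedesktop.login1"),
        "/org/freedesktop/login1",
        Some("org.freedesktop.login1.Manager"),
        "ListInhibitors",
        &(),
    )?;
    let inhibitors: Vec<(String, String, String, String, u32, u32)> =
        reply.body().deserialize()?;
    Ok(inhibitors.iter().any(|(what, _who, _why, mode, _uid, _pid)| {
        mode == "block" && what.split(':').any(|w| w == "shutdown")
    }))
}

fn main() -> zbus::Result<()> {
    if shutdown_blocked()? {
        println!("block inhibitor active; skipping this finalization attempt");
    }
    Ok(())
}
```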

@cgwalters
Member

Arguably if we had this, we could try to train people doing interactive ssh logins to use systemd-inhibit --what=shutdown bash instead of zincati watching logind. But...eh.

@kelvinfan001
Member

We had a realtime chat on this and I think my core argument is: zincati should monitor systemd for "block" locks (not "delay") and not even try to finalize if one is active, because what will happen is we'll finalize but be blocked on reboot which is exactly what we don't want.

@cgwalters should we implement the monitoring on the rpm-ostree side instead? This would seem more natural to me. Perhaps this would also make it slightly less racy (but still racy nonetheless) since rpm-ostree is the one that actually calls systemctl reboot.

@cgwalters
Member

What would happen, though, when rpm-ostree finalize-deployment is called while a block inhibitor is held by another process? Would we error out? Block? (This debate mirrors the systemd debate on this.)

Actually, either way we choose, zincati should probably know not to try to finalize+update - which would then mean we'd need an rpm-ostree API to proxy the state, or for zincati to monitor it too...

It seems actually simpler to have this logic in zincati.

The way I'm thinking of this now: for example, I think we should do a similar thing in the MCO: openshift/machine-config-operator#2163 (comment)
It's hard to avoid "pushing out" the monitoring here to the "end component".

@kelvinfan001
Member

I was thinking rpm-ostree finalize-deployment could error out. This would fit at least Zincati's purposes. Zincati will retry at its regular cadence of about 5 minutes if the CLI call fails. So for Zincati specifically, it seems to me that we don't really need Zincati to monitor it.

@cgwalters
Member

In a fleet lock scenario (much like the MCO) what I think we want here is for the updater to avoid held nodes, not to pick one and keep trying to finalize until it unblocks, right?

There's also a power/CPU efficiency argument here around edge-triggering on when a block is lifted versus effectively polling.

OTOH, I understand retries may fit in better to the zincati state machine.

@cgwalters
Member

I was thinking rpm-ostree finalize-deployment could error out.

That said I agree with this; particularly since only Zincati uses it right now, and making that change doesn't conflict with having zincati do the monitoring either.

@kelvinfan001
Member

In a fleet lock scenario (much like the MCO) what I think we want here is for the updater to avoid held nodes, not to pick one and keep trying to finalize until it unblocks, right?

Ahh I see, admittedly I hadn't thought of this. But yes, this makes total sense. I agree that "end components" like Zincati/MCO should have their own monitoring; for Zincati's case, it'd want to communicate that to fleet_lock.

But additionally, I'm assuming we still want quick check logic in rpm-ostree, almost mirroring the functionality of systemd's --check-inhibitors, except tweaked slightly to filter out delay locks (and we don't need to wait for systemd v248). This way, we at least make it possible to inhibit rpm-ostree using systemd inhibitors (e.g. rpm-ostree upgrade -r respects inhibitors).

@cgwalters
Member

Agreed!

@dustymabe
Member

This came up in a discussion today with the podman team. For users of podman machine in a desktop environment, they would like a way to notify the user that an update exists and is staged, but not allow the update to continue until the user clicks "OK, do update". One way to implement that would be to place a file telling the machine not to continue the update. Or maybe something like #498 would be better here, but podman-machine would still need to know an update was ready.

@lucab
Contributor Author

lucab commented Feb 22, 2022

@dustymabe et al., the podman-specific usecase is tracked at #539. It is currently missing the actual requirements/constraints needed to design an effective solution. See my initial reply there.

@dustymabe
Member

Thank you @lucab for pointing me in the right direction.
