zincati service fails to start if non-conforming ostree deployment exists #859

Closed

dustymabe opened this issue Oct 11, 2022 · 15 comments

@dustymabe
Member

Bug Report

I was recently experimenting with the quay.io/fedora/fedora-coreos:next-devel container on one of my systems. I rebased the system to it using the following command and left it that way for a few releases (manually running rpm-ostree upgrade each cycle).

sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next-devel

I then decided to go back to having automatic updates working (zincati + OSTree repo) so I rebased back to what I had to begin with:

sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next

and ended up with:

[core@weevm ~]$ rpm-ostree status 
State: idle
Deployments:
● fedora:fedora/x86_64/coreos/next
                  Version: 37.20221003.1.0 (2022-10-03T17:40:38Z)
                   Commit: df296c944c11261f69f0e3b26d04b7e4016e3f5dd38cd3cc4cd05deab72f7474
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A

  ostree-unverified-registry:quay.io/fedora/fedora-coreos:next-devel
                   Digest: sha256:10e906cc2514a8098343eb8e813d59e549af47fb43ecbc2345f3c40a0d3ba702
                Timestamp: 2022-09-29T19:39:44Z

However, I noticed that zincati now fails to start:

Oct 11 19:09:43 weevm systemd[1]: Starting zincati.service - Zincati Update Agent...
Oct 11 19:09:43 weevm zincati[442167]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.24)
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati] critical error: failed to assemble configuration settings
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> failed to validate agent identity configuration
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> failed to build default identity
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> missing field `coreos-assembler.basearch` at line 77 column 7
Oct 11 19:09:43 weevm systemd[1]: zincati.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 19:09:43 weevm systemd[1]: zincati.service: Failed with result 'exit-code'.
Oct 11 19:09:43 weevm systemd[1]: Failed to start zincati.service - Zincati Update Agent.
Oct 11 19:09:54 weevm systemd[1]: zincati.service: Scheduled restart job, restart counter is at 52512.
Oct 11 19:09:54 weevm systemd[1]: Stopped zincati.service - Zincati Update Agent.

This is because some metadata is apparently missing from the container deployment, which causes zincati to bail out instead of continuing on a best-effort basis. I would assume that since the booted deployment is good, zincati should be able to continue.

After cleaning up the rollback deployment with sudo rpm-ostree cleanup -r, zincati was able to start.

Environment

QEMU - FCOS at 37.20221003.1.0

Expected Behavior

Able to start zincati.

Actual Behavior

zincati fails to start up.

Reproduction Steps

  1. boot system
  2. sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next and reboot
  3. sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next and reboot
  4. zincati.service should now fail to start because the rollback deployment doesn't conform (a consolidated command sketch follows below).
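
For convenience, here is the same reproducer as a single command sequence (a sketch only; the rebase targets and reboots are taken from the steps above, and exact output may vary):

sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next
sudo systemctl reboot
# after the first reboot:
sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next
sudo systemctl reboot
# after the second reboot, the unit should be failing:
systemctl status zincati.service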

Other Information

@lucab
Contributor

lucab commented Oct 12, 2022

Thanks for the report and the lengthy reproducer. I do agree this is unexpected behavior; Zincati should be able to proceed based on the metadata of the booted deployment.

Looking into the inner logic, I think you are actually hitting a bug in rpm-ostree, which is not properly handling the combination of --json and --booted in its status output. If you look at your node while in the intermediate rollback state, you'll observe this:

$ rpm-ostree status --json --booted | jq '.deployments | length'
2

Thus Zincati sees two deployments and tries to deserialize both of them. The "is-booted" filtering would happen after that, but the containerized deployment cannot be parsed as it is missing some of the required fields.
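
For comparison, filtering to the booted deployment on the client side (an illustrative jq query, not something Zincati actually runs) shows what a correctly behaving --booted query ought to return:

$ rpm-ostree status --json | jq '[.deployments[] | select(.booted)] | length'
1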

Searching through the bug tracker, this has already been reported in the past: coreos/rpm-ostree#2829.

@dustymabe
Member Author

Thanks for looking into it @lucab - I guess we should try to prioritize that rpm-ostree issue if we want to fix this problem.

@jlebon
Copy link
Member

jlebon commented Oct 12, 2022

Though ideally it seems like the metadata Zincati is looking for should be retained even in the container case.

@dustymabe
Member Author

Though ideally it seems like the metadata Zincati is looking for should be retained even in the container case.

Indeed that is another thing that comes out of this that we should fix.

@lucab
Contributor

lucab commented Oct 12, 2022

I do agree with all the statements above.

Additionally, I think it would be good to improve the Zincati logic anyway to make it slightly more resilient against this kind of unexpected local state. Right now we expect all of the following fields:

/// Partial deployment object (only fields relevant to zincati).
#[derive(Clone, Debug, Deserialize)]
#[serde(rename_all = "kebab-case")]
pub struct DeploymentJson {
    booted: bool,
    base_checksum: Option<String>,
    #[serde(rename = "base-commit-meta")]
    base_metadata: BaseCommitMetaJson,
    checksum: String,
    // NOTE(lucab): missing field means "not staged".
    #[serde(default)]
    staged: bool,
    version: String,
}

/// Metadata from base commit (only fields relevant to zincati).
#[derive(Clone, Debug, Deserialize)]
struct BaseCommitMetaJson {
    #[serde(rename = "coreos-assembler.basearch")]
    basearch: String,
    #[serde(rename = "fedora-coreos.stream")]
    stream: String,
}

But we could push these requirements a bit further down in the flow.
We strictly need that data for the booted deployment, and for other deployments that we may want to skip (e.g. rolled back faulty FCOS releases).
It should be safe to just ignore deployments that are not booted, or that are missing the fields required for proper introspection.
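
As a quick illustration of that last point, the same JSON can be inspected to see which deployments actually carry the fields Zincati requires; in the state described above, the container deployment would come back with a null basearch (illustrative jq only, not Zincati's code path):

$ rpm-ostree status --json | jq '.deployments[] | {booted, basearch: .["base-commit-meta"]["coreos-assembler.basearch"]}'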

@cgwalters
Member

cgwalters commented Nov 1, 2022

I did #876 related to this.

(Also coreos/rpm-ostree#2829 is now fixed)

@craigcabrey

Is there a mitigation/workaround for this problem while we wait for a release with fixes?

@dustymabe
Member Author

I think my workaround at the time was to clean up the non-booted deployment. I can't remember whether I used rpm-ostree cleanup -r or rpm-ostree cleanup -p.
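
For reference, the relevant rpm-ostree cleanup flags are -p (remove the pending deployment) and -r (remove the rollback deployment); in the state described in this issue, the non-conforming deployment was the rollback one:

# remove the pending (not yet booted) deployment
sudo rpm-ostree cleanup -p
# remove the rollback (previously booted) deployment
sudo rpm-ostree cleanup -r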

@craigcabrey

Ah. In my case I intend to stay on an OCI image, so there are no deployments to remove. Zincati is just broken on these systems.

@dustymabe
Member Author

Yes, and it won't work at all until coreos/fedora-coreos-tracker#1263 is implemented, so you can just disable zincati.service for now. It does mean that you own your updates, though.

@craigcabrey

That's what I figured, thanks for the confirmation!

cgwalters added a commit to cgwalters/zincati that referenced this issue Nov 3, 2023
The previous work resulted in an error, but let's just
exit with status zero because otherwise we end up tripping up
things like checks for failing units.

Anyone who has rebased into a container has very clearly
taken explicit control over the wheel and there's no
point in us erroring out.

Closes: coreos#859
@cgwalters
Member

This one has been fixed for a while; I believe it was fixed all the way back by #876

@barnscott

I appreciate that this issue is closed, but I'm experiencing an issue that looks similar, and I was hoping to get confirmation on whether this functionality is expected to work for custom container images. In the previous comments, I believe @dustymabe says the functionality is broken, while @cgwalters suggests it might be working.

Config:

[updates]
strategy = "immediate"

Error:

Mar 25 13:44:09 mb01 systemd[1]: Starting zincati.service - Zincati Update Agent...
Mar 25 13:44:09 mb01 zincati[11851]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.27)
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati] error: failed to assemble configuration settings
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to validate agent identity configuration
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to build default identity
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to introspect booted OS image
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> Automatic updates disabled; booted into container image ostree-unverified-regi>
Mar 25 13:44:09 mediabarn systemd[1]: Started zincati.service - Zincati Update Agent.
Mar 25 13:44:09 mediabarn systemd[1]: zincati.service: Deactivated successfully

@dustymabe
Member Author

dustymabe commented Mar 25, 2024

Right now, if you rebase to a custom container image, you own the updates: i.e. you need to push a newly built container to the registry/repo and either run the update manually (rpm-ostree upgrade) or set it up on a timer.

In this case it would be best to just systemctl disable zincati since it won't be useful.
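
A minimal sketch of that approach, assuming you are fine owning updates yourself:

# stop Zincati from trying (and failing) to manage updates
sudo systemctl disable --now zincati.service
# pull and deploy the latest build of your container image manually
# (or run this from a systemd timer), rebooting into it afterwards
sudo rpm-ostree upgrade -r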

@barnscott

Got it, thank you! A timer will work for my immediate requirement, but I have another use case for fleet-lock, so I'll keep an eye on: coreos/fedora-coreos-tracker#1263
