zincati service fails to start if non-conforming ostree deployment exists #859

Closed

dustymabe opened this issue Oct 11, 2022 · 15 comments

@dustymabe
Member

Bug Report

I was recently experimenting with the quay.io/fedora/fedora-coreos:next-devel container on one of my systems. I rebased the system to it using the following command and left it that way for a few releases (manually running rpm-ostree upgrade each cycle).

sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next-devel

I then decided to go back to having automatic updates working (zincati + OSTree repo) so I rebased back to what I had to begin with:

sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next

and ended up with:

[core@weevm ~]$ rpm-ostree status 
State: idle
Deployments:
● fedora:fedora/x86_64/coreos/next
                  Version: 37.20221003.1.0 (2022-10-03T17:40:38Z)
                   Commit: df296c944c11261f69f0e3b26d04b7e4016e3f5dd38cd3cc4cd05deab72f7474
             GPGSignature: Valid signature by ACB5EE4E831C74BB7C168D27F55AD3FB5323552A

  ostree-unverified-registry:quay.io/fedora/fedora-coreos:next-devel
                   Digest: sha256:10e906cc2514a8098343eb8e813d59e549af47fb43ecbc2345f3c40a0d3ba702
                Timestamp: 2022-09-29T19:39:44Z

However, I noticed that zincati now fails to start:

Oct 11 19:09:43 weevm systemd[1]: Starting zincati.service - Zincati Update Agent...
Oct 11 19:09:43 weevm zincati[442167]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.24)
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati] critical error: failed to assemble configuration settings
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> failed to validate agent identity configuration
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> failed to build default identity
Oct 11 19:09:43 weevm zincati[442167]: [ERROR zincati]  -> missing field `coreos-assembler.basearch` at line 77 column 7
Oct 11 19:09:43 weevm systemd[1]: zincati.service: Main process exited, code=exited, status=1/FAILURE
Oct 11 19:09:43 weevm systemd[1]: zincati.service: Failed with result 'exit-code'.
Oct 11 19:09:43 weevm systemd[1]: Failed to start zincati.service - Zincati Update Agent.
Oct 11 19:09:54 weevm systemd[1]: zincati.service: Scheduled restart job, restart counter is at 52512.
Oct 11 19:09:54 weevm systemd[1]: Stopped zincati.service - Zincati Update Agent.

This is because some metadata is apparently missing from the container deployment, which causes zincati to bail out instead of continuing on a best-effort basis. I would assume that since the booted deployment is good, zincati should be able to continue.

After cleaning up the rollback deployment with sudo rpm-ostree cleanup -r, zincati was able to start.

Environment

QEMU - FCOS at 37.20221003.1.0

Expected Behavior

Able to start zincati.

Actual Behavior

zincati fails to start up.

Reproduction Steps

  1. boot system
  2. sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next and reboot
  3. sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next and reboot
  4. zincati.service should now fail to start because the rollback deployment doesn't conform (a consolidated command sketch follows below).
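
For convenience, here is the same reproducer as a single command sequence (a sketch only; the rebase targets and reboots are taken from the steps above, and exact output may vary):

sudo rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/fedora/fedora-coreos:next
sudo systemctl reboot
# after the first reboot:
sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next
sudo systemctl reboot
# after the second reboot, the unit should be failing:
systemctl status zincati.service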

Other Information

@lucab
Contributor

lucab commented Oct 12, 2022

Thanks for the report and the lengthy reproducer. I do agree this is unexpected behavior; Zincati should be able to proceed based on the metadata of the booted deployment.

Looking into the inner logic, I think you are actually hitting a bug in rpm-ostree, which is not properly handling the combination of --json and --booted in its status output. If you look at your node while in the intermediate rollback state, you'll observe this:

$ rpm-ostree status --json --booted | jq '.deployments | length'
2

Thus Zincati sees two deployments and tries to deserialize both of them. The "is-booted" filtering would happen after that, but the containerized deployment cannot be parsed as it is missing some of the required fields.
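
For comparison, filtering to the booted deployment on the client side (an illustrative jq query, not something Zincati actually runs) shows what a correctly behaving --booted query ought to return:

$ rpm-ostree status --json | jq '[.deployments[] | select(.booted)] | length'
1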

Searching through the bug tracker, this has already been reported in the past: coreos/rpm-ostree#2829.

@dustymabe
Member Author

Thanks for looking into it @lucab - I guess we should try to prioritize that rpm-ostree issue if we want to fix this problem.

@jlebon
Copy link
Member

jlebon commented Oct 12, 2022

Though ideally it seems like the metadata Zincati is looking for should be retained even in the container case.

@dustymabe
Member Author

Though ideally it seems like the metadata Zincati is looking for should be retained even in the container case.

Indeed that is another thing that comes out of this that we should fix.

@lucab
Contributor

lucab commented Oct 12, 2022

I do agree with all the statements above.

Additionally, I think it would be good to improve the Zincati logic anyway to make it slightly more resilient against this kind of unexpected local state. Right now we expect all of the following fields:

/// Partial deployment object (only fields relevant to zincati).
#[derive(Clone, Debug, Deserialize)]
#[serde(rename_all = "kebab-case")]
pub struct DeploymentJson {
    booted: bool,
    base_checksum: Option<String>,
    #[serde(rename = "base-commit-meta")]
    base_metadata: BaseCommitMetaJson,
    checksum: String,
    // NOTE(lucab): missing field means "not staged".
    #[serde(default)]
    staged: bool,
    version: String,
}

/// Metadata from base commit (only fields relevant to zincati).
#[derive(Clone, Debug, Deserialize)]
struct BaseCommitMetaJson {
    #[serde(rename = "coreos-assembler.basearch")]
    basearch: String,
    #[serde(rename = "fedora-coreos.stream")]
    stream: String,
}

But we could push these requirements a bit further down in the flow.
We strictly need that data for the booted deployment, and for other deployments that we may want to skip (e.g. rolled back faulty FCOS releases).
It should be safe to just ignore deployments that are not booted, or that are missing the fields required for proper introspection.
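
As a quick illustration of that last point, the same JSON can be inspected to see which deployments actually carry the fields Zincati requires; in the state described above, the container deployment would come back with a null basearch (illustrative jq only, not Zincati's code path):

$ rpm-ostree status --json | jq '.deployments[] | {booted, basearch: .["base-commit-meta"]["coreos-assembler.basearch"]}'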

@cgwalters
Member

cgwalters commented Nov 1, 2022

I did #876 related to this.

(Also coreos/rpm-ostree#2829 is now fixed)

@craigcabrey

Is there a mitigation/workaround for this problem while we wait for a release with fixes?

@dustymabe
Member Author

I think my workaround at the time was to clean up the non-booted deployment. I can't remember whether I used rpm-ostree cleanup -r or rpm-ostree cleanup -p.
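
For reference, the relevant rpm-ostree cleanup flags are -p (remove the pending deployment) and -r (remove the rollback deployment); in the state described in this issue, the non-conforming deployment was the rollback one:

# remove the pending (not yet booted) deployment
sudo rpm-ostree cleanup -p
# remove the rollback (previously booted) deployment
sudo rpm-ostree cleanup -r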

@craigcabrey

Ah. In my case I intend to stay on an OCI image, so there are no deployments to remove. Zincati is just broken on these systems.

@dustymabe
Member Author

Yes, and it won't work at all until coreos/fedora-coreos-tracker#1263 is implemented, so you can just disable zincati.service for now. It does mean that you own your updates, though.

@craigcabrey

That's what I figured, thanks for the confirmation!

cgwalters added a commit to cgwalters/zincati that referenced this issue Nov 3, 2023
The previous work resulted in an error, but let's just
exit with status zero because otherwise we end up tripping up
things like checks for failing units.

Anyone who has rebased into a container has very clearly
taken explicit control over the wheel and there's no
point in us erroring out.

Closes: coreos#859
@cgwalters
Member

This one has been fixed for a while; I believe it was fixed all the way back by #876

@barnscott

I appreciate that this issue is closed, but I'm experiencing an issue that looks similar, and I was hoping to get confirmation on whether this functionality is expected to work for custom container images. In the previous comments, I believe @dustymabe says the functionality is broken, while @cgwalters suggests it might be working.

Config:

[updates]
strategy = "immediate"

Error:

Mar 25 13:44:09 mb01 systemd[1]: Starting zincati.service - Zincati Update Agent...
Mar 25 13:44:09 mb01 zincati[11851]: [INFO  zincati::cli::agent] starting update agent (zincati 0.0.27)
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati] error: failed to assemble configuration settings
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to validate agent identity configuration
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to build default identity
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> failed to introspect booted OS image
Mar 25 13:44:09 mb01 zincati[11851]: [ERROR zincati]  -> Automatic updates disabled; booted into container image ostree-unverified-regi>
Mar 25 13:44:09 mediabarn systemd[1]: Started zincati.service - Zincati Update Agent.
Mar 25 13:44:09 mediabarn systemd[1]: zincati.service: Deactivated successfully

@dustymabe
Member Author

dustymabe commented Mar 25, 2024

Right now, if you rebase to a custom container image, you own the updates: i.e. you need to push a newly built container to the registry/repo and either run the update manually (rpm-ostree upgrade) or set it up on a timer.

In this case it would be best to just systemctl disable zincati since it won't be useful.
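
A minimal sketch of that approach, assuming you are fine owning updates yourself:

# stop Zincati from trying (and failing) to manage updates
sudo systemctl disable --now zincati.service
# pull and deploy the latest build of your container image manually
# (or run this from a systemd timer), rebooting into it afterwards
sudo rpm-ostree upgrade -r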

@barnscott

Got it, thank you! A timer will work for my immediate requirement, but I have another use case for fleet-lock, so I'll keep an eye on: coreos/fedora-coreos-tracker#1263
