
RFE: output log message when contacting the update server recovers #407

Open
dustymabe opened this issue Nov 19, 2020 · 4 comments

@dustymabe
Member

Feature Request

As a Linux user I've been conditioned to look at logs. When I'm interested in how Zincati is doing, I check systemd/journald and I see:

Nov 07 01:27:38 localhost systemd[1]: Started Zincati Update Agent.
Nov 07 01:27:38 localhost zincati[708]: [INFO ] starting update agent (zincati 0.0.13)
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] Cincinnati service: https://updates.coreos.fedoraproject.org
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] agent running on node '7f45197f6bd445a1b118c8c91776ed70', in update group 'default'
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] initialization complete, auto-updates logic enabled
Nov 11 18:11:48 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
Nov 11 22:14:13 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
Nov 18 09:09:39 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)

My first thought is, "oh no, the update server must be having trouble. Or maybe my node's networking is off. Either way, is my node going to receive updates?".

Arguably the user (me) should know that the better way to check the running state is the exposed metrics, but that isn't easily discoverable on the system without reading the docs. I have two proposals to band-aid over the user's knee-jerk reaction and also lead them to the right way to check things.

Desired Feature

  1. When the agent successfully contacts the update server after having previously failed, it logs a message (see the sketch after this section):
[ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
[ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
[INFO ] successfully contacted Cincinnati update server for updates
  2. Lead the user towards the exposed metrics in the startup log messages

A crude example would be something like:

[INFO ] starting update agent (zincati 0.0.13)
[INFO ] Cincinnati service: https://updates.coreos.fedoraproject.org
[INFO ] agent running on node '7f45197f6bd445a1b118c8c91776ed70', in update group 'default'
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] exposed metrics about currently running state are available at /run/zincati/public/metrics.promsock
[INFO ] use `sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock` to expose in a terminal
[INFO ] see https://coreos.github.io/zincati/usage/metrics/ for more info

This is a bit verbose and is about the most info we could provide; a variant of this should suffice.
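
For point 1, a minimal sketch of what tracking and logging the recovery could look like (hypothetical names; not Zincati's actual code):

```rust
// Hypothetical sketch: count consecutive Cincinnati check failures and
// emit a single INFO line once a check succeeds again.
struct CheckTracker {
    consecutive_failures: u64,
}

impl CheckTracker {
    fn record_failure(&mut self, err: &str) {
        self.consecutive_failures += 1;
        log::error!("failed to check Cincinnati for updates: {}", err);
    }

    fn record_success(&mut self) {
        if self.consecutive_failures > 0 {
            log::info!(
                "successfully contacted Cincinnati update server after {} failed check(s)",
                self.consecutive_failures
            );
        }
        self.consecutive_failures = 0;
    }
}
```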

@lucab
Contributor

lucab commented Nov 19, 2020

I don't think that point 2 is useful. We already emit such a message at debug level.
Promoting it to info level won't hurt, but the reality is that a whole metric-gathering infrastructure is missing. You are really interested in how the state evolves over time, not in a just-in-time snapshot, and in a history-querying interface, which a raw socat can't provide.

To that end, a full tutorial in fcos-docs showing how to plumb Zincati metrics into Prometheus and Grafana would definitely be more useful. I'll try to find some time to write that.

@dustymabe
Member Author

What if we limited it down to just an INFO with a link to the metrics docs webpage?

That webpage could then provide both the dumb `sudo socat` one-liner and the more comprehensive information you mention.
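
For reference, reading that socket just means connecting and dumping whatever Zincati writes back; a rough equivalent of the socat one-liner (assuming the socket path from the issue description):

```rust
use std::io::Read;
use std::os::unix::net::UnixStream;

// Connect to Zincati's metrics socket and print the Prometheus-format text it returns.
fn main() -> std::io::Result<()> {
    let mut stream = UnixStream::connect("/run/zincati/public/metrics.promsock")?;
    let mut metrics = String::new();
    stream.read_to_string(&mut metrics)?;
    print!("{}", metrics);
    Ok(())
}
```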

@lucab
Contributor

lucab commented Nov 19, 2020

Regarding point 1, it's kind of reasonable, even though metrics are really the right tool for this.

Specifically, your client observed a total of three backend glitches in more than 10 days. What you don't see there is the number of non-error requests it completed in the same timespan. Metrics capture both; they help put this in perspective (my quick math on your logs gives a 0.09% error rate) and let you track the non-error counter in real time.
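
(For illustration, assuming Zincati's default check interval of roughly five minutes: the ~11 days covered by those logs correspond to on the order of 3,300 checks, and 3 failures out of ~3,300 is about 0.09%.)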

An info message like "successful contact after X errors" can be added to break apart long streaks, agreed.
However, I expect most cases to look like your logs, where sporadic (once a week) errors are each followed closely by a "recovered after 1 error" message.
Not that it hurts much, but it doubles the number of log entries for something that is already collected via metrics. At a cluster level, this increases storage cost and pressure on the log aggregation service.

@kelvinfan001
Member

kelvinfan001 commented Jun 7, 2021

Related: Zincati now exposes its current state via a systemd StatusText, so running `systemctl status zincati.service` will show that Zincati is alive and well with a message like `periodically polling for updates (last checked ...)`. However, unless the user realizes that the (probably recent) last check didn't result in an error, they wouldn't immediately know that the checks were successful and that the errors in the journal were flakes, so maybe this doesn't directly help with this issue.
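
For reference, StatusText is set through systemd's sd_notify protocol; a rough sketch of driving it directly (Zincati presumably goes through a library wrapper, and this skips details such as abstract-namespace sockets):

```rust
use std::os::unix::net::UnixDatagram;

// Rough sketch of the sd_notify protocol: send a "STATUS=..." datagram to the
// socket systemd passes in $NOTIFY_SOCKET. Abstract-namespace socket paths
// (those starting with '@') are not handled here.
fn set_status(text: &str) -> std::io::Result<()> {
    if let Ok(path) = std::env::var("NOTIFY_SOCKET") {
        let sock = UnixDatagram::unbound()?;
        sock.send_to(format!("STATUS={}", text).as_bytes(), path)?;
    }
    Ok(())
}

fn main() {
    // Example status line similar to the one shown by `systemctl status zincati.service`.
    let _ = set_status("periodically polling for updates");
}
```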
