
RFE: output log message when contacting the update server recovers #407

Open
dustymabe opened this issue Nov 19, 2020 · 4 comments

@dustymabe
Member

Feature Request

As a Linux user I've been conditioned to look at logs. When I'm interested in how Zincati is doing, I check systemd/journald and I see:

Nov 07 01:27:38 localhost systemd[1]: Started Zincati Update Agent.
Nov 07 01:27:38 localhost zincati[708]: [INFO ] starting update agent (zincati 0.0.13)
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] Cincinnati service: https://updates.coreos.fedoraproject.org
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] agent running on node '7f45197f6bd445a1b118c8c91776ed70', in update group 'default'
Nov 07 01:27:48 ip-10-0-1-155 zincati[708]: [INFO ] initialization complete, auto-updates logic enabled
Nov 11 18:11:48 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
Nov 11 22:14:13 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
Nov 18 09:09:39 ip-10-0-1-155 zincati[708]: [ERROR] failed to check Cincinnati for updates: server-side error, code 502: (unknown/generic server error)

My first thought is, "oh no, the update server must be having trouble. Or maybe my node's networking is off. Either way, is my node going to receive updates?".

Arguably the user (me) should know that the better way to check the running state is the exposed metrics, but that isn't easily discoverable on the system without reading the docs. I have two proposals to band-aid over the user's knee-jerk reaction and also lead them to the right way to check things.

Desired Feature

  1. When the agent successfully contacts the update server after having previously failed, it logs a message (see the sketch after this section):
[ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
[ERROR] failed to check Cincinnati for updates: server-side error, code 503: (unknown/generic server error)
[INFO ] successfully contacted Cincinnati update server for updates
  2. Lead the user towards the exposed metrics in the startup log messages

A crude example would be something like:

[INFO ] starting update agent (zincati 0.0.13)
[INFO ] Cincinnati service: https://updates.coreos.fedoraproject.org
[INFO ] agent running on node '7f45197f6bd445a1b118c8c91776ed70', in update group 'default'
[INFO ] initialization complete, auto-updates logic enabled
[INFO ] exposed metrics about currently running state are available at /run/zincati/public/metrics.promsock
[INFO ] use `sudo socat - UNIX-CONNECT:/run/zincati/public/metrics.promsock` to expose in a terminal
[INFO ] see https://coreos.github.io/zincati/usage/metrics/ for more info

This is a bit verbose and is about the most info we could provide; a variant of this should suffice.
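
For point 1, a minimal sketch of what tracking and logging the recovery could look like (hypothetical names; not Zincati's actual code):

```rust
// Hypothetical sketch: count consecutive Cincinnati check failures and
// emit a single INFO line once a check succeeds again.
struct CheckTracker {
    consecutive_failures: u64,
}

impl CheckTracker {
    fn record_failure(&mut self, err: &str) {
        self.consecutive_failures += 1;
        log::error!("failed to check Cincinnati for updates: {}", err);
    }

    fn record_success(&mut self) {
        if self.consecutive_failures > 0 {
            log::info!(
                "successfully contacted Cincinnati update server after {} failed check(s)",
                self.consecutive_failures
            );
        }
        self.consecutive_failures = 0;
    }
}
```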

@lucab
Contributor

lucab commented Nov 19, 2020

I don't think that point 2 is useful. We already emit such a message at debug level.
Promoting it to info level won't hurt, but the reality is that a whole metric-gathering infrastructure is missing. You are really interested in how the state evolves over time, not in a just-in-time snapshot, and in a history-querying interface, which a raw socat can't provide.

To that end, a full tutorial in fcos-docs showing how to plumb Zincati metrics into Prometheus and Grafana would definitely be more useful. I'll try to find some time to write that.

@dustymabe
Member Author

What if we limited it down to just an INFO with a link to the metrics docs webpage?

That webpage could then provide both the dumb `sudo socat` one-liner and the more comprehensive information you mention.
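
For reference, reading that socket just means connecting and dumping whatever Zincati writes back; a rough equivalent of the socat one-liner (assuming the socket path from the issue description):

```rust
use std::io::Read;
use std::os::unix::net::UnixStream;

// Connect to Zincati's metrics socket and print the Prometheus-format text it returns.
fn main() -> std::io::Result<()> {
    let mut stream = UnixStream::connect("/run/zincati/public/metrics.promsock")?;
    let mut metrics = String::new();
    stream.read_to_string(&mut metrics)?;
    print!("{}", metrics);
    Ok(())
}
```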

@lucab
Contributor

lucab commented Nov 19, 2020

Regarding point 1, it's kind of reasonable, even though metrics are really the right tool for this.

Specifically, your client observed a total of three backend glitches in more than 10 days. What you don't see there is the number of non-error requests it completed in the same timespan. Metrics capture both; they help put this in perspective (my quick math on your logs gives a 0.09% error rate) and let you track the non-error counter in real time.
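
(For illustration, assuming Zincati's default check interval of roughly five minutes: the ~11 days covered by those logs correspond to on the order of 3,300 checks, and 3 failures out of ~3,300 is about 0.09%.)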

An info message like "successful contact after X errors" can be added to break apart long streaks, agreed.
However, I expect most cases to look like your logs, where sporadic (once a week) errors are each followed closely by a "recovered after 1 error" message.
Not that it hurts much, but it doubles the number of log entries for something that is already collected via metrics. At a cluster level, this increases storage cost and pressure on the log aggregation service.

@kelvinfan001
Member

kelvinfan001 commented Jun 7, 2021

Related: Zincati now exposes its current state via a systemd StatusText, so running `systemctl status zincati.service` will show that Zincati is alive and well with a message like `periodically polling for updates (last checked ...)`. However, unless the user realizes that the (probably recent) last check didn't result in an error, they wouldn't immediately know that the checks were successful and that the errors in the journal were flakes, so maybe this doesn't directly help with this issue.
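
For reference, StatusText is set through systemd's sd_notify protocol; a rough sketch of driving it directly (Zincati presumably goes through a library wrapper, and this skips details such as abstract-namespace sockets):

```rust
use std::os::unix::net::UnixDatagram;

// Rough sketch of the sd_notify protocol: send a "STATUS=..." datagram to the
// socket systemd passes in $NOTIFY_SOCKET. Abstract-namespace socket paths
// (those starting with '@') are not handled here.
fn set_status(text: &str) -> std::io::Result<()> {
    if let Ok(path) = std::env::var("NOTIFY_SOCKET") {
        let sock = UnixDatagram::unbound()?;
        sock.send_to(format!("STATUS={}", text).as_bytes(), path)?;
    }
    Ok(())
}

fn main() {
    // Example status line similar to the one shown by `systemctl status zincati.service`.
    let _ = set_status("periodically polling for updates");
}
```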
