Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cronosd logs are either far too noisy, or not noisy enough #399

Open
tsutsu opened this issue Mar 17, 2022 · 1 comment
Open

cronosd logs are either far too noisy, or not noisy enough #399

tsutsu opened this issue Mar 17, 2022 · 1 comment

Comments

@tsutsu
Copy link

tsutsu commented Mar 17, 2022

Describe the bug

With cronosd configured with --log-level info writing to the systemd journal, I get ~2GB of (compressed!) journals produced per day. That's an order of magnitude greater log volume than any other blockchain node we run. It's to the point that journald is using a nontrivial fraction of a CPU core compressing all the log data coming in, and is potentially slowing down the node when backpressure from journald accepting logs causes cronosd's stdout to block in write(2).

With cronosd configured with the next-higher-up log-level, --log-level warn, the log is still pretty noisy, but better; but now there are no lifecycle messages. If I'm debugging a startup failure, I can't just watch the log to tell when the node has started successfully. Instead, I have to use the node's own status endpoint. This is pretty annoying/unusual from a sysadmin perspective, and may be untenable, if I'm e.g. trying to diagnose a problem that caused a VM restart from VM serial-console output.

Expected behavior

Notice log-level

cronosd should have a notice log-level (one higher than info), and important lifecycle messages — e.g. "the node is now ready to serve RPC connections", or "the node has started/stopped making sync progress" — should be output at this log-level.

Examples of messages that should be notice:

  • all Starting %s messages at startup
  • commit synced
  • committed state
  • executed block

notice messages should either be one-time per node run, or at least infrequent (see also: Geth's approach of keeping a last-emitted timestamp, and only emitting lifecycle messages if sufficient time has elapsed since they were last emitted)

(Alternately, info can be used for lifecycle messages, if all non-lifecycle messages are demoted from info to some lower level. This is what Geth does, but I don't recommend this approach. There are definitely messages you want to see while testing your configuration of a node, without all the noise of development-time debugging messages. These are what info is for. While notice is for the messages that remain relevant once the node is configured and running in production in a "stable" configuration.)

O(N) log messages

cronosd should not be emitting any kind of O(N)-in-number-of-log-lines events per work step, except at the finest-grained log-level (usually trace)

Examples of O(N) messages:

  • burned tokens + minted coins

Protocol errors

cronosd should treat predictable "protocol errors" that don't violate runtime-validity preconditions of the local node as info, not errors. The error log-level is for the attention of the sysadmin running the node; if an "error" is, from the operator's perspective, the node reacting as desired to bad behavior from network peers, then it's not an error, it's just info, or even just debug.

Examples of messages that shouldn't be error:

  • dialing failed
  • Stopping peer for error
  • error while stopping connection (except special cases; this is almost always due to the peer having already closed their side of the connection, and so being unavailable to receive the shutdown packet)

I understand that these messages might be relevant to a sysadmin if the node is finding itself unable to connect to any peers. But in such cases, the node isn't doing anything productive at that point anyway, so the sysadmin could simply restart the node with a higher log-level. (I would also suggest having a periodic notice lifecycle message with a metrics-like summary of counts of active, failed, and stopped peers, which would be enough for a sysadmin to tell that the node isn't successfully making any connections.)

Non-framework logs

Messages should always go through the logging subsystem and be filtered by log-level.

Examples of messages not using the logging framework + incorrectly being printed when the log-level is set above them:

  • t=2022-03-17T15:46:56+0000 lvl=dbug msg="Served eth_getBalance" conn=127.0.0.1:57712 reqid=4560398 t=43.942675m

Differential logging for RPC

Some assertion-failure logic paths should be considered an error to raise to the attention of a sysadmin if they occur during sync; but not if they occur during the execution of an RPC query, where the user's own input is what is failing the assertion. For RPC nodes specifically, ensure that errors that are directly caused by user-provided RPC parameters, are caught/discarded, or at most only bubble up to a single log message logging the result of the RPC call.

Examples of errors that should be suppressed if they occur under an RPC call:

  • account not found error="account crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e does not exist: unknown address" cosmos-address=crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e ethereum-address=0x3500266E5F9057629e0E134674D35fe150Dd6941 module=evm
@yihuang
Copy link
Collaborator

yihuang commented Mar 18, 2022

Totally agree, some of the points that are improved in v0.7.0:

O(N) log messages

The burn/mint logs are removed in v0.7.0, and all the logs during contract execution are under debug level now.

Non-framework logs

The go-ethereum logs are also directed to cosmos-sdk logger, evmos/ethermint#948, for example, the RPC access logs are under debug level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants