You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With cronosd configured with --log-level info writing to the systemd journal, I get ~2GB of (compressed!) journals produced per day. That's an order of magnitude greater log volume than any other blockchain node we run. It's to the point that journald is using a nontrivial fraction of a CPU core compressing all the log data coming in, and is potentially slowing down the node when backpressure from journald accepting logs causes cronosd's stdout to block in write(2).
With cronosd configured with the next-higher-up log-level, --log-level warn, the log is still pretty noisy, but better; but now there are no lifecycle messages. If I'm debugging a startup failure, I can't just watch the log to tell when the node has started successfully. Instead, I have to use the node's own status endpoint. This is pretty annoying/unusual from a sysadmin perspective, and may be untenable, if I'm e.g. trying to diagnose a problem that caused a VM restart from VM serial-console output.
Expected behavior
Notice log-level
cronosd should have a notice log-level (one higher than info), and important lifecycle messages — e.g. "the node is now ready to serve RPC connections", or "the node has started/stopped making sync progress" — should be output at this log-level.
Examples of messages that should be notice:
all Starting %s messages at startup
commit synced
committed state
executed block
notice messages should either be one-time per node run, or at least infrequent (see also: Geth's approach of keeping a last-emitted timestamp, and only emitting lifecycle messages if sufficient time has elapsed since they were last emitted)
(Alternately, info can be used for lifecycle messages, if all non-lifecycle messages are demoted from info to some lower level. This is what Geth does, but I don't recommend this approach. There are definitely messages you want to see while testing your configuration of a node, without all the noise of development-time debugging messages. These are what info is for. While notice is for the messages that remain relevant once the node is configured and running in production in a "stable" configuration.)
O(N) log messages
cronosd should not be emitting any kind of O(N)-in-number-of-log-lines events per work step, except at the finest-grained log-level (usually trace)
Examples of O(N) messages:
burned tokens + minted coins
Protocol errors
cronosd should treat predictable "protocol errors" that don't violate runtime-validity preconditions of the local node as info, not errors. The error log-level is for the attention of the sysadmin running the node; if an "error" is, from the operator's perspective, the node reacting as desired to bad behavior from network peers, then it's not an error, it's just info, or even just debug.
Examples of messages that shouldn't be error:
dialing failed
Stopping peer for error
error while stopping connection (except special cases; this is almost always due to the peer having already closed their side of the connection, and so being unavailable to receive the shutdown packet)
I understand that these messages might be relevant to a sysadmin if the node is finding itself unable to connect to any peers. But in such cases, the node isn't doing anything productive at that point anyway, so the sysadmin could simply restart the node with a higher log-level. (I would also suggest having a periodic notice lifecycle message with a metrics-like summary of counts of active, failed, and stopped peers, which would be enough for a sysadmin to tell that the node isn't successfully making any connections.)
Non-framework logs
Messages should always go through the logging subsystem and be filtered by log-level.
Examples of messages not using the logging framework + incorrectly being printed when the log-level is set above them:
Some assertion-failure logic paths should be considered an error to raise to the attention of a sysadmin if they occur during sync; but not if they occur during the execution of an RPC query, where the user's own input is what is failing the assertion. For RPC nodes specifically, ensure that errors that are directly caused by user-provided RPC parameters, are caught/discarded, or at most only bubble up to a single log message logging the result of the RPC call.
Examples of errors that should be suppressed if they occur under an RPC call:
account not found error="account crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e does not exist: unknown address" cosmos-address=crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e ethereum-address=0x3500266E5F9057629e0E134674D35fe150Dd6941 module=evm
The text was updated successfully, but these errors were encountered:
Describe the bug
With
cronosd
configured with--log-level info
writing to the systemd journal, I get ~2GB of (compressed!) journals produced per day. That's an order of magnitude greater log volume than any other blockchain node we run. It's to the point thatjournald
is using a nontrivial fraction of a CPU core compressing all the log data coming in, and is potentially slowing down the node when backpressure from journald accepting logs causescronosd
's stdout to block inwrite(2)
.With
cronosd
configured with the next-higher-up log-level,--log-level warn
, the log is still pretty noisy, but better; but now there are no lifecycle messages. If I'm debugging a startup failure, I can't just watch the log to tell when the node has started successfully. Instead, I have to use the node's own status endpoint. This is pretty annoying/unusual from a sysadmin perspective, and may be untenable, if I'm e.g. trying to diagnose a problem that caused a VM restart from VM serial-console output.Expected behavior
Notice log-level
cronosd
should have anotice
log-level (one higher thaninfo
), and important lifecycle messages — e.g. "the node is now ready to serve RPC connections", or "the node has started/stopped making sync progress" — should be output at this log-level.Examples of messages that should be
notice
:Starting %s
messages at startupcommit synced
committed state
executed block
notice
messages should either be one-time per node run, or at least infrequent (see also: Geth's approach of keeping a last-emitted timestamp, and only emitting lifecycle messages if sufficient time has elapsed since they were last emitted)(Alternately,
info
can be used for lifecycle messages, if all non-lifecycle messages are demoted frominfo
to some lower level. This is what Geth does, but I don't recommend this approach. There are definitely messages you want to see while testing your configuration of a node, without all the noise of development-time debugging messages. These are whatinfo
is for. Whilenotice
is for the messages that remain relevant once the node is configured and running in production in a "stable" configuration.)O(N) log messages
cronosd
should not be emitting any kind of O(N)-in-number-of-log-lines events per work step, except at the finest-grained log-level (usuallytrace
)Examples of O(N) messages:
burned tokens
+minted coins
Protocol errors
cronosd
should treat predictable "protocol errors" that don't violate runtime-validity preconditions of the local node asinfo
, noterror
s. Theerror
log-level is for the attention of the sysadmin running the node; if an "error" is, from the operator's perspective, the node reacting as desired to bad behavior from network peers, then it's not anerror
, it's justinfo
, or even justdebug
.Examples of messages that shouldn't be
error
:dialing failed
Stopping peer for error
error while stopping connection
(except special cases; this is almost always due to the peer having already closed their side of the connection, and so being unavailable to receive the shutdown packet)I understand that these messages might be relevant to a sysadmin if the node is finding itself unable to connect to any peers. But in such cases, the node isn't doing anything productive at that point anyway, so the sysadmin could simply restart the node with a higher log-level. (I would also suggest having a periodic
notice
lifecycle message with a metrics-like summary of counts of active, failed, and stopped peers, which would be enough for a sysadmin to tell that the node isn't successfully making any connections.)Non-framework logs
Messages should always go through the logging subsystem and be filtered by log-level.
Examples of messages not using the logging framework + incorrectly being printed when the log-level is set above them:
t=2022-03-17T15:46:56+0000 lvl=dbug msg="Served eth_getBalance" conn=127.0.0.1:57712 reqid=4560398 t=43.942675m
Differential logging for RPC
Some assertion-failure logic paths should be considered an
error
to raise to the attention of a sysadmin if they occur during sync; but not if they occur during the execution of an RPC query, where the user's own input is what is failing the assertion. For RPC nodes specifically, ensure that errors that are directly caused by user-provided RPC parameters, are caught/discarded, or at most only bubble up to a single log message logging the result of the RPC call.Examples of errors that should be suppressed if they occur under an RPC call:
account not found error="account crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e does not exist: unknown address" cosmos-address=crc1x5qzvmjljptk98swzdr8f56lu9gd662p5y026e ethereum-address=0x3500266E5F9057629e0E134674D35fe150Dd6941 module=evm
The text was updated successfully, but these errors were encountered: