You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Our slog is a stream of structured events generated by the swingset kernel and the cosmic-swingset host. Most events actually come in pairs, with other events potentially temporally contained within them. These surrounding events create a context for the nested events, but for terseness and layering reasons, the data from this context is not always repeated within the nested events. When streaming slog entries as-is, this lack of context data on individual events makes it very difficult to query for related events.
Slog events are processed by "slog senders". The most basic one just writes the raw events to a jsonl file. We also have a slog-to-otel converter, but it attempts to create telemetry traces, which are great to represent the causality links in our execution model, but are too complex for simple querying of events. While the current implementation of that sender is due for a revamp (I have some local branches attempting this, and a short description of the design at https://github.com/Agoric/agoric-sdk/blob/mhofman/5724-otel-refactor/packages/telemetry/notes.txt), a more basic contextualizing log sender would enable more simple querying.
Details of the slog entries
While slog events come mostly in pair, the context info we're interested in is not purely nested.
From the JS point of view, the execution is nested as follow:
External event injected into swingset run queue (e.g. timer poll or cosmos message)
Series of cranks, some resulting in deliveries to vats
A delivery can contain a series of syscalls or console events
From the host point of view, we see the following events:
kernel init (along with a bunch of nested events), one time, followed by a repeating series of block events:
begin block (single event)
end block, containing a series of run:
run 0 is always the continuation of any previous unfinished work
some runs for high priority messages
timer poll run
some runs for low priority messages
commit block
A run contains a series of events described above detailing JS execution. If the cumulative amount of execution passes a threshold, the current run is suspended, and the end block completes. As such the events related to a JS execution to quiescence of swingset may be spread over multiple blocks. Also a cosmos message received in a given block may only execute in a later block.
We're interested in the ability to query events by both block and trigger events
Description of the Design
A new slog sender building a context based on previously seen slog events, which streams these augmented events to a logging service which supports querying.
The context can be mostly ephemeral since a restart of the process will always start by processing a block from its beginning (after init). However the exception to this is the context of the trigger event, as the run 0 may continue execution that was started in a previous block. As such the latest trigger event context needs to be persisted (slog senders do not have access to the host DB, and need to arrange their own persistence).
We likely shouldn't tie the implementation to a specific log event receiver, and can likely leverage the otel logging libraries.
Proposed details of slog event context builder
At any point there is a set of potential current context objects: init, block, trigger, crank, replay
These contexts merged together should be used as the base object for the attributes of the log record. Each slog events may have some fields extracted (removed) and instead included as extra log record attributes. The rest will be included as the body of the log record. Unless otherwise noted, the slog event is reported after and thus using the contexts that may have been updated using the event's data.
For all events, the time field is extracted from the slog event and used as timestamp of the log record. The monotime field is also extracted and used as a process.uptime attribute.
For the kernel-init-start event, build an init context with init: true, so that the nested events can easily be filtered out if necessary. Unset the init context after the kernel-init-finish event was reported.
For the cosmic-swingset-begin-block event, create a new block context formed as the following:
block.height: ${blockHeight}
block.time: ${blockTime}
Leave blockHeight and blockTime as body fields of the log record.
For the cosmic-swingset-end-block-start event, assert there is block context. Leave blockHeight and blockTime as body fields of the log record.
Restore any persisted trigger context immediately after reporting the end-block-start so it's available for run 0 events.
For the cosmic-swingset-end-block-finish, cosmic-swingset-commit-start and cosmic-swingset-commit-finish events, assert there is a block context and report normally.
For the cosmic-swingset-after-commit-stats event, report the event then unset the block context.
For the cosmic-swingset-bootstrap-block-start event, create a new block context formed as the following:
block.height: ${blockHeight || 0}
block.time: ${blockTime}
Create a trigger context formed as the following, but do not persist it:
run.id: bootstrap-${blockTime}
run.trigger.type: bootstrap
run.trigger.time: ${blockTime}
For the cosmic-swingset-bootstrap-block-finish event, report the event then unset the block context (the trigger context should already have been unset by the run-finish event).
For the cosmic-swingset-bridge-inbound or cosmic-swingset-deliver-inbound events, create and persist a trigger context formed as the following:
run.id: bridge-${inboundNum} or deliver-${inboundNum}
For the cosmic-swingset-run-start event, if there is no trigger context, and the runNum field is not 0, we assume the trigger is a timer poll. This is because there is no explicit slog event for the timer poll. There is unfortunately another trigger for a run without explicit slog event for the trigger: INSTALL_BUNDLE, but thankfully that event does not currently cause any activity during that run. Any duplication of trigger info would thus be benign (See #10332 to add explicit slog events of both timer poll and install bundle).
If there is no trigger context, and runNum !== 0, create and persist a trigger context formed as the following:
run.id: timer-${blockHeight}
run.trigger.type: timer
run.trigger.time: ${blockTime}
For every cosmic-swingset-run-start event, add the following field to an existing trigger context, or create (but do not persist) a trigger context with the following:
run.num: ${runNum}.
For the cosmic-swingset-run-finish event, report with the current contexts then unset the trigger context. Do not remove any persisted context.
For the crank-start event, create a crank context formed as the following:
crank.num: ${crankNum}
crank.type: ${crankType}
For the crank-result event, report with the current contexts then unset the crank context.
For the clist event, update the crank context with the following (assert the context exists):
crank.vatID: ${vatID}
For the start-replay event, create a replay context with replay: true. Unset the replay context after the finish-replay event was reported.
For the deliver event, update the replay context if it exists (assert that the slog event replay field is true), or the crank context otherwise, with the following:
crank.deliveryNum: ${deliveryNum}`
crank.vatID: ${vatID}
Remove the replay and deliveryNum fields of the slog event and report.
For the deliver-result event, remove the replay and deliveryNum fields and report. There is no need to unset the deliveryNum from the context.
For the syscall and syscall-result events:
remove the replay and deliveryNum fields
extract (and remove) the syscallNum field into a crank.syscallNum attribute
For the console event:
remove the crankNum and deliveryNum fields
Security Considerations
None, this is log reporting
Scaling Considerations
While better for consistency, it's not strictly necessary to flush the persisted trigger context to disk, as logging is not a critical operation. Furthermore when restoring the trigger context on end-block start, the implementation should avoid reading the persisted context from disk when it previously wrote it out.
Test Plan
An integration test is likely sufficient. We possibly want to enable this reporting in a3p to verify that the context building logic does not fail or get out of sync. Like for slog-to-otel, consider enabling the env option to fail the chain if the slog sender errors out.
Verify in a test instagoric the shape and ability to query logs.
Upgrade Considerations
The easiest is to deploy as a chain software update, but since slog senders are configurable, it could also be cherry-picked or even independently installed through a new NPM telemetry package, and configured on an existing deployment.
The text was updated successfully, but these errors were encountered:
closes: #10269
## Description
Adds a slog sender which will build various contexts along the way and report them along with the slogs for better logs querying and identification
### Security Considerations
None
### Scaling Considerations
This uses a json file storage
### Documentation Considerations
This is a new slogger which can be opted into
### Testing Considerations
This will be deployed on testnets (already deployed on one of the testnets and log link is added in a comment below)
### Upgrade Considerations
This can be configured on existing deployments by bumping the telemetry package
What is the Problem Being Solved?
Our slog is a stream of structured events generated by the swingset kernel and the cosmic-swingset host. Most events actually come in pairs, with other events potentially temporally contained within them. These surrounding events create a context for the nested events, but for terseness and layering reasons, the data from this context is not always repeated within the nested events. When streaming slog entries as-is, this lack of context data on individual events makes it very difficult to query for related events.
Slog events are processed by "slog senders". The most basic one just writes the raw events to a jsonl file. We also have a
slog-to-otel
converter, but it attempts to create telemetry traces, which are great to represent the causality links in our execution model, but are too complex for simple querying of events. While the current implementation of that sender is due for a revamp (I have some local branches attempting this, and a short description of the design at https://github.com/Agoric/agoric-sdk/blob/mhofman/5724-otel-refactor/packages/telemetry/notes.txt), a more basic contextualizing log sender would enable more simple querying.Details of the slog entries
While slog events come mostly in pair, the context info we're interested in is not purely nested.
From the JS point of view, the execution is nested as follow:
From the host point of view, we see the following events:
A run contains a series of events described above detailing JS execution. If the cumulative amount of execution passes a threshold, the current run is suspended, and the end block completes. As such the events related to a JS execution to quiescence of swingset may be spread over multiple blocks. Also a cosmos message received in a given block may only execute in a later block.
We're interested in the ability to query events by both block and trigger events
Description of the Design
A new slog sender building a context based on previously seen slog events, which streams these augmented events to a logging service which supports querying.
The context can be mostly ephemeral since a restart of the process will always start by processing a block from its beginning (after init). However the exception to this is the context of the trigger event, as the run 0 may continue execution that was started in a previous block. As such the latest trigger event context needs to be persisted (slog senders do not have access to the host DB, and need to arrange their own persistence).
We likely shouldn't tie the implementation to a specific log event receiver, and can likely leverage the otel logging libraries.
Proposed details of slog event context builder
At any point there is a set of potential current context objects: init, block, trigger, crank, replay
These contexts merged together should be used as the base object for the attributes of the log record. Each slog events may have some fields extracted (removed) and instead included as extra log record attributes. The rest will be included as the body of the log record. Unless otherwise noted, the slog event is reported after and thus using the contexts that may have been updated using the event's data.
For all events, the
time
field is extracted from the slog event and used as timestamp of the log record. Themonotime
field is also extracted and used as aprocess.uptime
attribute.For the
kernel-init-start
event, build an init context withinit
:true
, so that the nested events can easily be filtered out if necessary. Unset the init context after thekernel-init-finish
event was reported.For the
cosmic-swingset-begin-block
event, create a new block context formed as the following:block.height
:${blockHeight}
block.time
:${blockTime}
Leave
blockHeight
andblockTime
as body fields of the log record.For the
cosmic-swingset-end-block-start
event, assert there is block context. LeaveblockHeight
andblockTime
as body fields of the log record.Restore any persisted trigger context immediately after reporting the end-block-start so it's available for run 0 events.
For the
cosmic-swingset-end-block-finish
,cosmic-swingset-commit-start
andcosmic-swingset-commit-finish
events, assert there is a block context and report normally.For the
cosmic-swingset-after-commit-stats
event, report the event then unset the block context.For the
cosmic-swingset-bootstrap-block-start
event, create a new block context formed as the following:block.height
:${blockHeight || 0}
block.time
:${blockTime}
Create a trigger context formed as the following, but do not persist it:
run.id
:bootstrap-${blockTime}
run.trigger.type
:bootstrap
run.trigger.time
:${blockTime}
For the
cosmic-swingset-bootstrap-block-finish
event, report the event then unset the block context (the trigger context should already have been unset by the run-finish event).For the
cosmic-swingset-bridge-inbound
orcosmic-swingset-deliver-inbound
events, create and persist a trigger context formed as the following:run.id
:bridge-${inboundNum}
ordeliver-${inboundNum}
run.trigger.type
:bridge
ordeliver
run.trigger.source
:${source}
(for bridge)run.trigger.sender
:${sender}
(for deliver)run.trigger.blockHeight
:${inboundNum.split('-')[0]}
run.trigger.txHash
:${inboundNum.split('-')[1]}
run.trigger.msgIdx
:${inboundNum.split('-')[2]}
For the
cosmic-swingset-run-start
event, if there is no trigger context, and therunNum
field is not0
, we assume the trigger is a timer poll. This is because there is no explicit slog event for the timer poll. There is unfortunately another trigger for a run without explicit slog event for the trigger: INSTALL_BUNDLE, but thankfully that event does not currently cause any activity during that run. Any duplication of trigger info would thus be benign (See #10332 to add explicit slog events of both timer poll and install bundle).If there is no trigger context, and
runNum !== 0
, create and persist a trigger context formed as the following:run.id
:timer-${blockHeight}
run.trigger.type
:timer
run.trigger.time
:${blockTime}
For every
cosmic-swingset-run-start
event, add the following field to an existing trigger context, or create (but do not persist) a trigger context with the following:run.num
:${runNum}
.For the
cosmic-swingset-run-finish
event, report with the current contexts then unset the trigger context. Do not remove any persisted context.For the
crank-start
event, create a crank context formed as the following:crank.num
:${crankNum}
crank.type
:${crankType}
For the
crank-result
event, report with the current contexts then unset the crank context.For the
clist
event, update the crank context with the following (assert the context exists):crank.vatID
:${vatID}
For the
start-replay
event, create a replay context withreplay
:true
. Unset the replay context after thefinish-replay
event was reported.For the
deliver
event, update the replay context if it exists (assert that the slog eventreplay
field istrue
), or the crank context otherwise, with the following:crank.deliveryNum
: ${deliveryNum}`crank.vatID
:${vatID}
Remove the
replay
anddeliveryNum
fields of the slog event and report.For the
deliver-result
event, remove thereplay
anddeliveryNum
fields and report. There is no need to unset the deliveryNum from the context.For the
syscall
andsyscall-result
events:replay
anddeliveryNum
fieldssyscallNum
field into acrank.syscallNum
attributeFor the
console
event:crankNum
anddeliveryNum
fieldsSecurity Considerations
None, this is log reporting
Scaling Considerations
While better for consistency, it's not strictly necessary to flush the persisted trigger context to disk, as logging is not a critical operation. Furthermore when restoring the trigger context on end-block start, the implementation should avoid reading the persisted context from disk when it previously wrote it out.
Test Plan
An integration test is likely sufficient. We possibly want to enable this reporting in a3p to verify that the context building logic does not fail or get out of sync. Like for
slog-to-otel
, consider enabling the env option to fail the chain if the slog sender errors out.Verify in a test instagoric the shape and ability to query logs.
Upgrade Considerations
The easiest is to deploy as a chain software update, but since slog senders are configurable, it could also be cherry-picked or even independently installed through a new NPM telemetry package, and configured on an existing deployment.
The text was updated successfully, but these errors were encountered: