context building slog sender #10269

mhofman · 2024-10-13T06:51:08Z

What is the Problem Being Solved?

Our slog is a stream of structured events generated by the swingset kernel and the cosmic-swingset host. Most events actually come in pairs, with other events potentially temporally contained within them. These surrounding events create a context for the nested events, but for terseness and layering reasons, the data from this context is not always repeated within the nested events. When streaming slog entries as-is, this lack of context data on individual events makes it very difficult to query for related events.

Slog events are processed by "slog senders". The most basic one just writes the raw events to a jsonl file. We also have a slog-to-otel converter, but it attempts to create telemetry traces, which are great to represent the causality links in our execution model, but are too complex for simple querying of events. While the current implementation of that sender is due for a revamp (I have some local branches attempting this, and a short description of the design at https://github.com/Agoric/agoric-sdk/blob/mhofman/5724-otel-refactor/packages/telemetry/notes.txt), a more basic contextualizing log sender would enable more simple querying.

Details of the slog entries

While slog events come mostly in pair, the context info we're interested in is not purely nested.

From the JS point of view, the execution is nested as follow:

External event injected into swingset run queue (e.g. timer poll or cosmos message)
Series of cranks, some resulting in deliveries to vats
A delivery can contain a series of syscalls or console events

From the host point of view, we see the following events:

kernel init (along with a bunch of nested events), one time, followed by a repeating series of block events:
begin block (single event)
end block, containing a series of run:
- run 0 is always the continuation of any previous unfinished work
- some runs for high priority messages
- timer poll run
- some runs for low priority messages
commit block

A run contains a series of events described above detailing JS execution. If the cumulative amount of execution passes a threshold, the current run is suspended, and the end block completes. As such the events related to a JS execution to quiescence of swingset may be spread over multiple blocks. Also a cosmos message received in a given block may only execute in a later block.

We're interested in the ability to query events by both block and trigger events

Description of the Design

A new slog sender building a context based on previously seen slog events, which streams these augmented events to a logging service which supports querying.

The context can be mostly ephemeral since a restart of the process will always start by processing a block from its beginning (after init). However the exception to this is the context of the trigger event, as the run 0 may continue execution that was started in a previous block. As such the latest trigger event context needs to be persisted (slog senders do not have access to the host DB, and need to arrange their own persistence).

We likely shouldn't tie the implementation to a specific log event receiver, and can likely leverage the otel logging libraries.

Proposed details of slog event context builder

At any point there is a set of potential current context objects: init, block, trigger, crank, replay

These contexts merged together should be used as the base object for the attributes of the log record. Each slog events may have some fields extracted (removed) and instead included as extra log record attributes. The rest will be included as the body of the log record. Unless otherwise noted, the slog event is reported after and thus using the contexts that may have been updated using the event's data.

For all events, the time field is extracted from the slog event and used as timestamp of the log record. The monotime field is also extracted and used as a process.uptime attribute.

For the kernel-init-start event, build an init context with init: true, so that the nested events can easily be filtered out if necessary. Unset the init context after the kernel-init-finish event was reported.

For the cosmic-swingset-begin-block event, create a new block context formed as the following:

block.height: ${blockHeight}
block.time: ${blockTime}

Leave blockHeight and blockTime as body fields of the log record.

For the cosmic-swingset-end-block-start event, assert there is block context. Leave blockHeight and blockTime as body fields of the log record.
Restore any persisted trigger context immediately after reporting the end-block-start so it's available for run 0 events.

For the cosmic-swingset-end-block-finish, cosmic-swingset-commit-start and cosmic-swingset-commit-finish events, assert there is a block context and report normally.
For the cosmic-swingset-after-commit-stats event, report the event then unset the block context.

For the cosmic-swingset-bootstrap-block-start event, create a new block context formed as the following:

block.height: ${blockHeight || 0}
block.time: ${blockTime}

Create a trigger context formed as the following, but do not persist it:

run.id: bootstrap-${blockTime}
run.trigger.type: bootstrap
run.trigger.time: ${blockTime}

For the cosmic-swingset-bootstrap-block-finish event, report the event then unset the block context (the trigger context should already have been unset by the run-finish event).

For the cosmic-swingset-bridge-inbound or cosmic-swingset-deliver-inbound events, create and persist a trigger context formed as the following:

run.id: bridge-${inboundNum} or deliver-${inboundNum}
run.trigger.type: bridge or deliver
run.trigger.source: ${source} (for bridge)
run.trigger.sender: ${sender} (for deliver)
run.trigger.blockHeight: ${inboundNum.split('-')[0]}
run.trigger.txHash: ${inboundNum.split('-')[1]}
run.trigger.msgIdx: ${inboundNum.split('-')[2]}

For the cosmic-swingset-run-start event, if there is no trigger context, and the runNum field is not 0, we assume the trigger is a timer poll. This is because there is no explicit slog event for the timer poll. There is unfortunately another trigger for a run without explicit slog event for the trigger: INSTALL_BUNDLE, but thankfully that event does not currently cause any activity during that run. Any duplication of trigger info would thus be benign (See #10332 to add explicit slog events of both timer poll and install bundle).
If there is no trigger context, and runNum !== 0, create and persist a trigger context formed as the following:

run.id: timer-${blockHeight}
run.trigger.type: timer
run.trigger.time: ${blockTime}

For every cosmic-swingset-run-start event, add the following field to an existing trigger context, or create (but do not persist) a trigger context with the following:

run.num: ${runNum}.

For the cosmic-swingset-run-finish event, report with the current contexts then unset the trigger context. Do not remove any persisted context.

For the crank-start event, create a crank context formed as the following:

crank.num: ${crankNum}
crank.type: ${crankType}

For the crank-result event, report with the current contexts then unset the crank context.

For the clist event, update the crank context with the following (assert the context exists):

crank.vatID: ${vatID}

For the start-replay event, create a replay context with replay: true. Unset the replay context after the finish-replay event was reported.

For the deliver event, update the replay context if it exists (assert that the slog event replay field is true), or the crank context otherwise, with the following:

crank.deliveryNum: ${deliveryNum}`
crank.vatID: ${vatID}

Remove the replay and deliveryNum fields of the slog event and report.
For the deliver-result event, remove the replay and deliveryNum fields and report. There is no need to unset the deliveryNum from the context.

For the syscall and syscall-result events:

remove the replay and deliveryNum fields
extract (and remove) the syscallNum field into a crank.syscallNum attribute

For the console event:

remove the crankNum and deliveryNum fields

Security Considerations

None, this is log reporting

Scaling Considerations

While better for consistency, it's not strictly necessary to flush the persisted trigger context to disk, as logging is not a critical operation. Furthermore when restoring the trigger context on end-block start, the implementation should avoid reading the persisted context from disk when it previously wrote it out.

Test Plan

An integration test is likely sufficient. We possibly want to enable this reporting in a3p to verify that the context building logic does not fail or get out of sync. Like for slog-to-otel, consider enabling the env option to fail the chain if the slog sender errors out.

Verify in a test instagoric the shape and ability to query logs.

Upgrade Considerations

The easiest is to deploy as a chain software update, but since slog senders are configurable, it could also be cherry-picked or even independently installed through a new NPM telemetry package, and configured on an existing deployment.

The text was updated successfully, but these errors were encountered:

closes: #10269 ## Description Adds a slog sender which will build various contexts along the way and report them along with the slogs for better logs querying and identification ### Security Considerations None ### Scaling Considerations This uses a json file storage ### Documentation Considerations This is a new slogger which can be opted into ### Testing Considerations This will be deployed on testnets (already deployed on one of the testnets and log link is added in a comment below) ### Upgrade Considerations This can be configured on existing deployments by bumping the telemetry package

mhofman added enhancement New feature or request telemetry labels Oct 13, 2024

toliaqat assigned usmanmani1122 Oct 16, 2024

usmanmani1122 mentioned this issue Oct 20, 2024

Context building slog sender #10300

Merged

mhofman mentioned this issue Oct 23, 2024

Add trigger slog events for timer and install bundles #10332

Closed

mergify bot closed this as completed in #10300 Oct 29, 2024

mergify bot closed this as completed in acbd3ae Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

context building slog sender #10269

context building slog sender #10269

mhofman commented Oct 13, 2024 •

edited

Loading

context building slog sender #10269

context building slog sender #10269

Comments

mhofman commented Oct 13, 2024 • edited Loading

What is the Problem Being Solved?

Details of the slog entries

Description of the Design

Proposed details of slog event context builder

Security Considerations

Scaling Considerations

Test Plan

Upgrade Considerations

mhofman commented Oct 13, 2024 •

edited

Loading