-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cosmos-sdk vat snapshot/transcript retention configuration #10032
Conversation
slogfile = "{{ .Swingset.SlogFile }}" | ||
|
||
# The maximum number of vats that the SwingSet kernel will bring online. A lower number | ||
# requires less memory but may have a negative performance impact if vats need to | ||
# be frequently paged out to remain under this limit. | ||
max_vats_online = {{ .Swingset.MaxVatsOnline }} | ||
max-vats-online = {{ .Swingset.MaxVatsOnline }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@siarhei-agoric @mhofman This aligns with existing cosmos-sdk conventions; cf. https://github.com/cosmos/cosmos-sdk/blob/68cb1635a0a0c7b3bd923ec9a51d0530e47d8040/server/config/config.toml.tpl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh it's tendermint that uses underscore !
// Transcripts are broken up into "spans", delimited by heap snapshots. | ||
// For every vatID, there will be exactly one current span with isCurrent=1, | ||
// and zero or more non-current (historical) spans with isCurrent=null. | ||
// If we take a heap snapshot after the first hundred deliveries and again | ||
// after the second hundred (i.e., after zero-indexed deliveries 99 and 199), | ||
// and have not yet performed a delivery after the second snapshot, we'll have | ||
// two historical spans (one with startPos=0 and endPos=100, the second with | ||
// startPos=100 and endPos=200) and a single empty current span with | ||
// startPos=200 and endPos=200. After we perform the next delivery, the | ||
// single current span will still have startPos=200 but will now have | ||
// endPos=201. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@warner I updated this to avoid mixing zero-indexed and one-indexed values; please look it over for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's correct. Notes:
- "exactly one current span with isCurrent=1" admits the thought that maybe some "current spans" don't have isCurrent=1. Maybe say "exactly one current span (i.e. isCurrent=1)" ? Ditto for the second clause.
- Yep, my previous wording was confused, if the first span included a delivery=100, then it should have had endPos=101, because we increment the span record's
.endPos
just after adding the item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually ever have empty spans? I though the first entry of a span was to include a "load snapshot" entry. Maybe the first span can be empty, but I think that has a load entry as well?
I believe that means pos
are not strictly equivalent to deliveries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is best reviewed while ignoring whitespace changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
Should we add a "replay" option for consistency? I suppose that without a plan for deletion, that could be implemented the same as "archival" for now. |
One test would be to bootstrap a node with "operational" mode, and verify that you cannot export the swing-store with "replay" or "archival" mode (previous to the config change you could) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I don't know Go well enough to approve that side (please wait for @mhofman 's r+), but the JS side looks ok.
// Transcripts are broken up into "spans", delimited by heap snapshots. | ||
// For every vatID, there will be exactly one current span with isCurrent=1, | ||
// and zero or more non-current (historical) spans with isCurrent=null. | ||
// If we take a heap snapshot after the first hundred deliveries and again | ||
// after the second hundred (i.e., after zero-indexed deliveries 99 and 199), | ||
// and have not yet performed a delivery after the second snapshot, we'll have | ||
// two historical spans (one with startPos=0 and endPos=100, the second with | ||
// startPos=100 and endPos=200) and a single empty current span with | ||
// startPos=200 and endPos=200. After we perform the next delivery, the | ||
// single current span will still have startPos=200 but will now have | ||
// endPos=201. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that's correct. Notes:
- "exactly one current span with isCurrent=1" admits the thought that maybe some "current spans" don't have isCurrent=1. Maybe say "exactly one current span (i.e. isCurrent=1)" ? Ditto for the second clause.
- Yep, my previous wording was confused, if the first span included a delivery=100, then it should have had endPos=101, because we increment the span record's
.endPos
just after adding the item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is good. Some minor feedback, feel free to merge once addressed.
# If relative, it is interpreted against the application home directory | ||
# (e.g., ~/.agoric). | ||
# May be overridden by a SLOGFILE environment variable, which if relative is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overridden? I thought the config took precedence if present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, CLI option > environment variable > config file: https://github.com/spf13/viper?tab=readme-ov-file#why-viper
Viper uses the following precedence order. Each item takes precedence over the item below it:
- explicit call to Set
- flag
- env
- config
- key/value store
- default
golang/cosmos/x/swingset/config.go
Outdated
# * "default": determined by ` + "`pruning`" + ` ("archival" if ` + "`pruning`" + ` is | ||
# "nothing", otherwise "operational") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: this is sufficient as even though a pruning strategy of "nothing" internally sets KeepRecent
to 0
, that value is in fact not a valid value otherwise (2 is the minimum enforced)
// See CustomAppConfig in ../../daemon/cmd/root.go. | ||
type ExtendedConfig struct { | ||
serverconfig.Config `mapstructure:",squash"` | ||
Swingset SwingsetConfig `mapstructure:"swingset"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yikes I missed the missing mapstructure
in the last PR review, how did it even work, does it default to a lowercasing of the struct property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think so.
}); | ||
const { transcriptStore } = kernelStorage; | ||
const { commit, close } = hostStorage; | ||
const testTranscriptStore = test.macro({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we should, and if there's an easy way, test that older historical spans are not removed when disabling keepTranscripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'd need to import data that already includes a historical span, which is possible but not trivial. Given that, it didn't seem worth it because I think we don't actually care that they are left around.
// last snapshot of their vat) | ||
// * "default": determined by `pruning` ("archival" if `pruning` is | ||
// "nothing", otherwise "operational") | ||
VatTranscriptRetention string `mapstructure:"vat-transcript-retention" json:"vatTranscriptRetention,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure omitempty
makes sense here. I'd argue this ends up required at the JS/golang interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I'd rather not send an empty string only for this field if that ever comes up. If it truly were required, we'd enforce that on the JS side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how it could ever be empty given the resolution logic
swingsetConfig; | ||
const keepSnapshots = | ||
vatSnapshotRetention === 'debug' || | ||
(!vatSnapshotRetention && ['1', 'true'].includes(XSNAP_KEEP_SNAPSHOTS)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how I feel about the env variable "overriding" an explicit "operational"
retention in the config. I think that's fine, but maybe a comment here making the intention explicit would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That actually wouldn't happen, because of the !vatSnapshotRetention
guard. But at any rate, I've updated both keepSnapshots
and keepTranscripts
to more clearly apply fallback logic only when the config value is falsy.
Deploying agoric-sdk with Cloudflare Pages
|
9bd39ae
to
f61c1c1
Compare
The preferred separator is "-" rather than "_", cf. https://github.com/cosmos/cosmos-sdk/blob/a57b25418a594ae023274673f07b72611ccd2744/server/config/config.toml.tpl
…script span Fixes #9387
…to AG_COSMOS_INIT Fixes #9386
f61c1c1
to
8b3a6d4
Compare
…lid value error message Ref #10032
…lid value error message (#10081) Ref #10032 ## Description ```diff -value for vat-transcript-retention must be in ["archival" "operational"] +value for vat-transcript-retention must be in ["archival" "operational" "default"] ``` ### Security Considerations n/a ### Scaling Considerations n/a ### Documentation Considerations n/a ### Testing Considerations n/a ### Upgrade Considerations n/a
Ref #9174
Fixes #9387
Fixes #9386
TODO:
Description
Adds consensus-independent
vat-snapshot-retention
("debug" vs. "operational") andvat-transcript-retention
("archival" vs. "operational" vs. "default") cosmos-sdk swingset configuration (values chosen to correspond withartifactMode
) for propagation in AG_COSMOS_INIT. The former defaults to "operational" and the latter defaults to "default", which infers a value from cosmos-sdkpruning
to allow simple configuration of archiving nodes.It also updates the semantics of TranscriptStore
keepTranscripts: false
configuration to remove items from only the previously-current span rather than from all previous spans when rolling over (to avoid expensive database churn). Removal of older items can be accomplished by reloading from an export that does not include them.Security Considerations
I don't think this changes any relevant security posture.
Scaling Considerations
This will reduce the SQLite disk usage for any node that is not explicitly configured to retain snapshots and/or transcripts. The latter in particular is expected to have significant benefits for mainnet (as noted in #9174, about 116 GB ÷ 147 GB ≈ 79% of the database on 2024-03-29 was vat transcript items).
Documentation Considerations
The new fields are documented in our default TOML template, and captured in a JSDoc type on the JavaScript side.
Testing Considerations
This PR extends coverage TranscriptStore to include
keepTranscripts
true vs. false, but I don't see a good way to cover Go→JS propagation other than manually (which I have done). It should be possible to add testing for the use and validation ofresolvedConfig
in AG_COSMOS_INIT handling, but IMO that is best saved for after completion of split-brain (to avoid issues with same-process Go–JS entanglement).Upgrade Considerations
This is all kernel code that can be used at any node restart (i.e., because the configuration is consensus-independent, it doesn't even need to wait for a chain software upgrade). But we should mention the new cosmos-sdk configuration in release notes, because it won't be added to existing app.toml files already in use.