design "snapstore" API: immutable hash-named XS snapshot files #2273
Comments
Should I worry about how much nodejs-specific code we're building on? Or is the kernel more likely to migrate to rust or go than XS so that I shouldn't worry about it?
- We don't want xsnap to grow a … We seem to be using tmp already.
- using zlib in the nodejs API?
- We seem to have a hasha dependency, which defaults to …
I would like to keep our options open here, so please do worry about it. Or at least meta-worry -- keep track of the things we would need to worry about. Thanks.
a prototype of compressed snapshots was pretty straightforward: https://github.com/Agoric/agoric-sdk/tree/2273-snapstore 299194e
rough notes in preparation for discussion with @warner ... As of #2370, we have a snapstore API, but it's not integrated with the kernel DB. The current tests are:
I made a local tweak to align them with the design sketch a little better.
✔ create XS Machine, snapshot (417 Kb), compress to 0.1x
✔ build temp file; compress to cache file
Temporary files are created in the pool directory; this is in part to avoid the possibility of the temp directory being on a different filesystem than the pool directory, which would prevent atomic renaming (sketched below).
✔ create SES worker, save, restore, resume (207ms)
✔ build temp file; compress to cache file
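A minimal sketch of that same-filesystem pattern, using only Node built-ins; the function name and pool layout are illustrative, not the actual snapStore code:

```js
import fs from 'fs';
import path from 'path';
import crypto from 'crypto';

// Write data to a temp file *inside* the pool directory, then rename it into
// place. rename(2) is atomic only within a single filesystem, which is why the
// temp file must not live in /tmp or some other mount.
function writeAtomically(poolDir, finalName, data) {
  const tmpName = path.join(poolDir, `tmp-${crypto.randomBytes(8).toString('hex')}`);
  fs.writeFileSync(tmpName, data);
  fs.renameSync(tmpName, path.join(poolDir, finalName));
}
```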
Transcript Suffix Replay

Before snapshots, reloading a vat meant replaying all of its transcript. The entire goal of using snapshots is to avoid the cost of this complete replay. We need to keep the full transcript around for other reasons (it's our only hope for certain kinds of upgrade), but we don't want to replay the whole thing each time a vat is paged in. On the other hand, we don't want to have to take a snapshot after every single delivery either. There's some sort of now-vs-later performance tradeoff to be made, and the VatWarehouse should be in control.

Taking a snapshot requires a certain amount of time, and grows the disk footprint by some amount (less if/when we figure out a way to delete unused snapshots). But simply appending to the transcript and not updating the snapshot increases the time it will take to replay the next time. The ideal case would be to take a snapshot just before the vat is paged out, but the whole process could be killed at any moment. It's the same kind of question that journal-based filesystems must face, and we should probably follow their lead. That might mean writing a snapshot every N cranks, and/or when the VatWarehouse decides to push the vat out of RAM in favor of some more active vat.

When restoring a vat ("paging it in" / "bringing it online"), the kernel needs to load a snapshot, and then replay the suffix of the transcript: just the deliveries that happened (and were committed) after the snapshot was taken. Rather than delete transcript entries when we write the snapshot, we just update a starting pointer, which is just the length of the transcript at the time of the snapshot. So for each vat, the kernel will remember:
We must record the ending position in the key-value store because of the atomicity requirements described below.

Transcript/Snapshot Atomicity Management

@FUDCo's "streamStore" (#3065) provides an API for managing transcripts as a linear sequence of entries, with commit semantics that meet our needs for atomic transactions. The tricky part is that, like the snapstore, we're saving data outside of the real database (in ordinary files), but we want it to look like it's getting committed at the same time.

The threat is that the kernel writes a transcript entry or snapshot, writes some associated changes to the DB, and is about to tell the DB to commit, when the entire computer loses power. When it wakes back up, it will see the DB in the old state, but the files in the new state. The host might take a different path this time than it did last time (it didn't commit to anything the last time, it's free to do whatever it likes). Any data that gets recorded, but isn't supposed to be there (because it doesn't match the committed DB state), is like an echo of an alternate timeline, and we need the outside-the-DB storage API to prevent these "future echoes" from appearing during the next restart.

That's why the snapstore returns a snapshot ID when writing, and the kernel (VatWarehouse) is responsible for storing this ID in the DB. If the process (or entire computer) crashes in between writing the snapshot and committing the DB, the new process will observe the old snapshot ID in the DB, and ignore the new non-committed one entirely.

To maintain this, we need the HostStorage instance (which includes the kernel DB key-value store, the snapstore, and the streamstore), plus the code which uses it, to give us the following properties:
StreamStore API

The streamStore API is like:
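As an illustration only, an API along these lines would fit the description; the method names and shapes here are assumptions, not necessarily what #3065 actually ships:

```js
// Illustrative shape only; see #3065 for the real streamStore.
const streamStore = {
  // Append `item` at `position` in the named stream; returns the position of
  // the next item, which the caller should treat as the new end of the stream.
  writeStreamItem(streamName, item, position) { /* ... */ },
  // Iterate the items between two previously recorded positions.
  readStream(streamName, startPosition, endPosition) { /* ... */ },
  // Flush and release any underlying file handles.
  closeStream(streamName) { /* ... */ },
};
```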
Every time a delivery is made, the new transcript entry is added by doing:
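A sketch of that per-delivery step, assuming the illustrative streamStore shape above and a hypothetical `${vatID}.t.endPosition` key (the real key layout may differ):

```js
// Append the entry to the vat's transcript stream, then record the new end
// position in the kvStore so it commits (or not) together with the crank.
function addTranscriptEntry(kvStore, streamStore, vatID, entry) {
  const key = `${vatID}.t.endPosition`;
  const oldEnd = JSON.parse(kvStore.get(key) || '0');
  const newEnd = streamStore.writeStreamItem(
    `transcript-${vatID}`,
    JSON.stringify(entry),
    oldEnd,
  );
  kvStore.set(key, JSON.stringify(newEnd));
}
```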
This ensures that the kvStore (LMDB) commits the right position, even though the underlying linear file may have extra junk at the end (future echoes).

VatWarehouse behavior

When the VatWarehouse needs to page in a vat, it should:
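In rough terms: load the last committed snapshot (if any), then replay only the transcript entries recorded after it. A sketch under the same assumptions as above, with `manager.loadSnapshot` and `manager.deliver` standing in for whatever the vat-manager API ends up being:

```js
// Page a vat in: start from the committed snapshot (if there is one), then
// replay only the transcript suffix recorded after that snapshot was taken.
async function pageIn(vatID, { kvStore, snapStore, streamStore }, manager) {
  const lastSnapshot = JSON.parse(kvStore.get(`${vatID}.lastSnapshot`) || 'null');
  let startPos = 0;
  if (lastSnapshot) {
    await manager.loadSnapshot(snapStore, lastSnapshot.snapshotID);
    startPos = lastSnapshot.startPos; // transcript length at snapshot time
  }
  const endPos = JSON.parse(kvStore.get(`${vatID}.t.endPosition`) || '0');
  for (const item of streamStore.readStream(`transcript-${vatID}`, startPos, endPos)) {
    await manager.deliver(JSON.parse(item)); // replay; results are ignored
  }
}
```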
When the VatWarehouse makes a delivery to the vat, the VatWarehouse and the transcript-writing code in …
When the VatWarehouse decides to make a snapshot, it should:
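Again a sketch under the same assumptions: the snapshot file is written outside the DB, while its ID and the current transcript position go into the kvStore so they commit atomically with the rest of the crank state:

```js
// Take a snapshot, then record its ID and the transcript position in the
// kvStore. If the process dies before the DB commit, the orphaned snapshot
// file is simply never referenced and can be garbage-collected later.
async function saveSnapshot(vatID, { kvStore, snapStore }, manager) {
  const snapshotID = await manager.makeSnapshot(snapStore); // writes the file
  const startPos = JSON.parse(kvStore.get(`${vatID}.t.endPosition`) || '0');
  kvStore.set(`${vatID}.lastSnapshot`, JSON.stringify({ snapshotID, startPos }));
}
```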
currently, the VatWarehouse doesn't know anything about one managerType vs. another. Launching the xsnap process is done by the VatManager on creation, which happens inside vatLoad.js. Are you suggesting I should change all the guts of all the VatManagers and vatLoad.js around somehow? Care to elaborate?
currently, the VatWarehouse delegates transcript replay to the vat manager (which gets an implementation from manager-helper.js)
We'll change the transcript replay code from (read everything from the transcript and deliver each one) to (read just the entries recorded after the last snapshot and deliver each one).

@FUDCo is likely to be implementing the transcript-manager changes, but his work is somewhat blocked on my review of the streamStore implementation (#3160), which I hope to get done today. You two will have to coordinate on the vatManager API and who calls what.
Restoring vat from snapshot works in one case

The good news:
@warner suggested:
Is this ticket for the whole "save/load snapshots" feature, or just the DB-ish thing that stores them? If the latter, can we close it?
Good question; I have been struggling with the scope here. #2273 (comment) is an attempt at a checklist from the design description. #2370 added the DB-ish thing but didn't cover things such as kernelDB integration, so I didn't close this issue. Perhaps I should have split the issue when we closed the PR?

I also struggle with the "Design ... API ..." title pattern, which suggests designing the API separately from implementing it. I might do better with "Prototype ..." since usually writing code is the way I explore the design space. Sometimes I can do separate design, but certainly in this case, it was only by digging into the kernel code that I had any idea what I was doing. And the issue description laid out a fairly thorough design.

There does come a point, after doing some hack-and-slash style coding, when a clearer design emerges and the work shifts to producing production code. At that point, it might be straightforward to close the "Prototype ..." issue and open one or more enhancement issues.
This enhances SwingSet to have a "Vat Warehouse" which limits the number of "paged-in" vats to some maximum (currently 50). The idea is to conserve system RAM by allowing idle vats to remain "paged-out", which consumes only space on disk, until someone sends a message to them. The vat is then paged in, by creating a new xsnap process and reloading the necessary vat state. This reload process is greatly accelerated by loading a heap snapshot, if one is available. We only need to replay the suffix of the transcript that was recorded after the snapshot was taken, rather than the full (huge) transcript. Heap snapshots are stored in a new swingstore component named the "snap store".

For each vat, the warehouse saves a heap snapshot after a configurable number of deliveries (default 200). In addition, it saves an initial snapshot after just a few deliveries (default 2), because all contract vats start out with a large delivery that provides the contract bundle to evaluate. By taking a snapshot quickly, we can avoid the time needed to re-evaluate that large bundle on almost all process restarts. This algorithm is a best guess: we'll refine it as we gather more data about the tradeoff between work now (the time it takes to create and write a snapshot), the storage space consumed by those snapshots, and work later (replaying more transcript). We're estimating that a typical contract snapshot consumes about 300kB (compressed).

closes #2273
closes #2277
refs #2422
refs #2138 (might close it)

* refactor(replay): hoist handle declaration
* chore(xsnap): clarify names of snapStore temp files for debugging
* feat(swingset): initializeSwingset snapshots XS supervisor
  - solo: add xsnap, tmp dependencies
  - cosmic-swingset: declare dependencies on xsnap, tmp
  - snapshotSupervisor()
  - vk.saveSnapshot(), vk.getLastSnapshot()
  - test: mock vatKeeper needs getLastSnapshot()
  - test(snapstore): update snapshot hash
  - makeSnapstore in solo, cosmic-swingset
  - chore(solo): create xs-snapshots directory
  - more getVatKeeper -> provideVatKeeper
  - startPos arg for replayTransript()
  - typecheck shows vatAdminRootKref could be missing
  - test pre-SES snapshot size
  - hoist snapSize to test title
  - clarify SES vs. pre-SES XS workers
  - factor bootWorker out of bootSESWorker
  - hoist Kb, relativeSize for sharing between tests
  - misc:
    - WIP: restore from snapshot
    - hard-code remote style
* fix(swingset): don't leak xs-worker in initializeSwingset
  When taking a snapshot of the supervisor in initializeSwingset, we neglected to `.close()` it. Lack of a name hindered diagnosis, so let's fix that while we're at it.
* feat(swingset): save snapshot periodically after deliveries
  - vk.saveSnapShot() handles snapshotInterval
  - annotate type of kvStore in makeVatKeeper
  - move getLastSnapshot up for earlier use
  - refactor: rename snapshotDetail to lastSnapshot
  - factor out getTranscriptEnd
  - vatWarehouse.maybeSaveSnapshot()
  - saveSnapshot:
    - don't require snapStore
    - fix startPos type
  - provide snapstore to vatKeeper via kernelKeeper
  - buildKernel: get snapstore out of hostStorage
  - chore: don't try to snapshot a terminated vat
* feat(swingset): load vats from snapshots
  - don't `setBundle` when loading from snapshot
  - provide startPos to replayTranscript()
  - test reloading a vat
* refactor(vatWarehouse): factor out, test LRU logic
* fix(vat-warehouse): remove vatID from LRU when evicting
* chore(vatKeeper): prune debug logging in saveSnapshot (FIXUP)
* feat(swingset): log bringing vats online (esp from snapshot)
  - manager.replayTranscript returns number of entries replayed
* chore: resolve "skip crank buffering?" issue
  After discussion with CM: maybeSaveSnapshot() happens before commitCrank(), so nothing special is needed here.
* chore: prune makeSnapshot arg from evict()
  Not only is this option not implemented now, but CM's analysis shows that adding it would likely be harmful.
* test(swingset): teardown snap-store
* chore(swingset): initial sketch of snapshot reload test
* refactor: let itemCount be not-optional in StreamPosition
* feat: snapshot early then infrequently
  - refactor: move snapshot decision from vk.saveSnapshot() up to vw.maybeSaveSnapshot
* test: provide getLastSnapshot to mock vatKeeper
* chore: vattp: turn off managerType local work-around
* chore: vat-warehouse: initial snapshot after 2 deliveries
  Integration testing shows this is closer to ideal.
* chore: prune deterministic snapshot assertion
  Oops. Rebase problem.
* chore: fix test-snapstore ld.asset
  Rebase / merge problem?!
* chore: never mind supervisorHash optimization
  With snapshotInitial at 2, there is little reason to snapshot after loading the supervisor bundles. The code doesn't carry its own weight. Plus, it seems to introduce a strange bug with marshal or something...

  ```
  test/test-home.js:37

  36:   const { board } = E.get(home);
  37:   await t.throwsAsync(
  38:     () => E(board).getValue('148'),

  getting a value for a fake id throws
  Returned promise rejected with unexpected exception:

  Error {
    message: 'Remotable (a string) is already frozen',
  }
  ```
* docs(swingset): document lastSnapshot kernel DB key
* refactor: capitalize makeSnapStore consistently
* refactor: replayTranscript caller is responsible to getLastSnapshot
* test(swingset): consistent vat-warehouse test naming
* refactor(swingset): compute transcriptSnapshotStats in vatKeeper
  In an attempt to avoid reading the lastSnapshot DB key if the t.endPosition key was enough information to decide to take a snapshot, the vatWarehouse was peeking into the vatKeeper's business. Let's go with code clarity over (un-measured) performance.
* chore: use harden, not freeze; clarify lru
* chore: use distinct fixture directories to avoid collision
  The "temporary" snapstore directories used by two different tests began to overlap when the tests were moved into the same parent dir, and one test was deleting the directory while the other was still using it (as well as mingling files at runtime), causing an xsnap process to die with an IO error if the tests were run in parallel. This changes the two tests to use distinct directories.

  In the long run, we should either have them use `mktmp` to build a randomly-named known-unique directory, or establish a convention where tempdir names match the name of the test file and case using them, to avoid collisions as we add more tests.

Co-authored-by: Brian Warner <warner@lothar.com>
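The "snapshot early then infrequently" policy described above boils down to roughly the following; the defaults (2 and 200) and the parameter names snapshotInitial and snapshotInterval come from the PR text, while the surrounding function is just an illustrative sketch:

```js
// Decide whether to snapshot after a delivery: snapshot early, to capture the
// expensive contract-bundle evaluation, and then only every N deliveries.
const snapshotInitial = 2;    // deliveries before the first snapshot
const snapshotInterval = 200; // deliveries between subsequent snapshots

function shouldSnapshot(deliveriesSinceLastSnapshot, hasSnapshotAlready) {
  const threshold = hasSnapshotAlready ? snapshotInterval : snapshotInitial;
  return deliveriesSinceLastSnapshot >= threshold;
}
```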
What is the Problem Being Solved?
When an xsnap-based vat worker is told to snapshot the heap state, the result is a big chunk of bytes (roughly 430kB for an empty heap, which compresses down to maybe 32kB). We must nominally record this snapshot in the same atomic kernelDB commit that records the truncation of the transcript (and the results of the most recent crank). The snapshot size, even compressed, is too large to comfortably live in the database, but it must obey the same transactionality: if the swingset process exits without committing the block, the successor process must see the earlier state.

Description of the Design
I'm thinking that we should create/manage immutable snapshot files, use a hash-based identifier to name each one, keep a pool of snapshots indexed by their hash, and store only the hashes in the database.
When a snapshot is taken, we have xsnap write the snapshot to a temporary file. Then we compress that file into a different temporary file. We hash (SHA512, or blake3 if we wanna be like the cool kids) either the original snapshot or the compressed version to create a hex identifier string. Then we rename the compressed file to that hex string, and move it into the pool directory. We return the hex string to the caller, who records it in the DB (next to the newly-truncated transcript) as the current state of the vat.
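A minimal sketch of that save path using Node built-ins (gzip for the compression, SHA-512 of the uncompressed bytes for the name); the real snapStore may differ in details such as streaming, async I/O, or hashing the compressed form instead:

```js
import fs from 'fs';
import path from 'path';
import zlib from 'zlib';
import crypto from 'crypto';

// xsnap has already written the raw snapshot to `rawTmpPath`. Compress it,
// hash it, and rename the compressed file into the pool under its hash; the
// returned hex string is what the caller records in the kernel DB.
function commitSnapshot(poolDir, rawTmpPath) {
  const raw = fs.readFileSync(rawTmpPath);
  const hash = crypto.createHash('sha512').update(raw).digest('hex');
  // Node's zlib writes no filename or mtime into the gzip header, which helps
  // keep the compressed output deterministic for a given input.
  const compressed = zlib.gzipSync(raw, { level: 9 });
  const tmpOut = path.join(poolDir, `tmp-${hash}`);
  fs.writeFileSync(tmpOut, compressed);
  fs.renameSync(tmpOut, path.join(poolDir, `${hash}.gz`));
  return hash;
}
```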
When creating a vat from a snapshot, the pool manager is given the hex string. This manager finds the matching file in the pool, decompresses it, and uses the decompressed data in the xsnap "load" command.

Files in the pool are never modified. The filenames do not identify the vat they were taken from. The compression step should be deterministic (keep an eye out for zlib/gzip timestamps) on general principles.

We'll need some sort of cleanup mechanism: if the fully-committed DB state does not reference a snapshot ID, the corresponding file can be deleted from the pool. This deletion must not happen until the DB state is finalized as part of the historical chain state, which happens after the Tendermint voting round on a block which contains that DB state. Most of the time that happens within 5 seconds of the snapshot operation, but various exception conditions could make it happen much later.
Perhaps the kernel DB should include a reference count for the snapshot IDs, so an external process can easily look at the DB and tell whether a given snapshot file can be deleted. It might improve efficiency if each time the kernel modifies a vat record with a new snapshot id, it publishes the old snapshot ID to the pool manager as a "maybe ready for deletion" candidate. Then, after some period of time, the pool manager can query the DB for the reference counts of all candidates to figure out which ones to delete, rather than performing a full sweep of all snapshots.
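One way such a candidate list could look, assuming a `snapshot.refCount.${id}` key scheme that is not (yet) part of any existing swingstore API:

```js
import fs from 'fs';
import path from 'path';

// The kernel pushes superseded snapshot IDs here; later, the pool manager
// checks the committed DB before unlinking. The sweep must only run against
// DB state that has already been finalized on-chain.
const deletionCandidates = new Set();

function noteSuperseded(snapshotID) {
  deletionCandidates.add(snapshotID);
}

function sweep(kvStore, poolDir) {
  for (const id of deletionCandidates) {
    const refCount = Number(kvStore.get(`snapshot.refCount.${id}`) || '0');
    if (refCount === 0) {
      fs.unlinkSync(path.join(poolDir, `${id}.gz`));
    }
    deletionCandidates.delete(id);
  }
}
```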
The kernel setup calls accept a "swingStore" argument which gives the kernel access to the DB. (The host provides this swingstore because only the host knows when a block is complete and the data should be committed). We need an API that also gives the snapshot pool manager enough information to do its job, perhaps a subdirectory of the swingset base directory. We might fold this into the swingStore, with APIs beyond the key-value get/set/delete/getKeys methods: these additional methods would bypass the "block buffer" commit mechanism.
Security Considerations
We should probably verify the hash as we read the file back in, and treat any mismatch as an unreadable file.
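For example, a sketch that mirrors the save-path sketch above (gzip, SHA-512 of the uncompressed bytes):

```js
import fs from 'fs';
import path from 'path';
import zlib from 'zlib';
import crypto from 'crypto';

// Decompress a pool file and confirm its contents still match the hash that
// names it; any mismatch is treated the same as a missing or unreadable file.
function loadSnapshot(poolDir, expectedHash) {
  const compressed = fs.readFileSync(path.join(poolDir, `${expectedHash}.gz`));
  const raw = zlib.gunzipSync(compressed);
  const actual = crypto.createHash('sha512').update(raw).digest('hex');
  if (actual !== expectedHash) {
    throw Error(`snapshot ${expectedHash} failed verification; refusing to load`);
  }
  return raw; // hand this to the xsnap "load" command
}
```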
Test Plan
cc @FUDCo @michaelfig