idea: remediation vat, dirty-vat-flag -driven BOYD #8877
Labels: enhancement, liveslots, SwingSet
What is the Problem Being Solved?
I've been thinking about how we remediate the current large vats on mainnet. As of 08-feb-2024 (run-21), we have:

* cross-vat reference cycles retained by v9-zoe (in `seatHandleToZoeSeatAdmin`), as described in #8401 ("ExitObject/SeatHandle cross-vat reference cycle retains old objects")
* large numbers of QuotePayment objects retained in the price-feed vats

The QuotePayments are mostly consuming vatstore slots (about 167k of the scaled-authority ones are weakly recognizable from vat-board, but none of the rest are exported). Terminating these vats will cause a large burst of kernel time to delete all that state on behalf of the vat, as well as the GC retirement actions for the exported ones.
The Zoe cycles can be broken by terminating the non-Zoe vat involved, and most of the cycles are to the price-feed vats (simply because they have processed so many offers). In addition to the QuotePayment cleanup work, terminating these vats will trigger a huge GC drop action, as the dead price-feed vats stop importing Zoe's `seatHandle` exports. The next v9-zoe BOYD will then process 280k-ish dead references, doing a dozen syscalls on each one, and then retiring the same number of exported objects.

I've got a collection of tasks/features that would help to handle this burst of activity, by rate-limiting all the various parts:
* rate-limit processing of the kernel GC action queue (`JSON.parse` the queue contents from the one kernelKeeper/kvStore entry, process only part of it, and then put back the remainder with `JSON.stringify`); see the sketch after this list
* give `bringOutYourDead` a limited budget, so it only processes part of `possiblyDeadSet`, leaving the rest for later? how should the kernel prioritize/schedule this work, so it all happens eventually, but we reserve time for useful work too?
* when a large virtual collection is deleted (because userspace called `.clear()` on it, or because the collection itself became unreachable and BOYD collected it), spread the entry deletion work out over time? Again, how should this be driven and scheduled?
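A minimal sketch of what the first item might look like, assuming the GC action queue really is a JSON array stored under a single kvStore key (spelled `gcActions` here only for illustration); the function and parameter names are hypothetical:

```js
// Hypothetical rate-limited drain of the GC action queue. The 'gcActions'
// key name, the kvStore get/set shape, and the budget default are all
// illustrative, not the real kernelKeeper API.
function processSomeGCActions(kvStore, processAction, budget = 50) {
  const queue = JSON.parse(kvStore.get('gcActions') || '[]');
  const batch = queue.slice(0, budget); // take only a bounded slice this time
  for (const action of batch) {
    processAction(action); // e.g. deliver dropExports/retireImports to a vat
  }
  // put the unprocessed remainder back for a later pass
  kvStore.set('gcActions', JSON.stringify(queue.slice(budget)));
  return queue.length > budget; // true if more work remains
}
```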
We plan to replace these price-feed vats (v29, v46, v68, v69) rather than upgrade them, because their code is not designed to remain functional after an upgrade. Our current plan is to leave them "parked", unreferenced and unused, until we figure out a remediation scheme. So the cycle/QuotePayment problem will stop growing, but we won't actually reclaim any space until some time in the future.
It occurred to me that we might get a faster solution by rate-limiting the source, in the original price-feed vats, by "upgrading" them to a special image whose only job is to (slowly) shed its state. This would be a generic "wind-down" vat image, not using liveslots (making it a "raw vat"), which would delete a little bit of state each time it is invoked, and when the last bit of state is gone, it terminates itself.
Description of the Design
First, I think we'd change the way BOYD is scheduled. Currently, we track a counter on each vat (the `reapCountdown`), and trigger a BOYD every `reapInterval` deliveries (currently 1000). In addition, every `kernelKeeper.getSnapshotInterval()` (currently 200) deliveries we perform a heap snapshot, which does a BOYD as a side-effect. @mhofman has advocated for a computron-based schedule, which would have the great property that it would mostly-directly limit the vat-page-in replay cost. But I'm thinking that we go for a more aggressive "keep it clean" scheduler:

* each delivery marks the receiving vat as "dirty", adding its vatID to an ordered dirty-vat list (if it isn't already there)
* GC deliveries like `dispatch.dropExports` mark the vat dirty too, just as much as `dispatch.deliver`
* the kernel exposes a `controller.runCleanup()`, which performs a BOYD of the one vat at the front of the list, if any (see the sketch below)

Then the host application can arrange to do `runCleanup()` at the end of the other runs, if and only if none of the other runs did any work. (Obviously this needs some more thinking, like sometimes doing a BOYD anyway even if the chain is never idle, maybe with a "really dirty list" or something that tracks computrons in addition to a dirty flag.)
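A minimal sketch of the dirty-vat bookkeeping this implies, assuming the list can live in kernel state next to the existing `reapCountdown` data; the names `dirtyVats`, `markVatDirty`, and the `deliverBOYD` helper are hypothetical:

```js
// Illustrative kernel-side sketch, not the real kernelKeeper/controller code.
const dirtyVats = []; // ordered vatIDs awaiting a BOYD (would live in kernel state)

// called after every delivery, including GC deliveries like dispatch.dropExports
function markVatDirty(vatID) {
  if (!dirtyVats.includes(vatID)) {
    dirtyVats.push(vatID);
  }
}

// exposed to the host as controller.runCleanup(): BOYD the oldest dirty vat
async function runCleanup(deliverBOYD) {
  const vatID = dirtyVats.shift();
  if (vatID === undefined) {
    return false; // nothing was dirty, no cleanup work done
  }
  await deliverBOYD(vatID); // a dispatch.bringOutYourDead() to that one vat
  return true;
}
```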
On mainnet, the most common action is a PushPrice, which touches six vats (8 deliveries to v10-bridge, 18 to v43-walletFactory, 11 to v9-zoe, 7 to v7-board, 7 to a price-feed vat like v68-stATOM-USD_price_feed, and 1 to v5-timer). The order of vats being touched is v10, v43, v7, v68, v9, v5. Assuming the chain is mostly idle, this scheduler would cause a BOYD to v10 on the second block, which might cause GC actions to go into other vats, but wouldn't disturb their places in the list. The third block would see a BOYD to v43, the fourth would BOYD v7, etc. The last BOYD would be in the seventh block (assuming none of the BOYDs caused the dirt to spread).
We have two price feeds right now (ATOM-USD and stATOM-USD), and we average about one PushPrice action per minute. We produce about 10 blocks in that time, so with this scheduler, each minute we'd see one PushPrice block, six cleanup blocks (with one BOYD each), and three empty blocks.
Next, we define a "wind-down vat image". I think this might just be liveslots, but with a special mode flag that tells it to act as a remediation tool (perhaps a new argument to `startVat`). In this mode, it wouldn't call the vat image's `buildRootObject()` function; in fact it wouldn't even `importBundle` the vat code at all. Every `dispatch.deliver` would immediately reject the result promise, and every `dispatch.notify` would be ignored. The GC deliveries would modify the export-status table (`vs.vom.es.${vref}`) but are otherwise ignored.

The real work would happen during BOYD. The wind-down image would have a list of cleanup work to do, and each BOYD lets it do a tiny little piece of this work (maybe 5 or 10 items). The cleanup work is organized entirely around the vatstore, where the goal is to delete all of it. The list would look like this (a sketch of the processing loop follows the list):
* walk the `vs.vom.rc.${vref}` (reference count) entries
  * for `o-NN` vrefs (imports/Presences), do a `syscall.dropImport` and `syscall.retireImport` (the same vref may also appear in `vs.vom.ir`, maybe check both at the same time)
* walk the `vs.vom.ir` (weak/recognizable reference) entries
  * for each, do a `syscall.retireImport`
* walk the `vs.vom.es.${baseRef}` (export status) entries
  * for `r` ("reachable"), do a `syscall.abandonExport()`
  * for `s` ("seen", aka recognizable), do a `syscall.retireExport()`
  * then delete the `.es` entry
* walk the `vs.vom` (virtual-object state data) entries and delete each one
  * also the `vom.dkind.NN.nextID` and `.descriptor` entries
* walk the `vs.vc` (virtual-collection) entries and delete each one
  * also `vc.NN.|entryCount`, `|nextOrdinal`, `|schemata`, and the ordinal records
* finally, `syscall.exitVat('cleanup complete')`
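A minimal sketch of how a wind-down BOYD pass might walk that list, under several assumptions: that the vat can iterate its own keys with a `vatstoreGetNextKey`-style syscall, that the key prefixes are spelled exactly as in the list above, and that the syscall names and signatures shown here are close enough for illustration. A real implementation would also have to avoid retiring the same vref twice when it appears in both the `.rc` and `.ir` tables.

```js
// Sketch only: wind-down BOYD that deletes a bounded number of vatstore keys
// per call. Syscall names/signatures and key prefixes are illustrative.
const makeWindDownBOYD = syscall => {
  const processKey = key => {
    const value = syscall.vatstoreGet(key);
    if (key.startsWith('vs.vom.rc.')) {
      const vref = key.slice('vs.vom.rc.'.length);
      if (vref.startsWith('o-')) {
        syscall.dropImports([vref]); // release the imported Presence
        syscall.retireImports([vref]); // (real code: skip if already retired via .ir)
      }
    } else if (key.startsWith('vs.vom.ir.')) {
      syscall.retireImports([key.slice('vs.vom.ir.'.length)]);
    } else if (key.startsWith('vs.vom.es.')) {
      const baseRef = key.slice('vs.vom.es.'.length);
      if (value.includes('r')) syscall.abandonExports([baseRef]);
      if (value.includes('s')) syscall.retireExports([baseRef]);
    }
    // everything else (vs.vom state data, dkind descriptors, vs.vc entries)
    // needs no syscalls beyond the deletion itself
    syscall.vatstoreDelete(key);
  };

  // each BOYD processes at most `budget` keys, then reports whether work remains
  return (budget = 10) => {
    let key = syscall.vatstoreGetNextKey(''); // first surviving key
    for (let n = 0; n < budget && key !== undefined; n += 1) {
      const next = syscall.vatstoreGetNextKey(key);
      processKey(key);
      key = next;
    }
    if (key === undefined) {
      syscall.exit(false, 'cleanup complete'); // nothing left: self-terminate
      return false;
    }
    return true;
  };
};
```

Because processed keys are deleted, restarting the walk from the beginning of the keyspace on each BOYD naturally resumes where the previous one left off, so no separate progress cursor needs to be persisted.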
Each BOYD would process 5-10 items, make the relevant syscalls (and always a `vatstoreDelete` of the item processed), then return `true`. Once the last item is deleted and `syscall.exitVat` is invoked, BOYD switches to returning `false` (in case it gets called a few last times before the termination event is processed).

If we do only 10 of these at a time, we shouldn't see more than 20 or 30 syscalls in each BOYD, which avoids concerns about large syscalls or large numbers of syscalls in a single delivery. By having the GC actions mark each vat as dirty, and cleaning them promptly, we avoid building up large `possiblyDeadSet` tables in the surviving vats.

The wind-down image doesn't manage refcounts, so when a VOM or virtual-collection value is deleted, it doesn't bother decrementing the outbound refcounts: it will get to those things eventually, or maybe it has already deleted them, but everything is getting deleted sooner or later, so the order doesn't matter. This lets us do far fewer syscalls than a real `collection.clear()` would require.
As previously-referenced imports are `syscall.dropImports()`'ed, the kernel will remove them from the c-list, and send `dispatch.dropExports()` into the upstream vats. This will dirty those vats, but only a little, so their BOYDs will be quick too.

We still need to honor the rules about what is legal to reference in syscalls: both vat and kernel must agree on the state of the c-list. If the kernel does a `dispatch.dropExports()` of a vref first, the wind-down image must immediately change the export-status table entry from "reachable" to "recognizable", and thus must not do a `syscall.abandonExports()` on that vref later. But the wind-down image might reach that entry first, in which case the kernel should not do a `dropExports` later. Likewise, for unreachable imports, the kernel might do `dispatch.retireImports()` first, or the vat might do `syscall.retireImports()` first, and then the other one must not be done. The same is true for unreachable exports. So GC deliveries will manipulate table entries and cause vatstore changes, but these changes won't propagate refcount changes or provoke other syscalls: mostly they will inhibit other syscalls that would have happened later.
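For illustration, the GC-delivery handlers in the wind-down image might look roughly like this; the handler names match the delivery names, but the exact export-status encoding and the single-facet assumption are simplifications:

```js
// Sketch: inbound GC deliveries only adjust the wind-down image's own records,
// so the later BOYD pass never issues a syscall the kernel would reject.
const makeWindDownGCHandlers = syscall =>
  harden({
    dropExports: vrefs =>
      vrefs.forEach(vref =>
        // kernel dropped reachability first: downgrade 'r' to 's' so BOYD will
        // later retireExports() instead of abandonExports()
        syscall.vatstoreSet(`vs.vom.es.${vref}`, 's'),
      ),
    retireExports: vrefs =>
      vrefs.forEach(vref =>
        // kernel retired it first: forget the entry, no syscall needed later
        syscall.vatstoreDelete(`vs.vom.es.${vref}`),
      ),
    retireImports: vrefs =>
      vrefs.forEach(vref =>
        // kernel retired an unreachable import first: drop our recognizer
        // record so BOYD does not retireImports() it a second time
        syscall.vatstoreDelete(`vs.vom.ir.${vref}`),
      ),
  });
```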
There are some low-cardinality vatstore keys that can either be deleted by a final wind-down pass, or left for the kernel to delete as part of the normal vat-termination function: things like `baggageID`, `idCounters`, `kindIDID`, `storeKindIDTable`, `watchedPromiseTableID`, and `watcherTableID`.

The act of upgrading to the wind-down image will abandon any merely-virtual exports, and disconnect/reject any promises, so upon entry, the c-list should only have imports and durable exports. By the time the wind-down code is complete, all the exports will be gone (abandoned), and the only imports left will be ones that were only held by RAM in the original image (since any held by virtual data would have an `rc` refcount, and will be dropped). The vatstore does not have enough information to enumerate these remaining imports, but since this case was supposed to be low-cardinality (else the original vat would have been spending a lot of RAM on them), the quantity should be low.

At that point, vat-termination should only have to delete these leftover ephemeral imports from the c-list (propagating decrefs upstream to their exporting vats). Everything else should be gone by then, so termination should be cheap.
To drive this, I'm thinking of a kernel API like `controller.windDown(vatID)`, which enqueues a run-queue event similar to `upgrade-vat` (or how `terminate-vat` should probably be handled in the future). This would do the same final BOYD as `upgrade-vat`, but would then change the vat's metadata to indicate that we're in wind-down mode.
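How a host application such as cosmic-swingset might drive this, assuming the proposed `controller.windDown()` and `controller.runCleanup()` APIs exist; the block-loop shape and helper names here are invented for illustration:

```js
// Sketch of host-side usage; not an existing cosmic-swingset API.
async function startRemediation(controller) {
  // one-time: switch the parked price-feed vats into wind-down mode
  for (const vatID of ['v29', 'v46', 'v68', 'v69']) {
    await controller.windDown(vatID);
  }
}

async function afterBlock(controller, didUsefulWork) {
  // spend otherwise-idle blocks on cleanup, one BOYD per block
  if (!didUsefulWork) {
    await controller.runCleanup();
  }
}
```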
Completion Rates

On 23-jan-2024, v29 had 383k QuotePayment objects, and participated in some fraction of Zoe's 258k cycles. It had about 5.2M vatstore keys. If we marked it as winding down and deleted 10 keys in each block, it would take 520k blocks, 3.1M seconds, or 36 days to finish remediating everything, in addition to whatever other vats are involved (so perhaps twice that time). At that rate, we could probably comfortably remediate all four price-feed vats in about four months of constant low-rate background work.
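The arithmetic behind that estimate, spelled out (the ~6-second block time is the assumption implied by 520k blocks ≈ 3.1M seconds):

```js
// Back-of-the-envelope check of the wind-down duration for v29.
const keys = 5_200_000; // vatstore keys in v29
const keysPerBlock = 10; // deletion budget per block
const secondsPerBlock = 6; // assumed average block time
const blocks = keys / keysPerBlock; // 520,000 blocks
const seconds = blocks * secondsPerBlock; // 3,120,000 seconds
const days = seconds / 86_400; // ≈ 36 days
console.log({ blocks, seconds, days: Math.round(days) });
```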
Compatibility Considerations
The wind-down vat needs detailed knowledge of how the previous vat's liveslots used the vatstore, so it can correctly interpret the data it finds there. If we release a new version of liveslots (which adds some new category of data), we add to the number of formats which the wind-down code might encounter. So it needs to both handle all such formats and have some way to determine which format is in use. This is the same requirement that a new version of normal liveslots faces (in a regular upgrade instead of a wind-down).
So at the very least, I think it makes sense for the wind-down code to be owned by the `swingset-liveslots` package. And it is probably a good idea for it to just be a variant of the normal liveslots code.

One option is to change the liveslots `startVat(vatParametersCapData)` delivery into `startVat(vpcd, mode)`, and use `mode = 'wind-down'` to activate this behavior. Another is to have liveslots export both `makeLiveSlots` and `makeWindDownAgent`, and change `supervisor-subprocess-xsnap.js` to add a `windDown` command to the supervisor protocol, next to the existing `setBundle` and `deliver`. In this latter approach, the kernel wouldn't even provide a vat bundle to the worker, which would be faster and less confusing than giving it a bundle that never gets used.
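A rough sketch of that second option, as a branch in the supervisor's command handler; `makeWindDownAgent`, the `windDown` command, and the message shapes are all proposals from this issue rather than existing code, and the `makeLiveSlots` call is abbreviated:

```js
// Illustrative supervisor-side dispatch on the proposed command; the real
// supervisor-subprocess-xsnap.js protocol details differ.
import { makeLiveSlots, makeWindDownAgent } from '@agoric/swingset-liveslots';

let dispatch;

function handleSupervisorCommand(command, args, { syscall, vatID }) {
  switch (command) {
    case 'setBundle':
      // normal path: evaluate the vat bundle and build a full liveslots dispatch
      dispatch = makeLiveSlots(syscall, vatID, ...args).dispatch;
      return;
    case 'windDown':
      // wind-down path: no bundle at all, just the state-shedding agent
      dispatch = makeWindDownAgent(syscall, vatID).dispatch;
      return;
    case 'deliver':
      return dispatch(...args);
    default:
      throw Error(`unrecognized command ${command}`);
  }
}
```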
A "raw vat", which doesn't use liveslot, does not provide an ocap environment to any userspace code. The code is still confined to the vat as a whole (the raw code cannot forge c-list entries that weren't already granted to the previous vat), but it will be a different form of programming. We'll want to review it carefully to make sure it isn't calling
syscall.send
orsyscall.resolve
in ways that might exercise authority that was previously auditable under normal ocap rules.Scaling Considerations
Scaling Considerations

We'll need some way to decide how much work each step of the wind-down process should take. If we can change the worker protocol, we might add a `budget` argument to `dispatch.bringOutYourDead()` (just as in the rate-limited GC work), so at least it's the kernel's decision, where we can tweak things slightly more easily. It might also be helpful for the worker to return some indication of how much work is left to do (if it can compute this cheaply, which is not a given), so the kernel's scheduler can be influenced.

A slightly higher-level approach would involve the wind-down vat using a Timer to schedule its own work. This would be harder to set up, and would prevent the kernel from adjusting the process in reaction to the chain/kernel as a whole being busy. However, it wouldn't require much scheduling help from the kernel: instead of a constant stream of `controller.runCleanup()` calls in every block, the wind-down image would just quietly do periodic cleanup, and then eventually self-terminate.

In the long run, I'd prefer that the kernel have a way to schedule cleanup itself, rather than forcing application authors to decide when to run `controller.runCleanup()`. But I think it won't be hard to adapt this solution into a more generalized scheduler later, which could have enough API surface for applications like the chain to be able to say "we didn't do much else in this (unit of work), feel free to do some cleanup now".
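If the worker protocol can carry a budget and a "work remaining" indication, the kernel's cleanup step might look like this; the delivery shape `['bringOutYourDead', budget]` and the `workRemaining` field are hypothetical:

```js
// Sketch of a budget-aware cleanup step on the kernel side; not the real
// vat-warehouse API.
async function cleanupStep(deliverToVat, vatID, budget = 10) {
  const result = await deliverToVat(vatID, ['bringOutYourDead', budget]);
  // keep the vat on the dirty list only if it reports more work to do
  return result.workRemaining === true;
}
```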
Test Plan

Unit tests on the kernel, to demonstrate the functionality works.

Somehow add a unit test to cosmic-swingset, to demonstrate that it can be invoked correctly.

Manual performance tests on a main-fork image which winds down a large price-feed vat, to measure how fast the deletion proceeds, and how much load it represents.
Upgrade Considerations
We should be able to safely upgrade to a kernel capable of doing this, without actually triggering the behavior. An application-level upgrade handler would need to trigger the behavior.
We might consider a userspace trigger instead, similar to `adminNode~.terminateVat()`. In fact we might decide that any userspace call to `terminateVat()` should really start the wind-down process, and only have the vat itself trigger the real termination.