idea: remediation vat, dirty-vat-flag -driven BOYD #8877
Labels: enhancement, liveslots, SwingSet
What is the Problem Being Solved?
I've been thinking about how we remediate the current large vats on mainnet. As of 08-feb-2024 (run-21), we have:

* cross-vat reference cycles retained by v9-zoe (in `seatHandleToZoeSeatAdmin`), as described in #8401 ("ExitObject/SeatHandle cross-vat reference cycle retains old objects")
* large numbers of QuotePayment objects retained in the price-feed vats

The QuotePayments are mostly consuming vatstore slots (about 167k of the scaled-authority ones are weakly recognizable from vat-board, but none of the rest are exported). Terminating these vats will cause a large burst of kernel time to delete all that state on behalf of the vat, as well as the GC retirement actions for the exported ones.
The Zoe cycles can be broken by terminating the non-Zoe vat involved, and most of the cycles are to the price-feed vats (simply because they have processed so many offers). In addition to the QuotePayment cleanup work, terminating these vats will trigger a huge GC drop action, as the dead price-feed vats stop importing Zoe's `seatHandle` exports. The next v9-zoe BOYD will then process 280k-ish dead references, doing a dozen syscalls on each one, and then retiring the same number of exported objects.

I've got a collection of tasks/features that would help to handle this burst of activity, by rate-limiting all the various parts:
* rate-limit processing of the kernel GC action queue (`JSON.parse` the queue contents from the one kernelKeeper/kvStore entry, process only part of it, and then put back the remainder with `JSON.stringify`); see the sketch after this list
* give `bringOutYourDead` a limited budget, so it only processes part of `possiblyDeadSet`, leaving the rest for later? how should the kernel prioritize/schedule this work, so it all happens eventually, but we reserve time for useful work too?
* when a large virtual collection is deleted (because userspace called `.clear()` on it, or because the collection itself became unreachable and BOYD collected it), spread the entry deletion work out over time? Again, how should this be driven and scheduled?
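A minimal sketch of what the first item might look like, assuming the GC action queue really is a JSON array stored under a single kvStore key (spelled `gcActions` here only for illustration); the function and parameter names are hypothetical:

```js
// Hypothetical rate-limited drain of the GC action queue. The 'gcActions'
// key name, the kvStore get/set shape, and the budget default are all
// illustrative, not the real kernelKeeper API.
function processSomeGCActions(kvStore, processAction, budget = 50) {
  const queue = JSON.parse(kvStore.get('gcActions') || '[]');
  const batch = queue.slice(0, budget); // take only a bounded slice this time
  for (const action of batch) {
    processAction(action); // e.g. deliver dropExports/retireImports to a vat
  }
  // put the unprocessed remainder back for a later pass
  kvStore.set('gcActions', JSON.stringify(queue.slice(budget)));
  return queue.length > budget; // true if more work remains
}
```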
We plan to replace these price-feed vats (v29, v46, v68, v69) rather than upgrade them, because their code is not designed to remain functional after an upgrade. Our current plan is to leave them "parked", unreferenced and unused, until we figure out a remediation scheme. So the cycle/QuotePayment problem will stop growing, but we won't actually reclaim any space until some time in the future.
It occurred to me that we might get a faster solution by rate-limiting the source, in the original price-feed vats, by "upgrading" them to a special image whose only job is to (slowly) shed its state. This would be a generic "wind-down" vat image, not using liveslots (making it a "raw vat"), which would delete a little bit of state each time it is invoked, and when the last bit of state is gone, it terminates itself.
Description of the Design
First, I think we'd change the way BOYD is scheduled. Currently, we track a counter on each vat (the `reapCountdown`), and trigger a BOYD every `reapInterval` deliveries (currently 1000). In addition, every `kernelKeeper.getSnapshotInterval()` (currently 200) deliveries we perform a heap snapshot, which does a BOYD as a side-effect. @mhofman has advocated for a computron-based schedule, which would have the great property that it would mostly-directly limit the vat-page-in replay cost. But I'm thinking that we go for a more aggressive "keep it clean" scheduler:

* each delivery marks the receiving vat as "dirty", adding its vatID to an ordered dirty-vat list (if it isn't already there)
* GC deliveries like `dispatch.dropExports` mark the vat dirty too, just as much as `dispatch.deliver`
* the kernel exposes a `controller.runCleanup()`, which performs a BOYD of the one vat at the front of the list, if any (see the sketch below)

Then the host application can arrange to do `runCleanup()` at the end of the other runs, if and only if none of the other runs did any work. (Obviously this needs some more thinking, like sometimes doing a BOYD anyway even if the chain is never idle, maybe with a "really dirty list" or something that tracks computrons in addition to a dirty flag.)
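A minimal sketch of the dirty-vat bookkeeping this implies, assuming the list can live in kernel state next to the existing `reapCountdown` data; the names `dirtyVats`, `markVatDirty`, and the `deliverBOYD` helper are hypothetical:

```js
// Illustrative kernel-side sketch, not the real kernelKeeper/controller code.
const dirtyVats = []; // ordered vatIDs awaiting a BOYD (would live in kernel state)

// called after every delivery, including GC deliveries like dispatch.dropExports
function markVatDirty(vatID) {
  if (!dirtyVats.includes(vatID)) {
    dirtyVats.push(vatID);
  }
}

// exposed to the host as controller.runCleanup(): BOYD the oldest dirty vat
async function runCleanup(deliverBOYD) {
  const vatID = dirtyVats.shift();
  if (vatID === undefined) {
    return false; // nothing was dirty, no cleanup work done
  }
  await deliverBOYD(vatID); // a dispatch.bringOutYourDead() to that one vat
  return true;
}
```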
On mainnet, the most common action is a PushPrice, which touches six vats (8 deliveries to v10-bridge, 18 to v43-walletFactory, 11 to v9-zoe, 7 to v7-board, 7 to a price-feed vat like v68-stATOM-USD_price_feed, and 1 to v5-timer). The order of vats being touched is v10, v43, v7, v68, v9, v5. Assuming the chain is mostly idle, this scheduler would cause a BOYD to v10 on the second block, which might cause GC actions to go into other vats, but wouldn't disturb their places in the list. The third block would see a BOYD to v43, the fourth would BOYD v7, etc. The last BOYD would be in the seventh block (assuming none of the BOYDs caused the dirt to spread).
We have two price feeds right now (ATOM-USD and stATOM-USD), and we average about one PushPrice action per minute. We produce about 10 blocks in that time, so with this scheduler, each minute we'd see one PushPrice block, six cleanup blocks (with one BOYD each), and three empty blocks.
Next, we define a "wind-down vat image". I think this might just be liveslots, but with a special mode flag that tells it to act as a remediation tool (perhaps a new argument to `startVat`). In this mode, it wouldn't call the vat image's `buildRootObject()` function; in fact it wouldn't even `importBundle` the vat code at all. Every `dispatch.deliver` would immediately reject the result promise, and every `dispatch.notify` would be ignored. The GC deliveries would modify the export-status table (`vs.vom.es.${vref}`) but are otherwise ignored.

The real work would happen during BOYD. The wind-down image would have a list of cleanup work to do, and each BOYD lets it do a tiny little piece of this work (maybe 5 or 10 items). The cleanup work is organized entirely around the vatstore, where the goal is to delete all of it. The list would look like this (a sketch of the processing loop follows the list):
* walk the `vs.vom.rc.${vref}` (reference count) entries
  * for `o-NN` vrefs (imports/Presences), do a `syscall.dropImport` and `syscall.retireImport` (the same vref may also appear in `vs.vom.ir`, maybe check both at the same time)
* walk the `vs.vom.ir` (weak/recognizable reference) entries
  * for each, do a `syscall.retireImport`
* walk the `vs.vom.es.${baseRef}` (export status) entries
  * for `r` ("reachable"), do a `syscall.abandonExport()`
  * for `s` ("seen", aka recognizable), do a `syscall.retireExport()`
  * then delete the `.es` entry
* walk the `vs.vom` (virtual-object state data) entries and delete each one
  * also the `vom.dkind.NN.nextID` and `.descriptor` entries
* walk the `vs.vc` (virtual-collection) entries and delete each one
  * also `vc.NN.|entryCount`, `|nextOrdinal`, `|schemata`, and the ordinal records
* finally, `syscall.exitVat('cleanup complete')`
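A minimal sketch of how a wind-down BOYD pass might walk that list, under several assumptions: that the vat can iterate its own keys with a `vatstoreGetNextKey`-style syscall, that the key prefixes are spelled exactly as in the list above, and that the syscall names and signatures shown here are close enough for illustration. A real implementation would also have to avoid retiring the same vref twice when it appears in both the `.rc` and `.ir` tables.

```js
// Sketch only: wind-down BOYD that deletes a bounded number of vatstore keys
// per call. Syscall names/signatures and key prefixes are illustrative.
const makeWindDownBOYD = syscall => {
  const processKey = key => {
    const value = syscall.vatstoreGet(key);
    if (key.startsWith('vs.vom.rc.')) {
      const vref = key.slice('vs.vom.rc.'.length);
      if (vref.startsWith('o-')) {
        syscall.dropImports([vref]); // release the imported Presence
        syscall.retireImports([vref]); // (real code: skip if already retired via .ir)
      }
    } else if (key.startsWith('vs.vom.ir.')) {
      syscall.retireImports([key.slice('vs.vom.ir.'.length)]);
    } else if (key.startsWith('vs.vom.es.')) {
      const baseRef = key.slice('vs.vom.es.'.length);
      if (value.includes('r')) syscall.abandonExports([baseRef]);
      if (value.includes('s')) syscall.retireExports([baseRef]);
    }
    // everything else (vs.vom state data, dkind descriptors, vs.vc entries)
    // needs no syscalls beyond the deletion itself
    syscall.vatstoreDelete(key);
  };

  // each BOYD processes at most `budget` keys, then reports whether work remains
  return (budget = 10) => {
    let key = syscall.vatstoreGetNextKey(''); // first surviving key
    for (let n = 0; n < budget && key !== undefined; n += 1) {
      const next = syscall.vatstoreGetNextKey(key);
      processKey(key);
      key = next;
    }
    if (key === undefined) {
      syscall.exit(false, 'cleanup complete'); // nothing left: self-terminate
      return false;
    }
    return true;
  };
};
```

Because processed keys are deleted, restarting the walk from the beginning of the keyspace on each BOYD naturally resumes where the previous one left off, so no separate progress cursor needs to be persisted.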
Each BOYD would process 5-10 items, make the relevant syscalls (and always a `vatstoreDelete` of the item processed), then return `true`. Once the last item is deleted and `syscall.exitVat` is invoked, BOYD switches to returning `false` (in case it gets called a few last times before the termination event is processed).

If we do only 10 of these at a time, we shouldn't see more than 20 or 30 syscalls in each BOYD, which avoids concerns about large syscalls or large numbers of syscalls in a single delivery. By having the GC actions mark each vat as dirty, and cleaning them promptly, we avoid building up large `possiblyDeadSet` tables in the surviving vats.

The wind-down image doesn't manage refcounts, so when a VOM or virtual-collection value is deleted, it doesn't bother decrementing the outbound refcounts: it will get to those things eventually, or maybe it has already deleted them, but everything is getting deleted sooner or later, so the order doesn't matter. This lets us do far fewer syscalls than a real `collection.clear()` would require.
As previously-referenced imports are `syscall.dropImports()`'ed, the kernel will remove them from the c-list, and send `dispatch.dropExports()` into the upstream vats. This will dirty those vats, but only a little, so their BOYDs will be quick too.

We still need to honor the rules about what is legal to reference in syscalls: both vat and kernel must agree on the state of the c-list. If the kernel does a `dispatch.dropExports()` of a vref first, the wind-down image must immediately change the export-status table entry from "reachable" to "recognizable", and thus must not do a `syscall.abandonExports()` on that vref later. But the wind-down image might reach that entry first, in which case the kernel should not do a `dropExports` later. Likewise, for unreachable imports, the kernel might do `dispatch.retireImports()` first, or the vat might do `syscall.retireImports()` first, and then the other one must not be done. The same is true for unreachable exports. So GC deliveries will manipulate table entries and cause vatstore changes, but these changes won't propagate refcount changes or provoke other syscalls: mostly they will inhibit other syscalls that would have happened later.
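For illustration, the GC-delivery handlers in the wind-down image might look roughly like this; the handler names match the delivery names, but the exact export-status encoding and the single-facet assumption are simplifications:

```js
// Sketch: inbound GC deliveries only adjust the wind-down image's own records,
// so the later BOYD pass never issues a syscall the kernel would reject.
const makeWindDownGCHandlers = syscall =>
  harden({
    dropExports: vrefs =>
      vrefs.forEach(vref =>
        // kernel dropped reachability first: downgrade 'r' to 's' so BOYD will
        // later retireExports() instead of abandonExports()
        syscall.vatstoreSet(`vs.vom.es.${vref}`, 's'),
      ),
    retireExports: vrefs =>
      vrefs.forEach(vref =>
        // kernel retired it first: forget the entry, no syscall needed later
        syscall.vatstoreDelete(`vs.vom.es.${vref}`),
      ),
    retireImports: vrefs =>
      vrefs.forEach(vref =>
        // kernel retired an unreachable import first: drop our recognizer
        // record so BOYD does not retireImports() it a second time
        syscall.vatstoreDelete(`vs.vom.ir.${vref}`),
      ),
  });
```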
There are some low-cardinality vatstore keys that can either be deleted by a final wind-down pass, or left for the kernel to delete as part of the normal vat-termination function: things like `baggageID`, `idCounters`, `kindIDID`, `storeKindIDTable`, `watchedPromiseTableID`, and `watcherTableID`.

The act of upgrading to the wind-down image will abandon any merely-virtual exports, and disconnect/reject any promises, so upon entry, the c-list should only have imports and durable exports. By the time the wind-down code is complete, all the exports will be gone (abandoned), and the only imports left will be ones that were only held by RAM in the original image (since any held by virtual data would have an `rc` refcount, and will be dropped). The vatstore does not have enough information to enumerate these remaining imports, but since this case was supposed to be low-cardinality (else the original vat would have been spending a lot of RAM on them), the quantity should be low.

At that point, vat-termination should only have to delete these leftover ephemeral imports from the c-list (propagating decrefs upstream to their exporting vats). Everything else should be gone by then, so termination should be cheap.
To drive this, I'm thinking of a kernel API like `controller.windDown(vatID)`, which enqueues a run-queue event similar to `upgrade-vat` (or how `terminate-vat` should probably be handled in the future). This would do the same final BOYD as `upgrade-vat`, but would then change the vat's metadata to indicate that we're in wind-down mode.
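How a host application such as cosmic-swingset might drive this, assuming the proposed `controller.windDown()` and `controller.runCleanup()` APIs exist; the block-loop shape and helper names here are invented for illustration:

```js
// Sketch of host-side usage; not an existing cosmic-swingset API.
async function startRemediation(controller) {
  // one-time: switch the parked price-feed vats into wind-down mode
  for (const vatID of ['v29', 'v46', 'v68', 'v69']) {
    await controller.windDown(vatID);
  }
}

async function afterBlock(controller, didUsefulWork) {
  // spend otherwise-idle blocks on cleanup, one BOYD per block
  if (!didUsefulWork) {
    await controller.runCleanup();
  }
}
```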
Completion Rates

On 23-jan-2024, v29 had 383k QuotePayment objects, and participated in some fraction of Zoe's 258k cycles. It had about 5.2M vatstore keys. If we marked it as winding down and deleted 10 keys in each block, it would take 520k blocks, 3.1M seconds, or 36 days to finish remediating everything, in addition to whatever other vats are involved (so perhaps twice that time). At that rate, we could probably comfortably remediate all four price-feed vats in about four months of constant low-rate background work.
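The arithmetic behind that estimate, spelled out (the ~6-second block time is the assumption implied by 520k blocks ≈ 3.1M seconds):

```js
// Back-of-the-envelope check of the wind-down duration for v29.
const keys = 5_200_000; // vatstore keys in v29
const keysPerBlock = 10; // deletion budget per block
const secondsPerBlock = 6; // assumed average block time
const blocks = keys / keysPerBlock; // 520,000 blocks
const seconds = blocks * secondsPerBlock; // 3,120,000 seconds
const days = seconds / 86_400; // ≈ 36 days
console.log({ blocks, seconds, days: Math.round(days) });
```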
Compatibility Considerations
The wind-down vat needs detailed knowledge of how the previous vat's liveslots used the vatstore, so it can correctly interpret the data it finds there. If we release a new version of liveslots (which adds some new category of data), we add to the number of formats which the wind-down code might encounter. So it needs to both handle all such formats and have some way to determine which format is in use. This is the same requirement that a new version of normal liveslots faces (in a regular upgrade instead of a wind-down).
So at the very least, I think it makes sense for the wind-down code to be owned by the `swingset-liveslots` package. And it is probably a good idea for it to just be a variant of the normal liveslots code.

One option is to change the liveslots `startVat(vatParametersCapData)` delivery into `startVat(vpcd, mode)`, and use `mode = 'wind-down'` to activate this behavior. Another is to have liveslots export both `makeLiveSlots` and `makeWindDownAgent`, and change `supervisor-subprocess-xsnap.js` to add a `windDown` command to the supervisor protocol, next to the existing `setBundle` and `deliver`. In this latter approach, the kernel wouldn't even provide a vat bundle to the worker, which would be faster and less confusing than giving it a bundle that never gets used.
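A rough sketch of that second option, as a branch in the supervisor's command handler; `makeWindDownAgent`, the `windDown` command, and the message shapes are all proposals from this issue rather than existing code, and the `makeLiveSlots` call is abbreviated:

```js
// Illustrative supervisor-side dispatch on the proposed command; the real
// supervisor-subprocess-xsnap.js protocol details differ.
import { makeLiveSlots, makeWindDownAgent } from '@agoric/swingset-liveslots';

let dispatch;

function handleSupervisorCommand(command, args, { syscall, vatID }) {
  switch (command) {
    case 'setBundle':
      // normal path: evaluate the vat bundle and build a full liveslots dispatch
      dispatch = makeLiveSlots(syscall, vatID, ...args).dispatch;
      return;
    case 'windDown':
      // wind-down path: no bundle at all, just the state-shedding agent
      dispatch = makeWindDownAgent(syscall, vatID).dispatch;
      return;
    case 'deliver':
      return dispatch(...args);
    default:
      throw Error(`unrecognized command ${command}`);
  }
}
```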
A "raw vat", which doesn't use liveslot, does not provide an ocap environment to any userspace code. The code is still confined to the vat as a whole (the raw code cannot forge c-list entries that weren't already granted to the previous vat), but it will be a different form of programming. We'll want to review it carefully to make sure it isn't calling
syscall.send
orsyscall.resolve
in ways that might exercise authority that was previously auditable under normal ocap rules.Scaling Considerations
Scaling Considerations

We'll need some way to decide how much work each step of the wind-down process should take. If we can change the worker protocol, we might add a `budget` argument to `dispatch.bringOutYourDead()` (just as in the rate-limited GC work), so at least it's the kernel's decision, where we can tweak things slightly more easily. It might also be helpful for the worker to return some indication of how much work is left to do (if it can compute this cheaply, which is not a given), so the kernel's scheduler can be influenced.

A slightly higher-level approach would involve the wind-down vat using a Timer to schedule its own work. This would be harder to set up, and would prevent the kernel from adjusting the process in reaction to the chain/kernel as a whole being busy. However, it wouldn't require much scheduling help from the kernel: instead of a constant stream of `controller.runCleanup()` calls in every block, the wind-down image would just quietly do periodic cleanup, and then eventually self-terminate.

In the long run, I'd prefer that the kernel have a way to schedule cleanup itself, rather than forcing application authors to decide when to run `controller.runCleanup()`. But I think it won't be hard to adapt this solution into a more generalized scheduler later, which could have enough API surface for applications like the chain to be able to say "we didn't do much else in this (unit of work), feel free to do some cleanup now".
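If the worker protocol can carry a budget and a "work remaining" indication, the kernel's cleanup step might look like this; the delivery shape `['bringOutYourDead', budget]` and the `workRemaining` field are hypothetical:

```js
// Sketch of a budget-aware cleanup step on the kernel side; not the real
// vat-warehouse API.
async function cleanupStep(deliverToVat, vatID, budget = 10) {
  const result = await deliverToVat(vatID, ['bringOutYourDead', budget]);
  // keep the vat on the dirty list only if it reports more work to do
  return result.workRemaining === true;
}
```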
Test Plan

Unit tests on the kernel, to demonstrate the functionality works.

Somehow add a unit test to cosmic-swingset, to demonstrate that it can be invoked correctly.

Manual performance tests on a main-fork image which winds down a large price-feed vat, to measure how fast the deletion proceeds, and how much load it represents.
Upgrade Considerations
We should be able to safely upgrade to a kernel capable of doing this, without actually triggering the behavior. An application-level upgrade handler would need to trigger the behavior.
We might consider a userspace trigger instead, similar to `adminNode~.terminateVat()`. In fact we might decide that any userspace call to `terminateVat()` should really start the wind-down process, and only have the vat itself trigger the real termination.