kernel should retire abandoned non-reachable objects #7212

warner · 2023-03-22T01:18:15Z

In #6696 (comment) I asked @gibson042 to add a test for a vat-upgrade -time kernel behavior which, it turns out, the kernel does not already do, as his test in #7170 (comment) discovered.

The scenario I was thinking about is:

vatA exports a non-durable object to vatB
vatB uses the object as a key in a WeakMap, then drops the Presence
- vatB emits a syscall.dropImport, but not a syscall.retireImport
vatA receives a dispatch.dropExport, but maintains an internal strong reference, so does not emit a syscall.retireExport
the object's refcount is now 0,1, which means "not reachable", but "yes recognizable" (by vatB)
now vatA is terminated or upgraded

I mistakenly believed that the kernel would then send a dispatch.retireImport into vatB. The reasoning is that:

vatB cannot reach the object
no other vat can reach the object
the only other source of a strong reference to the object is vatA, which is dead (or upgraded and the object wasn't durable)
since nobody will ever be able to produce a strong ref to the object, it can never become reachable again
so vatB (or any other recognizers) ought to delete their now-useless WeakMap entries

But, as test-abandon-export.js shows us, the kernel doesn't do that yet. The kernel object table is updated to show that the object has been abandoned (owner = null), but the refcount is unchanged. The entry will remain until all importing vats retire the import themselves, and that will only happen if their WeakMap gets deleted (probably never, they're usually long-lived).

So the feature to add here is for the kernel to notice when an object with refcount=0,* (i.e. unreachable) becomes orphaned. This can happen because the vat did syscall.abandonExport, or because the kernel's processUpgradeVat() abandoned the object on behalf of a vat being upgraded (#6696), or because the vat was terminated. The kernel should respond to this as if the previously-exporting vat did a syscall.retireExport: it should find all importing/recognizing vats and send them a dispatch.retireImport.

Since we're unlikely to implement this before deploying the kernel in the Vaults release, we must also be able to catch up on unretired unreachable orphans that were created and abandoned during the "deployed but not fixed" window. We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0. For each one, we should perform the retireExport chores. When complete, those entries will be deleted.

We don't know how many such orphaned objects there will be. If there might be a lot of them, we should consider building an upgrade handler which does a limited amount of work in each block (perhaps some number of computrons can be budgeted towards this chore, and once that budget is exceeded, we swtich over to normal deliveries). It should start at ko0 and work lexicographically upwards until all entries have been examined. The handler would need some persistent state to remember it's progress between blocks. The edge-triggered code would also be on the lookout for newly-orphaned or newly-unreachable objects, so if the little-at-a-time loop passed by an earlier object (orphaned but still reachable), then when that object becomes unreachable later, the GC actions should still fire.

The text was updated successfully, but these errors were encountered:

gibson042 · 2023-03-27T03:27:14Z

When fixing this issue, remember to update test assertions at

agoric-sdk/packages/SwingSet/test/upgrade/test-upgrade.js

Line 622 in f447e3a

// TODO: Expect and verify observer receipt of dispatch.retireExports

warner · 2023-12-21T21:16:31Z

We don't know how many such orphaned objects there will be

We still don't have many orphaned objects, but one of our remediation plans for #8401 is to terminate the price feed vats, and some of those have sent a lot of vrefs to v7-board, so those vrefs will enter this state (orphaned, weakly referenced by a remaining vat) when we trigger that fix.

As of this morning (21-dec-2023, just before agoric-upgrade-13 activated, in the agreeable follower's "run-17" dataset), v46-scaledPriceAuthority-ATOM has 81,382 exported-merely-recognizable objects, and a random sampling suggests that all are (weakly) imported by v7-board. Another 36,052 come from v69-scaledPriceAuthority-stATOM. Between the two of them, they account for 117,434 of the 118,478 weak imports of v7-board.

When we delete v46 or v69 (replacing them with a new price authority), these objects will fall into the category that needs this issue fixed to free those WeakMap entries in v7-board.

Chris-Hibbert · 2023-12-21T21:45:34Z

It should start at ko0 and work lexicographically upwards until all entries have been examined. The handler would need some persistent state to remember its progress between blocks.

My plan in remediating empty payments in the recoverySets of escrowPurses (#8686) is to delete up to N objects each cycle, and run each cycle over the entire recoverSet. Once one of the scans finds fewer than N objects, I'll set a flag to never rescan for that purse. As long as scanning for deletable objects is much faster/cheaper than doing the deletion, you don't need any extra record-keeping between incremental cycles.

mhofman · 2023-12-22T01:33:48Z

If there might be a lot of them, we should consider building an upgrade handler which does a limited amount of work in each block (perhaps some number of computrons can be budgeted towards this chore, and once that budget is exceeded, we swtich over to normal deliveries).

As you know I am not of fan of upgrade related work staggered after an upgrade has been "performed".

orphaned but still reachable), then when that object becomes unreachable later, the GC actions should still fire.

Shouldn't that be part of the fixed behavior to immediately retire these objects?

warner · 2023-12-23T17:18:55Z

I wrote a tool to scan our mainnet state for krefs in this state: as of block 13017175 (2023-12-21T12:50:08Z), there was only one. ko111, unowned, imported weakly by v9-zoe as o-54.

% gzcat run-1-chain.slog.gz |grep '\bko111\b'
{"type":"clist","crankNum":381,"mode":"export","vatID":"v12","kobj":"ko111","vobj":"o+d10/1","time":1687198304.521085,"monotime":83.21812876300001}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":105,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["o+d11/1","o+d10/1"]},"result":"p+7"}],"time":1687198304.521188,"monotime":83.2182318500001}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":111,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["o-57","o+d10/1"]},"result":"p+9"}],"time":1687198304.522767,"monotime":83.21981093399971}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":115,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["o+d10/1"]},"result":"p+10"}],"time":1687198304.523542,"monotime":83.220585922}
{"type":"crank-start","crankType":"routing","crankNum":382,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}},"time":1687198304.525159,"monotime":83.22220272699977}
{"type":"crank-start","crankType":"routing","crankNum":384,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}},"time":1687198304.525452,"monotime":83.22249635499995}
{"type":"crank-start","crankType":"routing","crankNum":385,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}},"time":1687198304.525559,"monotime":83.22260283599981}
{"type":"crank-start","crankType":"delivery","crankNum":386,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}},"time":1687198304.525675,"monotime":83.22271906200005}
{"type":"clist","crankNum":386,"mode":"import","vatID":"v9","kobj":"ko111","vobj":"o-54","time":1687198304.525802,"monotime":83.22284589800006}
{"type":"deliver","crankNum":386,"vatID":"v9","deliveryNum":21,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["o-53","o-54"]},"result":"p-72"}],"time":1687198304.525892,"monotime":83.22293607500009}
{"type":"crank-start","crankType":"delivery","crankNum":390,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}},"time":1687198304.550052,"monotime":83.24709620600008}
{"type":"deliver","crankNum":390,"vatID":"v9","deliveryNum":23,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["o+d10/1","o-54"]},"result":"p-74"}],"time":1687198304.550221,"monotime":83.24726528899977}
{"type":"crank-start","crankType":"delivery","crankNum":392,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}},"time":1687198304.555307,"monotime":83.25235097000002}
{"type":"deliver","crankNum":392,"vatID":"v9","deliveryNum":24,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["o-54"]},"result":"p-75"}],"time":1687198304.555461,"monotime":83.25250455599976}
{"type":"syscall","crankNum":2968,"vatID":"v9","deliveryNum":205,"syscallNum":791,"replay":false,"ksc":["dropImports",["ko223","ko262","ko111","ko114","ko122","ko162","ko203","ko205","ko208","ko211","ko214","ko217","ko220"]],"vsc":["dropImports",["o-101","o-114","o-54","o-55","o-59","o-80","o-81","o-83","o-86","o-89","o-92","o-95","o-98"]],"time":1687198333.106308,"monotime":111.80335262499983}

Inside v9-zoe, it appears as a weak key of zoe's seatHandleToZoeSeatAdmin WeakMapStore (vc.24), whose value is a zoeSeatAdmin.

% grep '\bo-54\b' v9.vs
v9.vs.vc.24.r0000000001:o-54|{"body":"#\"$0.Alleged: ZoeSeatKit zoeSeatAdmin\"","slots":["o+d31/1:1"]}
v9.vs.vc.24.|o-54|1
v9.vs.vom.ir.o-54|24|1

It was created by v12 (which got terminated early) as the SeatHandle of a makeNoEscrowSeat message sent to v9-zoe.

v12 was zcf-centralSupply-centralSupply, which served its purpose during bootstrap, and was terminated 83 seconds into the vaults release:

{"type":"terminate","vatID":"v12","shouldReject":false,"info":{"body":"#\"payment retrieved\"","slots":[]},"time":1687198304.605094,"monotime":83.30213865200011}

Since there's only one (until we kill the price-feed vats), we could survive not doing the scan-for-old-cases cleanup, at the cost of a few zoe objects being kept around forever.

Doing that scan with our current swing-store is expensive, because the only way to find these abandoned+unreachable krefs is to scan the entire kernel object table, accumulating two rows at a time (.owner and .refCounts), finding the krefs where the rows meet our criteria (.owner is missing, .refCounts starts with 0,), accumulating those krefs into a list, then retiring them. And, retiring a kref requires us to scan through the c-lists of all vats to find the ones that need to be notified (we don't have an index in that direction, #3223), making the cost O(numVats * numKrefs).

We might want to defer doing this cleanup until we've improved the way we store the kernel object table and c-lists (#6677). If we had an efficient query for all unowned objects, we could limit the scan to just them, and express it as something like SELECT kref FROM kernelObjects WHERE owner IS NULL AND REACHABLE = 0, this would be a lot faster. Likewise, our fix for #3223 would be SELECT vatID,vref FROM clists WHERE kref=?.

If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so as far as the rest of the world is concerned, the object has been retired. And because the owning vat can't retire it by itself, the kernel needs to do the retirement. TODO: do we handle entering this state from the other direction: when an orphaned object becomes unreachable? `processRefcounts()` has something but I'm not sure it's sufficient. closes #7212

warner · 2023-12-24T04:48:39Z

I think there are two cases to consider. The PR #8695 I just pushed only handles one of them.

case 1: an object is merely-recognizable, and then it gets abandoned by its owner (subcase A: the owner vat is terminated, subcase B: the owner vat calls syscall.abandonExports(), subcase C: the owner vat is upgraded and the vref was not durable)
case 2: an object is abandoned by its owner (all three subcases), then it ceases to be reachable (because the last importing vat did a syscall.dropImports(), or was terminated)

The PR handles case 1, which involves code in kernelKeeper.orphanKernelObjects() (which is called from kernel.js processUpgradeVat for subcase C, kernelKeeper.js cleanupAfterTerminatedVat for subcase A, and kernelSyscall.js abandonExports for subcase B). That function is changed to look for unreachable koids and submit them to retireKernelObjects().

But we still need to handle case 2. I think that wants to get handled in processRefcounts, in the clause where reachable === 0 but in the else clause of if (ownerVatID). Currently we handle the recognizable === 0 case, but we need to do something else if recognizable !== 0.

warner · 2023-12-24T19:08:48Z

Let's see, the full state space (as observed by processRefcounts() is basically:

object is owned or orphaned (because of termination, upgrade, or abandonment), times
- if owned, the owning vat knows it to be reachable or merely-recognizable
object is reachable, merely-recognizable, or unreferenced
- (unreferenced is temporary, GC actions will soon delete the kernel-object table entry)

stateDiagram-v2
  state "ORR: owned\n known-reachable\n reachable\n GCA: none" as ORR
  state "ORM: owned\n known-reachable\n merely-recognizable\n GCA: dropExport" as ORM
  state "ORN: owned\n known-reachable\n unreferenced\n GCA: dropExport\n GCA: retireExport" as ORN
  state "OMM: owned\n known-merely-recognizable\n merely-recognizable\n GCA: none" as OMM
  state "OMN: owned\n known-merely-recognizable\n unreferenced\n GCA: retireExport" as OMN
  state "?ONM: owned\n known-unreferenced\n merely-recognizable\n GCA: retireImport" as ONM
  state "PR orPhaned\n reachable\n GCA: none" as PR
  state "PM orPhaned\n merely-recognizable\n GCA: (TODO synth s.retireExports)\n GCA.retireImport\n delete" as PM
  state "deleted" as d
  ORR --> ORM : importer s.dropImports
  ORM --> ORR : re-import
  ORM --> ORN : importer s.retireImports
  ORM --> OMM : d.dropExports
  OMM --> OMN : importer s.retireImports
  OMM --> d : exporter s.retireExports\n\n GCA.retireImport\n delete
  OMM --> ORR : re-export
  OMM --> d : orphaned\n\n remove-owner\n GCA.retireImport\n delete
  ORN --> OMN : d.dropExports
  OMN --> d : d.retireExports\n\n delete
  ORR --> PR : orphaned\n\n remove-owner
  PR --> PM : importer s.dropImports
  PM --> d
  ko111 --> PM
  ? --> ONM

mhofman · 2024-05-08T20:32:54Z

now vatA is terminated or upgraded

I mistakenly believed that the kernel would then send a dispatch.retireImport into vatB. The reasoning is that:

vatB cannot reach the object

no other vat can reach the object

the only other source of a strong reference to the object is vatA, which is dead (or upgraded and the object wasn't durable)

I believe this is a wrong assumption. How does the kernel know the object wasn't durable? Whoever is in a position to discover this is the party responsible for generating a retireExport (or abandonExport really) on behalf of vatA. In #9338 I argue that the vat should clean up after its old incarnation after an upgrade (possibly with a backstop for non cooperating vats like #7170). In the case of a termination, the kernel can unilaterally abandon any exports the terminated vat had made.

Edit: I missed a step. I didn't realize that abandoning is already what is happening, it's just that the kernel doesn't consider an abandonment the same as if the vat had retired its export. Yeah that's weird.

But now I'm confused by the following:

The kernel object table is updated to show that the object has been abandoned (owner = null), but the refcount is unchanged.

We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0.

I'm confused, is the exporter included in the reachable refcount? In that case, wouldn't these abandoned but not retired export have a reachable count of 1 since we didn't decref? Can we always assume that the exporter accounted for one of the ref?

mhofman · 2024-05-08T20:54:02Z

We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0. For each one, we should perform the retireExport chores. When complete, those entries will be deleted.

This took me a minute to process, but it might be important to highlight that a vat abandoning an export does not mean the object should no longer be referenced by anyone. While sending a message to the object will splat, other vats may still share the reference and use it for its identity. As such the kernel cannot instruct the vats to forget about the reference unless the reference is unreachable by anyone.

warner · 2024-06-17T18:29:01Z

I'm confused, is the exporter included in the reachable refcount? In that case, wouldn't these abandoned but not retired export have a reachable count of 1 since we didn't decref? Can we always assume that the exporter accounted for one of the ref?

Nope, the exporter doesn't get a refcount (of either flavor), and processRefCounts triggers when the counts reach zero. (If the export counted, it would need to trigger when the count reached one).

The design is tuned to minimize space, potentially at the cost of increased churn. One consequence is that a single vref may have multiple (non-overlapping) krefs assigned to it over the lifetime of the vat. If vatA exports vref o+1, it gets assigned ko10, it gets imported and forgotten and retired, we send a dispatch.retireExports(o+1) into vatA, and as that is delivered, we delete ko10 from the kernel tables and remove o+1 <-> ko10 from the vatA clist. At that point, o+1 might still exist inside vatA, but the kernel doesn't know about it, and if/when vatA re-exports it, the kernel will assign it a new kref (maybe ko11).

The alternative would have been to have the kernel maintain the c-list entry until the exporting vat retired the vref, which is basically equivalent to having the kernel maintain its own recognizable count. For long-lived single-use objects, that would consume c-list and kernel-object-table space even for objects that are never used anymore. But it would improve the churn, so long-lived multiple-use objects which are dropped entirely by their importers (remembering that BOYD doesn't happen instantly, which might help reduce the churn, by extending the lifetime of the imports, perhaps enough to overlap those multiple uses).

I haven't done the analysis to determine how commonly we experience that churn. The tool would want to scan through the slogs and look at both the KernelDeliveryObject and VatDeliveryObject (also the syscall object pairs) and build a list of kref/vref pairs in a database. Then, after processing everything, count how many unique krefs were seen for each vref. This needs slogs, because the transcripts themselves only have vrefs. We could build a more complicated tool that only used the transcripts, by doing a stateful thing where we deduce the current contents of the c-lists (populate when the vref first appears, remove when a retire appears, assign unique pseudo-krefs at population time (which would each map to a real kref, except that this tool never sees the real krefs), and then do the same uniqueness processing. Actually, we could simplify that: just count how many times a retire appears for each vref.

This took me a minute to process, but it might be important to highlight that a vat abandoning an export does not mean the object should no longer be referenced by anyone. While sending a message to the object will splat, other vats may still share the reference and use it for its identity. As such the kernel cannot instruct the vats to forget about the reference unless the reference is unreachable by anyone.

Right, syscall.abandonExport causes the owner to be nulled out, but it doesn't affect the reachable refcount. We wind up with two cases:

reachable > 0: other vats can still emit strong references, and we won't retire the object until the last one goes away
reachable === 0: no vat can emit a strong reference (at best they can merely recognize it), so we should retire the object now

Abandoning an unreachable export is the trigger: vats can't send messages to it, nor can they share a reference, because all they've got is a WeakMap key, and they can't reach those.

warner · 2024-06-18T22:48:26Z

Ok I think I can simplify this into two pieces. The first is our rule that we push a kref onto maybeFreeKrefs whenever we decrement a refcount (either reachable or recognizable) and it touches zero, or (and this is new) when we clear the .owner of the kref (either by explicit syscall.abandonExport, the upgrade-time non-durable -export check, or the maybe-slow termination/deletion scan). That will make the end-of-crank processRefcounts() look at the kref.

The second is a table of checks/actions that processRefcounts() performs. As a reminder, our rule is that processRefcounts is safe to call too much: the system would be correct, just ridiculously inefficient, if we added every single kref to maybeFreeKrefs on every crank, and let processRefcounts rule out the ones that didn't need to be there. That reduces the refcount-changing methods' responsibility down to "add any kref that might need checking", and gives processRefcounts the responsibility of "only take action on krefs that actually need it".

When processRefcounts looks at each kref, it checks the koNN.owner and koNN.refcount = (reachableCount, recognizableCount) values, and it will classify it into one of the following six categories:

|                          | reachable (>=1,>=1) | recognizable (0,>=1) | neither (0,0)      |
|--------------------------+---------------------+----------------------+--------------------|
| owned (koNN.owner = vNN) | A: nothing          | B: d.dropExports     | C: d.retireExports |
| abandoned (= null)       | D: nothing          | E: d.retireImports   | F: delete          |

Case A: The kref is still reachable, so do nothing. This happens when a break-before-make handoff occurs, like decrementing the refcount as we take a message off the run-queue, then incrementing it as we translate it for delivery and add it to the receiving vat's c-list, so the refcount bounces off 0 briefly.

Case B: The kref is unreachable by other vats, but they can still recognize it. In this case, processRefcounts() needs to look at the exporting vat's c-list entry, which has an extra flag that says whether the export is reachable or recognizable (vNN.c.$kref is $R $vref, where R is either "R" for reachable or "_" for merely-recognizable). The code names this flag isReachable. If isReachable, we add a gcAction to deliver dispatch.dropExports() into the owning vat. The subsequent delivery (which bails if it finds the clist is no longer in that state) will clear the "R" flag.

If the kref is somehow re-added to maybeFreeKrefs and processRefcounts() is run again before the dispatch.dropExports() happens, we'll add a redundant gcAction, but since gcActions are held in a set, this won't cause multiple deliveries.

Case C: The kref is neither reachable nor recognizable by other vats, but it is still being exported. If isReachable then we need to add both dispatch.dropExports and dispatch.retireExports gcActions (we can hit this case when both reachable and recognizable refcounts are dropped on the same crank, which is pretty common, and happens during any BOYD when a vat wasn't using a WeakMap). If isReachable is false, then we only need to add dispatch.retireExports. As before, the gcAction delivery code will double-check that the calls are still appropriate, so we can tolerate changes (re-exports) that occur between processRefcounts() and the later delivery.

When dispatch.retireExports is delivered, the kernel will call kernelKeeper.deleteKernelObject(), which deletes both koNN.owner and koNN.refcounts. Then the translation call will delete the exporting vat's c-list entries. Combined, this removes the last of the kernel's knowledge of the kref. If/when the vat re-exports their vref, the kernel will assign it a new kref.

TODO: when do we enqueue dispatch.retireImports to the remaining recognizing vats? ANSWER: for case C it's irrelevant, all the other vats have already done syscall.retireImports. But in general, the dispatch.retireImports are enqueued by kernelKeeper.retireKernelObjects(), which is either called by the living vat when it does syscall.retireExports, or (will be done) by processRefCounts when it hits Case E.

Case D: The vat is either terminated, or the vref was non-durable and was abandoned by an upgrade. But, other vats can still reach it. This is fine, we don't need to tell anybody anything. The former owning vat is either dead or doesn't care. The importing vats keep importing it: any messages they send to it will go splat, but they can continue to tell each other about the object, and they can use it in WeakMaps. Nothing changes until the object becomes unreachable.

Case E: Like above, the vat is dead/upgraded, but other vats can merely recognize the object, not reach it. We can reach here in one of two ways: Case D plus a reachable-count decref, or case B plus a vat termination/upgrade. In either case, we want to add a dispatch.retireImports gcAction, which will tell all importing vats that the object has been retired, so they can drop their WeakMap entries.

Case F: No vat knows about the kref. We might get into this state if the vat was terminated/upgraded while the state-C gcActions were still queued, maybe. All we need to do is delete koNN.refcounts, to finish forgetting about the kref.

warner · 2024-06-25T00:28:39Z

Note to self, the TODO test that we're trying to fix is at

agoric-sdk/packages/SwingSet/test/upgrade/upgrade.test.js

Lines 710 to 722 in d941b39

    
           // TODO: Expect and verify observer receipt of dispatch.retireExports 
        
           // and corresponding removal of weakKref ref counts. 
        
           // https://github.com/Agoric/agoric-sdk/issues/7212 
        
           t.is( 
        
             kvStore.get(`${weakKref}.refCount`), 
        
             '0,1', 
        
             'weak observation of abandoned object retains ref counts before #7212', 
        
           ); 
        
           // t.is( 
        
           //   kvStore.get(`${weakKref}.refCount`), 
        
           //   undefined, 
        
           //   'unreachable abandoned object is forgotten', 
        
           // );

. I'm adding more tests to exercise the full set of states (to be named gc-kernel-orphan.test), which are currently failing for both syscall.abandonExports() and terminateVat(), happening either before or after an importing vat uses syscall.dropImports() to drop down to the merely-recognizable state. My new test is not exercising the upgrade/abandon-non-durables case.

If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so as far as the rest of the world is concerned, the object has been retired. And because the owning vat can't retire it by itself, the kernel needs to do the retirement. TODO: do we handle entering this state from the other direction: when an orphaned object becomes unreachable? `processRefcounts()` has something but I'm not sure it's sufficient. closes #7212

This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.

If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use orphanKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's orphanKernelObjects() will get processed promptly. Change getObjectRefCount tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212

This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.

If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212

warner · 2024-06-27T21:27:24Z

remediation idea: add a controller-visible API that accepts krefs and submits them to maybeFreeKrefs, then does enough of a crank to trigger processRefcounts(). Then add a new cosmos x/swingset txn type that can call it. Then someone on mainnet can pay a tiny txn fee and get ko111 retired.

This sort of API would be a comfortable authority to expose: you can't hurt anything by checking, the worst you can do is waste some time (processRefcounts() notices that reachable > 0 and returns without doing anything), and you have to pay for it anyways.

We've talked about this sort of approach for more expensive tasks too, like have external code identify alleged reference cycles, then submit a txn to tell the kernel to check on it. Some problems are more amenable to the approach than others: cross-vat reference cycles don't expose enough information to the kernel to let it confirm that dropping all edges at the same time would bring the refcounts to zero (at least until we implement @mhofman 's ideas about revealing the vat-internal edges to the kernel). But a cycle that only went through e.g. promise resolutions could be dealt with.