Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel should retire abandoned non-reachable objects #7212

Closed
warner opened this issue Mar 22, 2023 · 13 comments · Fixed by #8695
Closed

kernel should retire abandoned non-reachable objects #7212

warner opened this issue Mar 22, 2023 · 13 comments · Fixed by #8695
Assignees
Labels
SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Mar 22, 2023

In #6696 (comment) I asked @gibson042 to add a test for a vat-upgrade -time kernel behavior which, it turns out, the kernel does not already do, as his test in #7170 (comment) discovered.

The scenario I was thinking about is:

  • vatA exports a non-durable object to vatB
  • vatB uses the object as a key in a WeakMap, then drops the Presence
    • vatB emits a syscall.dropImport, but not a syscall.retireImport
  • vatA receives a dispatch.dropExport, but maintains an internal strong reference, so does not emit a syscall.retireExport
  • the object's refcount is now 0,1, which means "not reachable", but "yes recognizable" (by vatB)
  • now vatA is terminated or upgraded

I mistakenly believed that the kernel would then send a dispatch.retireImport into vatB. The reasoning is that:

  • vatB cannot reach the object
  • no other vat can reach the object
  • the only other source of a strong reference to the object is vatA, which is dead (or upgraded and the object wasn't durable)
  • since nobody will ever be able to produce a strong ref to the object, it can never become reachable again
  • so vatB (or any other recognizers) ought to delete their now-useless WeakMap entries

But, as test-abandon-export.js shows us, the kernel doesn't do that yet. The kernel object table is updated to show that the object has been abandoned (owner = null), but the refcount is unchanged. The entry will remain until all importing vats retire the import themselves, and that will only happen if their WeakMap gets deleted (probably never, they're usually long-lived).

So the feature to add here is for the kernel to notice when an object with refcount=0,* (i.e. unreachable) becomes orphaned. This can happen because the vat did syscall.abandonExport, or because the kernel's processUpgradeVat() abandoned the object on behalf of a vat being upgraded (#6696), or because the vat was terminated. The kernel should respond to this as if the previously-exporting vat did a syscall.retireExport: it should find all importing/recognizing vats and send them a dispatch.retireImport.

Since we're unlikely to implement this before deploying the kernel in the Vaults release, we must also be able to catch up on unretired unreachable orphans that were created and abandoned during the "deployed but not fixed" window. We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0. For each one, we should perform the retireExport chores. When complete, those entries will be deleted.

We don't know how many such orphaned objects there will be. If there might be a lot of them, we should consider building an upgrade handler which does a limited amount of work in each block (perhaps some number of computrons can be budgeted towards this chore, and once that budget is exceeded, we swtich over to normal deliveries). It should start at ko0 and work lexicographically upwards until all entries have been examined. The handler would need some persistent state to remember it's progress between blocks. The edge-triggered code would also be on the lookout for newly-orphaned or newly-unreachable objects, so if the little-at-a-time loop passed by an earlier object (orphaned but still reachable), then when that object becomes unreachable later, the GC actions should still fire.

@gibson042
Copy link
Member

gibson042 commented Mar 27, 2023

When fixing this issue, remember to update test assertions at

// TODO: Expect and verify observer receipt of dispatch.retireExports

@warner
Copy link
Member Author

warner commented Dec 21, 2023

We don't know how many such orphaned objects there will be

We still don't have many orphaned objects, but one of our remediation plans for #8401 is to terminate the price feed vats, and some of those have sent a lot of vrefs to v7-board, so those vrefs will enter this state (orphaned, weakly referenced by a remaining vat) when we trigger that fix.

As of this morning (21-dec-2023, just before agoric-upgrade-13 activated, in the agreeable follower's "run-17" dataset), v46-scaledPriceAuthority-ATOM has 81,382 exported-merely-recognizable objects, and a random sampling suggests that all are (weakly) imported by v7-board. Another 36,052 come from v69-scaledPriceAuthority-stATOM. Between the two of them, they account for 117,434 of the 118,478 weak imports of v7-board.

When we delete v46 or v69 (replacing them with a new price authority), these objects will fall into the category that needs this issue fixed to free those WeakMap entries in v7-board.

@Chris-Hibbert
Copy link
Contributor

It should start at ko0 and work lexicographically upwards until all entries have been examined. The handler would need some persistent state to remember its progress between blocks.

My plan in remediating empty payments in the recoverySets of escrowPurses (#8686) is to delete up to N objects each cycle, and run each cycle over the entire recoverSet. Once one of the scans finds fewer than N objects, I'll set a flag to never rescan for that purse. As long as scanning for deletable objects is much faster/cheaper than doing the deletion, you don't need any extra record-keeping between incremental cycles.

@mhofman
Copy link
Member

mhofman commented Dec 22, 2023

If there might be a lot of them, we should consider building an upgrade handler which does a limited amount of work in each block (perhaps some number of computrons can be budgeted towards this chore, and once that budget is exceeded, we swtich over to normal deliveries).

As you know I am not of fan of upgrade related work staggered after an upgrade has been "performed".

orphaned but still reachable), then when that object becomes unreachable later, the GC actions should still fire.

Shouldn't that be part of the fixed behavior to immediately retire these objects?

@warner
Copy link
Member Author

warner commented Dec 23, 2023

I wrote a tool to scan our mainnet state for krefs in this state: as of block 13017175 (2023-12-21T12:50:08Z), there was only one. ko111, unowned, imported weakly by v9-zoe as o-54.

% gzcat run-1-chain.slog.gz |grep '\bko111\b'
{"type":"clist","crankNum":381,"mode":"export","vatID":"v12","kobj":"ko111","vobj":"o+d10/1","time":1687198304.521085,"monotime":83.21812876300001}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":105,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["o+d11/1","o+d10/1"]},"result":"p+7"}],"time":1687198304.521188,"monotime":83.2182318500001}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":111,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["o-57","o+d10/1"]},"result":"p+9"}],"time":1687198304.522767,"monotime":83.21981093399971}
{"type":"syscall","crankNum":381,"vatID":"v12","deliveryNum":7,"syscallNum":115,"replay":false,"ksc":["send","ko107",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}],"vsc":["send","o-52",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["o+d10/1"]},"result":"p+10"}],"time":1687198304.523542,"monotime":83.220585922}
{"type":"crank-start","crankType":"routing","crankNum":382,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}},"time":1687198304.525159,"monotime":83.22220272699977}
{"type":"crank-start","crankType":"routing","crankNum":384,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}},"time":1687198304.525452,"monotime":83.22249635499995}
{"type":"crank-start","crankType":"routing","crankNum":385,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}},"time":1687198304.525559,"monotime":83.22260283599981}
{"type":"crank-start","crankType":"delivery","crankNum":386,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}},"time":1687198304.525675,"monotime":83.22271906200005}
{"type":"clist","crankNum":386,"mode":"import","vatID":"v9","kobj":"ko111","vobj":"o-54","time":1687198304.525802,"monotime":83.22284589800006}
{"type":"deliver","crankNum":386,"vatID":"v9","deliveryNum":21,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["ko110","ko111"]},"result":"kp134"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"makeNoEscrowSeat\",[{},{\"exit\":{\"onDemand\":null},\"give\":{},\"want\":{}},\"$0.Alleged: ExitObject\",\"$1.Alleged: SeatHandle\"]]","slots":["o-53","o-54"]},"result":"p-72"}],"time":1687198304.525892,"monotime":83.22293607500009}
{"type":"crank-start","crankType":"delivery","crankNum":390,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}},"time":1687198304.550052,"monotime":83.24709620600008}
{"type":"deliver","crankNum":390,"vatID":"v9","deliveryNum":23,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["ko80","ko111"]},"result":"kp136"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"replaceAllocations\",[[{\"allocation\":{\"Bootstrap\":{\"brand\":\"$0.Alleged: IST brand\",\"value\":\"+1419799859952\"}},\"seatHandle\":\"$1.Alleged: SeatHandle\"}]]]","slots":["o+d10/1","o-54"]},"result":"p-74"}],"time":1687198304.550221,"monotime":83.24726528899977}
{"type":"crank-start","crankType":"delivery","crankNum":392,"message":{"type":"send","target":"ko107","msg":{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}},"time":1687198304.555307,"monotime":83.25235097000002}
{"type":"deliver","crankNum":392,"vatID":"v9","deliveryNum":24,"replay":false,"kd":["message","ko107",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["ko111"]},"result":"kp137"}],"vd":["message","o+d34/1",{"methargs":{"body":"#[\"exitSeat\",[\"$0.Alleged: SeatHandle\",\"#undefined\"]]","slots":["o-54"]},"result":"p-75"}],"time":1687198304.555461,"monotime":83.25250455599976}
{"type":"syscall","crankNum":2968,"vatID":"v9","deliveryNum":205,"syscallNum":791,"replay":false,"ksc":["dropImports",["ko223","ko262","ko111","ko114","ko122","ko162","ko203","ko205","ko208","ko211","ko214","ko217","ko220"]],"vsc":["dropImports",["o-101","o-114","o-54","o-55","o-59","o-80","o-81","o-83","o-86","o-89","o-92","o-95","o-98"]],"time":1687198333.106308,"monotime":111.80335262499983}

Inside v9-zoe, it appears as a weak key of zoe's seatHandleToZoeSeatAdmin WeakMapStore (vc.24), whose value is a zoeSeatAdmin.

% grep '\bo-54\b' v9.vs
v9.vs.vc.24.r0000000001:o-54|{"body":"#\"$0.Alleged: ZoeSeatKit zoeSeatAdmin\"","slots":["o+d31/1:1"]}
v9.vs.vc.24.|o-54|1
v9.vs.vom.ir.o-54|24|1

It was created by v12 (which got terminated early) as the SeatHandle of a makeNoEscrowSeat message sent to v9-zoe.

v12 was zcf-centralSupply-centralSupply, which served its purpose during bootstrap, and was terminated 83 seconds into the vaults release:

{"type":"terminate","vatID":"v12","shouldReject":false,"info":{"body":"#\"payment retrieved\"","slots":[]},"time":1687198304.605094,"monotime":83.30213865200011}

Since there's only one (until we kill the price-feed vats), we could survive not doing the scan-for-old-cases cleanup, at the cost of a few zoe objects being kept around forever.

Doing that scan with our current swing-store is expensive, because the only way to find these abandoned+unreachable krefs is to scan the entire kernel object table, accumulating two rows at a time (.owner and .refCounts), finding the krefs where the rows meet our criteria (.owner is missing, .refCounts starts with 0,), accumulating those krefs into a list, then retiring them. And, retiring a kref requires us to scan through the c-lists of all vats to find the ones that need to be notified (we don't have an index in that direction, #3223), making the cost O(numVats * numKrefs).

We might want to defer doing this cleanup until we've improved the way we store the kernel object table and c-lists (#6677). If we had an efficient query for all unowned objects, we could limit the scan to just them, and express it as something like SELECT kref FROM kernelObjects WHERE owner IS NULL AND REACHABLE = 0, this would be a lot faster. Likewise, our fix for #3223 would be SELECT vatID,vref FROM clists WHERE kref=?.

warner added a commit that referenced this issue Dec 24, 2023
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so as far as the rest of the world is concerned, the
object has been retired. And because the owning vat can't retire it by
itself, the kernel needs to do the retirement.

TODO: do we handle entering this state from the other direction: when
an orphaned object becomes unreachable? `processRefcounts()` has
something but I'm not sure it's sufficient.

closes #7212
@warner
Copy link
Member Author

warner commented Dec 24, 2023

I think there are two cases to consider. The PR #8695 I just pushed only handles one of them.

  • case 1: an object is merely-recognizable, and then it gets abandoned by its owner (subcase A: the owner vat is terminated, subcase B: the owner vat calls syscall.abandonExports(), subcase C: the owner vat is upgraded and the vref was not durable)
  • case 2: an object is abandoned by its owner (all three subcases), then it ceases to be reachable (because the last importing vat did a syscall.dropImports(), or was terminated)

The PR handles case 1, which involves code in kernelKeeper.orphanKernelObjects() (which is called from kernel.js processUpgradeVat for subcase C, kernelKeeper.js cleanupAfterTerminatedVat for subcase A, and kernelSyscall.js abandonExports for subcase B). That function is changed to look for unreachable koids and submit them to retireKernelObjects().

But we still need to handle case 2. I think that wants to get handled in processRefcounts, in the clause where reachable === 0 but in the else clause of if (ownerVatID). Currently we handle the recognizable === 0 case, but we need to do something else if recognizable !== 0.

@warner
Copy link
Member Author

warner commented Dec 24, 2023

Let's see, the full state space (as observed by processRefcounts() is basically:

  • object is owned or orphaned (because of termination, upgrade, or abandonment), times
    • if owned, the owning vat knows it to be reachable or merely-recognizable
  • object is reachable, merely-recognizable, or unreferenced
    • (unreferenced is temporary, GC actions will soon delete the kernel-object table entry)
stateDiagram-v2
  state "ORR: owned\n known-reachable\n reachable\n GCA: none" as ORR
  state "ORM: owned\n known-reachable\n merely-recognizable\n GCA: dropExport" as ORM
  state "ORN: owned\n known-reachable\n unreferenced\n GCA: dropExport\n GCA: retireExport" as ORN
  state "OMM: owned\n known-merely-recognizable\n merely-recognizable\n GCA: none" as OMM
  state "OMN: owned\n known-merely-recognizable\n unreferenced\n GCA: retireExport" as OMN
  state "?ONM: owned\n known-unreferenced\n merely-recognizable\n GCA: retireImport" as ONM
  state "PR orPhaned\n reachable\n GCA: none" as PR
  state "PM orPhaned\n merely-recognizable\n GCA: (TODO synth s.retireExports)\n GCA.retireImport\n delete" as PM
  state "deleted" as d
  ORR --> ORM : importer s.dropImports
  ORM --> ORR : re-import
  ORM --> ORN : importer s.retireImports
  ORM --> OMM : d.dropExports
  OMM --> OMN : importer s.retireImports
  OMM --> d : exporter s.retireExports\n\n GCA.retireImport\n delete
  OMM --> ORR : re-export
  OMM --> d : orphaned\n\n remove-owner\n GCA.retireImport\n delete
  ORN --> OMN : d.dropExports
  OMN --> d : d.retireExports\n\n delete
  ORR --> PR : orphaned\n\n remove-owner
  PR --> PM : importer s.dropImports
  PM --> d
  ko111 --> PM
  ? --> ONM
Loading

@mhofman
Copy link
Member

mhofman commented May 8, 2024

  • now vatA is terminated or upgraded

I mistakenly believed that the kernel would then send a dispatch.retireImport into vatB. The reasoning is that:

  • vatB cannot reach the object
  • no other vat can reach the object
  • the only other source of a strong reference to the object is vatA, which is dead (or upgraded and the object wasn't durable)

I believe this is a wrong assumption. How does the kernel know the object wasn't durable? Whoever is in a position to discover this is the party responsible for generating a retireExport (or abandonExport really) on behalf of vatA. In #9338 I argue that the vat should clean up after its old incarnation after an upgrade (possibly with a backstop for non cooperating vats like #7170). In the case of a termination, the kernel can unilaterally abandon any exports the terminated vat had made.

Edit: I missed a step. I didn't realize that abandoning is already what is happening, it's just that the kernel doesn't consider an abandonment the same as if the vat had retired its export. Yeah that's weird.

But now I'm confused by the following:

The kernel object table is updated to show that the object has been abandoned (owner = null), but the refcount is unchanged.

We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0.

I'm confused, is the exporter included in the reachable refcount? In that case, wouldn't these abandoned but not retired export have a reachable count of 1 since we didn't decref? Can we always assume that the exporter accounted for one of the ref?

@mhofman
Copy link
Member

mhofman commented May 8, 2024

We'll create an upgrade handler that walks the kernel object table, looking for entries where owner = null and refCount.reachable = 0. For each one, we should perform the retireExport chores. When complete, those entries will be deleted.

This took me a minute to process, but it might be important to highlight that a vat abandoning an export does not mean the object should no longer be referenced by anyone. While sending a message to the object will splat, other vats may still share the reference and use it for its identity. As such the kernel cannot instruct the vats to forget about the reference unless the reference is unreachable by anyone.

@warner
Copy link
Member Author

warner commented Jun 17, 2024

I'm confused, is the exporter included in the reachable refcount? In that case, wouldn't these abandoned but not retired export have a reachable count of 1 since we didn't decref? Can we always assume that the exporter accounted for one of the ref?

Nope, the exporter doesn't get a refcount (of either flavor), and processRefCounts triggers when the counts reach zero. (If the export counted, it would need to trigger when the count reached one).

The design is tuned to minimize space, potentially at the cost of increased churn. One consequence is that a single vref may have multiple (non-overlapping) krefs assigned to it over the lifetime of the vat. If vatA exports vref o+1, it gets assigned ko10, it gets imported and forgotten and retired, we send a dispatch.retireExports(o+1) into vatA, and as that is delivered, we delete ko10 from the kernel tables and remove o+1 <-> ko10 from the vatA clist. At that point, o+1 might still exist inside vatA, but the kernel doesn't know about it, and if/when vatA re-exports it, the kernel will assign it a new kref (maybe ko11).

The alternative would have been to have the kernel maintain the c-list entry until the exporting vat retired the vref, which is basically equivalent to having the kernel maintain its own recognizable count. For long-lived single-use objects, that would consume c-list and kernel-object-table space even for objects that are never used anymore. But it would improve the churn, so long-lived multiple-use objects which are dropped entirely by their importers (remembering that BOYD doesn't happen instantly, which might help reduce the churn, by extending the lifetime of the imports, perhaps enough to overlap those multiple uses).

I haven't done the analysis to determine how commonly we experience that churn. The tool would want to scan through the slogs and look at both the KernelDeliveryObject and VatDeliveryObject (also the syscall object pairs) and build a list of kref/vref pairs in a database. Then, after processing everything, count how many unique krefs were seen for each vref. This needs slogs, because the transcripts themselves only have vrefs. We could build a more complicated tool that only used the transcripts, by doing a stateful thing where we deduce the current contents of the c-lists (populate when the vref first appears, remove when a retire appears, assign unique pseudo-krefs at population time (which would each map to a real kref, except that this tool never sees the real krefs), and then do the same uniqueness processing. Actually, we could simplify that: just count how many times a retire appears for each vref.

This took me a minute to process, but it might be important to highlight that a vat abandoning an export does not mean the object should no longer be referenced by anyone. While sending a message to the object will splat, other vats may still share the reference and use it for its identity. As such the kernel cannot instruct the vats to forget about the reference unless the reference is unreachable by anyone.

Right, syscall.abandonExport causes the owner to be nulled out, but it doesn't affect the reachable refcount. We wind up with two cases:

  • reachable > 0: other vats can still emit strong references, and we won't retire the object until the last one goes away
  • reachable === 0: no vat can emit a strong reference (at best they can merely recognize it), so we should retire the object now

Abandoning an unreachable export is the trigger: vats can't send messages to it, nor can they share a reference, because all they've got is a WeakMap key, and they can't reach those.

@warner
Copy link
Member Author

warner commented Jun 18, 2024

Ok I think I can simplify this into two pieces. The first is our rule that we push a kref onto maybeFreeKrefs whenever we decrement a refcount (either reachable or recognizable) and it touches zero, or (and this is new) when we clear the .owner of the kref (either by explicit syscall.abandonExport, the upgrade-time non-durable -export check, or the maybe-slow termination/deletion scan). That will make the end-of-crank processRefcounts() look at the kref.

The second is a table of checks/actions that processRefcounts() performs. As a reminder, our rule is that processRefcounts is safe to call too much: the system would be correct, just ridiculously inefficient, if we added every single kref to maybeFreeKrefs on every crank, and let processRefcounts rule out the ones that didn't need to be there. That reduces the refcount-changing methods' responsibility down to "add any kref that might need checking", and gives processRefcounts the responsibility of "only take action on krefs that actually need it".

When processRefcounts looks at each kref, it checks the koNN.owner and koNN.refcount = (reachableCount, recognizableCount) values, and it will classify it into one of the following six categories:

|                          | reachable (>=1,>=1) | recognizable (0,>=1) | neither (0,0)      |
|--------------------------+---------------------+----------------------+--------------------|
| owned (koNN.owner = vNN) | A: nothing          | B: d.dropExports     | C: d.retireExports |
| abandoned (= null)       | D: nothing          | E: d.retireImports   | F: delete          |

Case A: The kref is still reachable, so do nothing. This happens when a break-before-make handoff occurs, like decrementing the refcount as we take a message off the run-queue, then incrementing it as we translate it for delivery and add it to the receiving vat's c-list, so the refcount bounces off 0 briefly.

Case B: The kref is unreachable by other vats, but they can still recognize it. In this case, processRefcounts() needs to look at the exporting vat's c-list entry, which has an extra flag that says whether the export is reachable or recognizable (vNN.c.$kref is $R $vref, where R is either "R" for reachable or "_" for merely-recognizable). The code names this flag isReachable. If isReachable, we add a gcAction to deliver dispatch.dropExports() into the owning vat. The subsequent delivery (which bails if it finds the clist is no longer in that state) will clear the "R" flag.

If the kref is somehow re-added to maybeFreeKrefs and processRefcounts() is run again before the dispatch.dropExports() happens, we'll add a redundant gcAction, but since gcActions are held in a set, this won't cause multiple deliveries.

Case C: The kref is neither reachable nor recognizable by other vats, but it is still being exported. If isReachable then we need to add both dispatch.dropExports and dispatch.retireExports gcActions (we can hit this case when both reachable and recognizable refcounts are dropped on the same crank, which is pretty common, and happens during any BOYD when a vat wasn't using a WeakMap). If isReachable is false, then we only need to add dispatch.retireExports. As before, the gcAction delivery code will double-check that the calls are still appropriate, so we can tolerate changes (re-exports) that occur between processRefcounts() and the later delivery.

When dispatch.retireExports is delivered, the kernel will call kernelKeeper.deleteKernelObject(), which deletes both koNN.owner and koNN.refcounts. Then the translation call will delete the exporting vat's c-list entries. Combined, this removes the last of the kernel's knowledge of the kref. If/when the vat re-exports their vref, the kernel will assign it a new kref.

TODO: when do we enqueue dispatch.retireImports to the remaining recognizing vats? ANSWER: for case C it's irrelevant, all the other vats have already done syscall.retireImports. But in general, the dispatch.retireImports are enqueued by kernelKeeper.retireKernelObjects(), which is either called by the living vat when it does syscall.retireExports, or (will be done) by processRefCounts when it hits Case E.

Case D: The vat is either terminated, or the vref was non-durable and was abandoned by an upgrade. But, other vats can still reach it. This is fine, we don't need to tell anybody anything. The former owning vat is either dead or doesn't care. The importing vats keep importing it: any messages they send to it will go splat, but they can continue to tell each other about the object, and they can use it in WeakMaps. Nothing changes until the object becomes unreachable.

Case E: Like above, the vat is dead/upgraded, but other vats can merely recognize the object, not reach it. We can reach here in one of two ways: Case D plus a reachable-count decref, or case B plus a vat termination/upgrade. In either case, we want to add a dispatch.retireImports gcAction, which will tell all importing vats that the object has been retired, so they can drop their WeakMap entries.

Case F: No vat knows about the kref. We might get into this state if the vat was terminated/upgraded while the state-C gcActions were still queued, maybe. All we need to do is delete koNN.refcounts, to finish forgetting about the kref.

@warner
Copy link
Member Author

warner commented Jun 25, 2024

Note to self, the TODO test that we're trying to fix is at

// TODO: Expect and verify observer receipt of dispatch.retireExports
// and corresponding removal of weakKref ref counts.
// https://github.com/Agoric/agoric-sdk/issues/7212
t.is(
kvStore.get(`${weakKref}.refCount`),
'0,1',
'weak observation of abandoned object retains ref counts before #7212',
);
// t.is(
// kvStore.get(`${weakKref}.refCount`),
// undefined,
// 'unreachable abandoned object is forgotten',
// );
. I'm adding more tests to exercise the full set of states (to be named gc-kernel-orphan.test), which are currently failing for both syscall.abandonExports() and terminateVat(), happening either before or after an importing vat uses syscall.dropImports() to drop down to the merely-recognizable state. My new test is not exercising the upgrade/abandon-non-durables case.

warner added a commit that referenced this issue Jun 25, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so as far as the rest of the world is concerned, the
object has been retired. And because the owning vat can't retire it by
itself, the kernel needs to do the retirement.

TODO: do we handle entering this state from the other direction: when
an orphaned object becomes unreachable? `processRefcounts()` has
something but I'm not sure it's sufficient.

closes #7212
warner added a commit that referenced this issue Jun 26, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so as far as the rest of the world is concerned, the
object has been retired. And because the owning vat can't retire it by
itself, the kernel needs to do the retirement.

TODO: do we handle entering this state from the other direction: when
an orphaned object becomes unreachable? `processRefcounts()` has
something but I'm not sure it's sufficient.

closes #7212
warner added a commit that referenced this issue Jun 26, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jun 26, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
orphanKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's orphanKernelObjects() will get
processed promptly.

Change getObjectRefCount tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Jun 27, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jun 27, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
@warner
Copy link
Member Author

warner commented Jun 27, 2024

remediation idea: add a controller-visible API that accepts krefs and submits them to maybeFreeKrefs, then does enough of a crank to trigger processRefcounts(). Then add a new cosmos x/swingset txn type that can call it. Then someone on mainnet can pay a tiny txn fee and get ko111 retired.

This sort of API would be a comfortable authority to expose: you can't hurt anything by checking, the worst you can do is waste some time (processRefcounts() notices that reachable > 0 and returns without doing anything), and you have to pay for it anyways.

We've talked about this sort of approach for more expensive tasks too, like have external code identify alleged reference cycles, then submit a txn to tell the kernel to check on it. Some problems are more amenable to the approach than others: cross-vat reference cycles don't expose enough information to the kernel to let it confirm that dropping all edges at the same time would bring the refcounts to zero (at least until we implement @mhofman 's ideas about revealing the vat-internal edges to the kernel). But a cycle that only went through e.g. promise resolutions could be dealt with.

warner added a commit that referenced this issue Jul 2, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Jul 10, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jul 10, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Jul 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jul 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Jul 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jul 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Jul 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Jul 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 10, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 10, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 11, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 11, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

closes #7212
warner added a commit that referenced this issue Aug 12, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
warner added a commit that referenced this issue Aug 12, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

closes #7212
@mergify mergify bot closed this as completed in #8695 Aug 14, 2024
mergify bot pushed a commit that referenced this issue Aug 14, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
mergify bot pushed a commit that referenced this issue Aug 14, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

Also rename abandonKernelObject back to orphanKernelObject, the name
fits better now.

closes #7212
mergify bot added a commit that referenced this issue Aug 14, 2024
fix(swingset): retire unreachable orphans

If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat can't retire it by itself, the kernel needs to
do the retirement on its behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

Changes getObjectRefCount to tolerate deleted krefs (missing
`koNN.refCount`) by just returning 0,0. This fixes a potential kernel
panic in the new approach, when a kref is recognizable by one vat but
only reachable by a send-message on the run-queue, then becomes
unreachable as that message is delivered (the run-queue held the last
strong reference), if the target vat does syscall.exit during the
delivery. The decref pushes the kref onto maybeFreeKrefs, the
terminateVat retires the merely-recognizable now-orphaned kref, then
processRefcounts used getObjectRefCount() to grab the refcount for the
now-retired (and deleted) kref, which asserted that the koNN.refCount
key still existed, which didn't.

This occured in zoe - secondPriceAuction -- valid input , where the
contract did syscall.exit in response to a timer wake() message sent
to a single-use wakeObj.

closes #7212
kriskowal pushed a commit that referenced this issue Aug 27, 2024
This adds a new (failing) test of #7212, enhances some other tests to
cover the same thing, and uncomments a portion of upgrade.test.js
which was commented out when we discovered the bug.

These will only pass when the kernel properly retires unreachable
objects that have just been abandoned by their owning vat.

The new test (gc-kernel-orphan.test.js) also checks that vat
termination on the same crank that retires an object will not cause a
panic.
kriskowal pushed a commit that referenced this issue Aug 27, 2024
If a kernel object ("koid", the object subset of krefs) is
unreachable, and then becomes orphaned (either because the owning vat
was terminated, or called `syscall.abandonExports`, or was upgraded
and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk
about the object, so it can never become reachable again, which is
normally the point at which the owning vat would retire it. But
because the owning vat is dead, it can't retire the koid by itself,
the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into
processRefcounts(): when terminateVat or syscall.abandonExports use
abandonKernelObjects() to mark a kref as orphaned, it also adds the
kref to maybeFreeKrefs, and then processRefcounts() is responsible for
noticing the kref is both orphaned and unreachable, and then notifying
any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be
followed by a processRefcounts(), by virtue of either being called
from processDeliveryMessage (in the crankResults.terminate clause), or
from within a device invocation syscall (which only happens during a
delivery, so follows the same path). We need this to ensure that any
maybeFreeKrefs created by the cleanup's abandonKernelObjects() will
get processed promptly.

This also changes getObjectRefCount() to tolerate deleted
krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes
a potential kernel panic in the new approach, when a kref is
recognizable by one vat but only reachable by a send-message on the
run-queue, then becomes unreachable as that message is delivered (the
run-queue held the last strong reference), and if the target vat does
syscall.exit during the delivery. The decref pushes the kref onto
maybeFreeKrefs, the terminateVat retires the merely-recognizable
now-orphaned kref, then processRefcounts used getObjectRefCount() to
grab the refcount for the now-retired (and deleted) kref, which
asserted that the koNN.refCount key still existed, which didn't.

This occured in "zoe - secondPriceAuction -- valid input" unit test ,
where the contract did syscall.exit in response to a timer wake()
message sent to a single-use wakeObj.

Also rename abandonKernelObject back to orphanKernelObject, the name
fits better now.

closes #7212
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants