kernel should retire abandoned non-reachable objects #7212
When fixing this issue, remember to update test assertions at
|
We still don't have many orphaned objects, but one of our remediation plans for #8401 is to terminate the price feed vats, and some of those have sent a lot of vrefs to v7-board, so those vrefs will enter this state (orphaned, weakly referenced by a remaining vat) when we trigger that fix. As of this morning (21-dec-2023, just before When we delete v46 or v69 (replacing them with a new price authority), these objects will fall into the category that needs this issue fixed to free those WeakMap entries in v7-board. |
My plan in remediating empty payments in the recoverySets of escrowPurses (#8686) is to delete up to N objects each cycle, and run each cycle over the entire recoverySet. Once one of the scans finds fewer than N objects, I'll set a flag to never rescan for that purse. As long as scanning for deletable objects is much faster/cheaper than doing the deletion, you don't need any extra record-keeping between incremental cycles. |
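A minimal sketch of that incremental scheme, to make the "no record-keeping between cycles" point concrete. Everything here is hypothetical: `makeIncrementalCleaner`, the `Set`-based recovery set, and the `isDeletable` predicate are illustration-only names, not the purse code.

```javascript
// Hypothetical helper: names and data shapes are illustrative only.
function makeIncrementalCleaner(recoverySet, isDeletable, limit) {
  let finished = false; // once set, this set is never rescanned

  return function runCycle() {
    if (finished) return 0;
    // Scan from the start every cycle; no cursor is kept between cycles.
    const victims = [];
    for (const item of recoverySet) {
      if (isDeletable(item)) {
        victims.push(item);
        if (victims.length === limit) break;
      }
    }
    for (const item of victims) recoverySet.delete(item);
    // Fewer than `limit` hits means the scan reached the end of the set.
    if (victims.length < limit) finished = true;
    return victims.length;
  };
}

// Usage: delete at most 2 empty payments per cycle.
const payments = new Set([{ v: 0 }, { v: 5 }, { v: 0 }, { v: 0 }, { v: 7 }]);
const runCycle = makeIncrementalCleaner(payments, p => p.v === 0, 2);
const deletedPerCycle = [runCycle(), runCycle(), runCycle()];
```

The only state carried across cycles is the `finished` flag; the cost is that each cycle re-scans entries that were already found non-deletable, which is acceptable only if scanning is much cheaper than deleting.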
As you know, I am not a fan of upgrade-related work being staggered after an upgrade has been "performed".
Shouldn't immediately retiring these objects be part of the fixed behavior? |
I wrote a tool to scan our mainnet state for krefs in this state: as of block 13017175 (2023-12-21T12:50:08Z), there was only one.
Inside v9-zoe, it appears as a weak key of zoe's
It was created by v12 (which got terminated early) as the SeatHandle of a v12 was zcf-centralSupply-centralSupply, which served its purpose during bootstrap, and was terminated 83 seconds into the vaults release:
Since there's only one (until we kill the price-feed vats), we could survive not doing the scan-for-old-cases cleanup, at the cost of a few zoe objects being kept around forever. Doing that scan with our current swing-store is expensive, because the only way to find these abandoned+unreachable krefs is to scan the entire kernel object table, accumulating two rows at a time ( We might want to defer doing this cleanup until we've improved the way we store the kernel object table and c-lists (#6677). If we had an efficient query for all unowned objects, we could limit the scan to just them, and express it as something like |
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so as far as the rest of the world is concerned, the object has been retired. And because the owning vat can't retire it by itself, the kernel needs to do the retirement. TODO: do we handle entering this state from the other direction: when an orphaned object becomes unreachable? `processRefcounts()` has something but I'm not sure it's sufficient. closes #7212
I think there are two cases to consider. The PR #8695 I just pushed only handles one of them.
The PR handles case 1, which involves code in But we still need to handle case 2. I think that wants to get handled in |
Let's see, the full state space (as observed by
stateDiagram-v2
state "ORR: owned\n known-reachable\n reachable\n GCA: none" as ORR
state "ORM: owned\n known-reachable\n merely-recognizable\n GCA: dropExport" as ORM
state "ORN: owned\n known-reachable\n unreferenced\n GCA: dropExport\n GCA: retireExport" as ORN
state "OMM: owned\n known-merely-recognizable\n merely-recognizable\n GCA: none" as OMM
state "OMN: owned\n known-merely-recognizable\n unreferenced\n GCA: retireExport" as OMN
state "?ONM: owned\n known-unreferenced\n merely-recognizable\n GCA: retireImport" as ONM
state "PR orPhaned\n reachable\n GCA: none" as PR
state "PM orPhaned\n merely-recognizable\n GCA: (TODO synth s.retireExports)\n GCA.retireImport\n delete" as PM
state "deleted" as d
ORR --> ORM : importer s.dropImports
ORM --> ORR : re-import
ORM --> ORN : importer s.retireImports
ORM --> OMM : d.dropExports
OMM --> OMN : importer s.retireImports
OMM --> d : exporter s.retireExports\n\n GCA.retireImport\n delete
OMM --> ORR : re-export
OMM --> d : orphaned\n\n remove-owner\n GCA.retireImport\n delete
ORN --> OMN : d.dropExports
OMN --> d : d.retireExports\n\n delete
ORR --> PR : orphaned\n\n remove-owner
PR --> PM : importer s.dropImports
PM --> d
ko111 --> PM
? --> ONM
|
I believe this is a wrong assumption. How does the kernel know the object wasn't durable? Whoever is in a position to discover this is the party responsible for generating a Edit: I missed a step. I didn't realize that abandoning is already what is happening, it's just that the kernel doesn't consider an abandonment the same as if the vat had retired its export. Yeah that's weird. But now I'm confused by the following:
I'm confused: is the exporter included in the reachable refcount? In that case, wouldn't these abandoned-but-not-retired exports have a reachable count of 1, since we didn't decref? Can we always assume that the exporter accounted for one of the refs? |
This took me a minute to process, but it might be important to highlight that a vat abandoning an export does not mean the object should no longer be referenced by anyone. While sending a message to the object will splat, other vats may still share the reference and use it for its identity. As such the kernel cannot instruct the vats to forget about the reference unless the reference is unreachable by anyone. |
Nope, the exporter doesn't get a refcount (of either flavor), and The design is tuned to minimize space, potentially at the cost of increased churn. One consequence is that a single vref may have multiple (non-overlapping) krefs assigned to it over the lifetime of the vat. If vatA exports vref The alternative would have been to have the kernel maintain the c-list entry until the exporting vat retired the vref, which is basically equivalent to having the kernel maintain its own I haven't done the analysis to determine how commonly we experience that churn. The tool would want to scan through the slogs and look at both the KernelDeliveryObject and VatDeliveryObject (also the syscall object pairs) and build a list of kref/vref pairs in a database. Then, after processing everything, count how many unique krefs were seen for each vref. This needs slogs, because the transcripts themselves only have vrefs. We could build a more complicated tool that only used the transcripts, by doing a stateful thing where we deduce the current contents of the c-lists: populate when the vref first appears, remove when a retire appears, and assign unique pseudo-krefs at population time (each of which would map to a real kref, except that this tool never sees the real krefs), and then do the same uniqueness processing. Actually, we could simplify that: just count how many times a retire appears for each vref.
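The simplified analysis at the end of that paragraph could look roughly like the sketch below. The entry shape (`{ type, vrefs }`) is a simplified stand-in invented for illustration, not the real transcript or slog format, and counting only `retireExports` entries is an assumption about which retire events matter for exporter-side churn.

```javascript
// Hypothetical, simplified transcript entries: { type, vrefs }.
// Counts how many times each vref appears in a retireExports entry.
function countRetiresPerVref(entries) {
  const counts = new Map();
  for (const { type, vrefs } of entries) {
    if (type !== 'retireExports') continue; // ignore everything else
    for (const vref of vrefs) {
      counts.set(vref, (counts.get(vref) || 0) + 1);
    }
  }
  return counts;
}

// Usage with made-up data: o+1 is retired twice, o+2 once.
const sampleEntries = [
  { type: 'retireExports', vrefs: ['o+1', 'o+2'] },
  { type: 'message', vrefs: ['o+1'] },
  { type: 'retireExports', vrefs: ['o+1'] },
];
const retireCounts = countRetiresPerVref(sampleEntries);
```

Under the assumption that every retire removes the c-list entry, a vref retired k times has been assigned at most k+1 krefs over the vat's lifetime, so the counts bound the churn without ever seeing a real kref.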
Right,
Abandoning an unreachable export is the trigger: vats can't send messages to it, nor can they share a reference, because all they've got is a WeakMap key, and they can't reach those. |
Ok I think I can simplify this into two pieces. The first is our rule that we push a kref onto The second is a table of checks/actions that When
Case A: The kref is still reachable, so do nothing. This happens when a break-before-make handoff occurs, like decrementing the refcount as we take a message off the run-queue, then incrementing it as we translate it for delivery and add it to the receiving vat's c-list, so the refcount bounces off 0 briefly.
Case B: The kref is unreachable by other vats, but they can still recognize it. In this case, If the kref is somehow re-added to
Case C: The kref is neither reachable nor recognizable by other vats, but it is still being exported. If When
TODO: when do we enqueue dispatch.retireImports to the remaining recognizing vats? ANSWER: for case C it's irrelevant, all the other vats have already done syscall.retireImports. But in general, the
Case D: The vat is either terminated, or the vref was non-durable and was abandoned by an upgrade. But, other vats can still reach it. This is fine, we don't need to tell anybody anything. The former owning vat is either dead or doesn't care. The importing vats keep importing it: any messages they send to it will go splat, but they can continue to tell each other about the object, and they can use it in WeakMaps. Nothing changes until the object becomes unreachable.
Case E: Like above, the vat is dead/upgraded, but other vats can merely recognize the object, not reach it. We can reach here in one of two ways: Case D plus a reachable-count decref, or case B plus a vat termination/upgrade. In either case, we want to add a
Case F: No vat knows about the kref. We might get into this state if the vat was terminated/upgraded while the state-C gcActions were still queued, maybe. All we need to do is delete |
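The six cases above can be read as a function of three inputs: whether the kref still has an owner, its reachable count, and its recognizable count. The sketch below is only that reading, not kernel code: `classifyKref` and its argument shape are hypothetical, the action labels are borrowed from the "GCA" annotations in the state diagram earlier in the thread, and it ignores what the exporter has already been told (the "known-*" dimension of that diagram).

```javascript
// Hypothetical classifier for cases A-F. `owner` is undefined when orphaned.
// Assumes recognizable >= reachable, i.e. reachable refs are also recognizable.
function classifyKref({ owner, reachable, recognizable }) {
  const owned = owner !== undefined;
  if (owned) {
    if (reachable > 0) return { case: 'A', actions: [] };
    if (recognizable > 0) return { case: 'B', actions: ['dropExport'] };
    return { case: 'C', actions: ['dropExport', 'retireExport'] };
  }
  if (reachable > 0) return { case: 'D', actions: [] };
  if (recognizable > 0) return { case: 'E', actions: ['retireImport', 'delete'] };
  return { case: 'F', actions: ['delete'] };
}

// Usage: an orphaned kref that one vat still holds as a WeakMap key.
const orphanedWeakKey = classifyKref({ owner: undefined, reachable: 0, recognizable: 1 });
```

Case E is the one this issue is about: nobody can reach the object and nobody owns it, so the kernel itself has to tell the recognizing vats to retire their imports.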
Note to self, the TODO test that we're trying to fix is at agoric-sdk/packages/SwingSet/test/upgrade/upgrade.test.js Lines 710 to 722 in d941b39
gc-kernel-orphan.test ), which are currently failing for both syscall.abandonExports() and terminateVat() , happening either before or after an importing vat uses syscall.dropImports() to drop down to the merely-recognizable state. My new test is not exercising the upgrade/abandon-non-durables case.
|
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use orphanKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's orphanKernelObjects() will get processed promptly. Change getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery.
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it didn't. This occurred in "zoe - secondPriceAuction -- valid input", where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it didn't. This occurred in "zoe - secondPriceAuction -- valid input", where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
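To illustrate the flow that commit message describes, here is a toy in-memory model. It is a sketch under stated assumptions, not the kernelKeeper: the function names mirror the ones in the message, but `makeToyKernel`, `addObject`, the `Map`-based store, the `"reachable,recognizable"` string encoding, and the `notifications` log are all invented for illustration, and the owned-object branch (queueing dropExport/retireExport) is deliberately left out.

```javascript
// Toy model only; storage layout and helper names are assumptions.
function makeToyKernel() {
  const kvStore = new Map(); // e.g. 'ko5.owner' -> 'v1', 'ko5.refCount' -> '0,1'
  const importers = new Map(); // kref -> Set of vatIDs that can recognize it
  const maybeFreeKrefs = new Set();
  const notifications = []; // [vatID, kref] pairs standing in for retireImports

  function addObject(kref, owner, reachable, recognizable, recognizers) {
    kvStore.set(`${kref}.owner`, owner);
    kvStore.set(`${kref}.refCount`, `${reachable},${recognizable}`);
    importers.set(kref, new Set(recognizers));
  }

  // Tolerates deleted krefs: a missing refCount key reads as 0,0.
  function getObjectRefCount(kref) {
    const data = kvStore.get(`${kref}.refCount`);
    if (data === undefined) return { reachable: 0, recognizable: 0 };
    const [reachable, recognizable] = data.split(',').map(Number);
    return { reachable, recognizable };
  }

  // Mark krefs as orphaned and schedule them for a refcount check.
  function abandonKernelObjects(krefs) {
    for (const kref of krefs) {
      kvStore.delete(`${kref}.owner`);
      maybeFreeKrefs.add(kref);
    }
  }

  // Retire every scheduled kref that is both orphaned and unreachable.
  function processRefcounts() {
    for (const kref of [...maybeFreeKrefs]) {
      maybeFreeKrefs.delete(kref);
      const { reachable } = getObjectRefCount(kref);
      const orphaned = !kvStore.has(`${kref}.owner`);
      if (orphaned && reachable === 0) {
        for (const vatID of importers.get(kref) ?? []) {
          notifications.push([vatID, kref]);
        }
        importers.delete(kref);
        kvStore.delete(`${kref}.refCount`);
      }
    }
  }

  return { addObject, getObjectRefCount, abandonKernelObjects, processRefcounts, maybeFreeKrefs, notifications };
}

// Usage: ko5 is merely recognizable by v2 when its owner v1 abandons it.
const toy = makeToyKernel();
toy.addObject('ko5', 'v1', 0, 1, ['v2']);
toy.abandonKernelObjects(['ko5']);
toy.processRefcounts();
// A stale maybeFreeKrefs entry for the deleted kref must not throw.
toy.maybeFreeKrefs.add('ko5');
toy.processRefcounts();
```

The second `processRefcounts()` call stands in for the panic scenario: the kref is already retired and deleted, so a strict lookup of `ko5.refCount` would fail, while the tolerant version reads it as 0,0 and finds nothing left to notify.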
remediation idea: add a This sort of API would be a comfortable authority to expose: you can't hurt anything by checking, the worst you can do is waste some time ( We've talked about this sort of approach for more expensive tasks too, like have external code identify alleged reference cycles, then submit a txn to tell the kernel to check on it. Some problems are more amenable to the approach than others: cross-vat reference cycles don't expose enough information to the kernel to let it confirm that dropping all edges at the same time would bring the refcounts to zero (at least until we implement @mhofman 's ideas about revealing the vat-internal edges to the kernel). But a cycle that only went through e.g. promise resolutions could be dealt with. |
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which didn't. This occured in zoe - secondPriceAuction -- valid input , where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf.

We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly.

This also changes getObjectRefCount() to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, which arises when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), and the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it did not. This occurred in "zoe - secondPriceAuction -- valid input", where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj.

closes #7212
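The consolidation described in this commit message can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `makeKernelModel`, `addObject`, the `importers` set and the `gcActions` strings are hypothetical stand-ins, not the real kernelKeeper API, and only the reachable count is modelled.

```javascript
// Hypothetical in-memory model of the kernel object table. The function names
// echo the commit message, but none of this is the real SwingSet kernel code.
function makeKernelModel() {
  const objects = new Map(); // kref -> { owner, reachable, importers }
  const maybeFreeKrefs = new Set();
  const gcActions = []; // strings such as 'v2 retireImport ko21'

  function addObject(kref, owner, reachable, importers) {
    objects.set(kref, { owner, reachable, importers: new Set(importers) });
  }

  // Orphan every export of vatID and queue each kref for a refcount check.
  function abandonKernelObjects(vatID) {
    for (const [kref, ko] of objects) {
      if (ko.owner === vatID) {
        ko.owner = null;
        maybeFreeKrefs.add(kref);
      }
    }
  }

  // Single place that decides on retirement: orphaned AND unreachable.
  function processRefcounts() {
    const krefs = [...maybeFreeKrefs].sort();
    maybeFreeKrefs.clear();
    for (const kref of krefs) {
      const ko = objects.get(kref);
      if (!ko) continue; // already retired and deleted
      if (ko.owner === null && ko.reachable === 0) {
        for (const vatID of [...ko.importers].sort()) {
          gcActions.push(`${vatID} retireImport ${kref}`);
        }
        objects.delete(kref);
      }
    }
  }

  return { addObject, abandonKernelObjects, processRefcounts, objects, gcActions };
}

const k = makeKernelModel();
k.addObject('ko21', 'v1', 0, ['v2']); // unreachable, still recognized by v2
k.addObject('ko22', 'v1', 1, ['v3']); // still reachable from v3
k.abandonKernelObjects('v1'); // e.g. v1 was terminated
k.processRefcounts();
console.log(k.gcActions); // only ko21 is retired; ko22 stays as a reachable orphan
```

In this model `ko22` remains in the table with `owner === null`, which matches the commit's claim that only orphans that are also unreachable get retired.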
This adds a new (failing) test of #7212, enhances some other tests to cover the same thing, and uncomments a portion of upgrade.test.js which was commented out when we discovered the bug. These will only pass when the kernel properly retires unreachable objects that have just been abandoned by their owning vat. The new test (gc-kernel-orphan.test.js) also checks that vat termination on the same crank that retires an object will not cause a panic.
If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately.

The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat is dead, it can't retire the koid by itself, so the kernel needs to do the retirement on the vat's behalf.

We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement.

I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly.

This also changes getObjectRefCount() to tolerate deleted krefs (i.e. missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, which arises when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), and the target vat does syscall.exit during the delivery. The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it did not. This occurred in the "zoe - secondPriceAuction -- valid input" unit test, where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj.

Also rename abandonKernelObject back to orphanKernelObject; the name fits better now.

closes #7212
fix(swingset): retire unreachable orphans If a kernel object ("koid", the object subset of krefs) is unreachable, and then becomes orphaned (either because the owning vat was terminated, or called `syscall.abandonExports`, or was upgraded and the koid was ephemeral), then retire it immediately. The argument is that the previously-owning vat can never again talk about the object, so it can never become reachable again, which is normally the point at which the owning vat would retire it. But because the owning vat can't retire it by itself, the kernel needs to do the retirement on its behalf. We now consolidate retirement responsibilities into processRefcounts(): when terminateVat or syscall.abandonExports use abandonKernelObjects() to mark a kref as orphaned, it also adds the kref to maybeFreeKrefs, and then processRefcounts() is responsible for noticing the kref is both orphaned and unreachable, and then notifying any importers of its retirement. I double-checked that cleanupAfterTerminatedVat will always be followed by a processRefcounts(), by virtue of either being called from processDeliveryMessage (in the crankResults.terminate clause), or from within a device invocation syscall (which only happens during a delivery, so follows the same path). We need this to ensure that any maybeFreeKrefs created by the cleanup's abandonKernelObjects() will get processed promptly. Changes getObjectRefCount to tolerate deleted krefs (missing `koNN.refCount`) by just returning 0,0. This fixes a potential kernel panic in the new approach, when a kref is recognizable by one vat but only reachable by a send-message on the run-queue, then becomes unreachable as that message is delivered (the run-queue held the last strong reference), if the target vat does syscall.exit during the delivery. 
The decref pushes the kref onto maybeFreeKrefs, the terminateVat retires the merely-recognizable now-orphaned kref, then processRefcounts used getObjectRefCount() to grab the refcount for the now-retired (and deleted) kref, which asserted that the koNN.refCount key still existed, which it did not. This occurred in "zoe - secondPriceAuction -- valid input", where the contract did syscall.exit in response to a timer wake() message sent to a single-use wakeObj. closes #7212
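The getObjectRefCount() change in this commit can be illustrated roughly as follows. This is a hedged sketch, not the kernel's code: the `kvStore` here is a plain `Map`, and the `"reachable,recognizable"` string encoding of the `koNN.refCount` value is an assumption made for the example.

```javascript
// Hypothetical helper: the key name follows the `koNN.refCount` pattern from
// the commit message; the Map-based store and value encoding are assumptions.
function getObjectRefCount(kvStore, kref) {
  const data = kvStore.get(`${kref}.refCount`);
  if (data === undefined) {
    // The kref was already retired and deleted, so report 0,0 instead of
    // asserting that the key exists (the old behavior caused the panic).
    return { reachable: 0, recognizable: 0 };
  }
  const [reachable, recognizable] = data.split(',').map(Number);
  return { reachable, recognizable };
}

const kvStore = new Map([['ko30.refCount', '0,1']]);
console.log(getObjectRefCount(kvStore, 'ko30')); // recognizable but unreachable
kvStore.delete('ko30.refCount'); // retired during vat termination
console.log(getObjectRefCount(kvStore, 'ko30')); // tolerated: both counts are 0
```

Returning zeros lets a later processRefcounts() pass treat the deleted kref as having nothing left to do, rather than failing on the missing key.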
In #6696 (comment) I asked @gibson042 to add a test for a vat-upgrade-time kernel behavior which, it turns out, the kernel does not actually implement yet, as his test in #7170 (comment) discovered.
The scenario I was thinking about is:

- the importing vat (vatB) does `syscall.dropImport`, but not a `syscall.retireImport`
- the exporting vat receives `dispatch.dropExport`, but maintains an internal strong reference, so does not emit a `syscall.retireExport`
I mistakenly believed that the kernel would then send a `dispatch.retireImport` into vatB. The reasoning is that:

But, as `test-abandon-export.js` shows us, the kernel doesn't do that yet. The kernel object table is updated to show that the object has been abandoned (`owner = null`), but the refcount is unchanged. The entry will remain until all importing vats retire the import themselves, and that will only happen if their WeakMap gets deleted (probably never, they're usually long-lived).

So the feature to add here is for the kernel to notice when an object with
`refcount=0,*` (i.e. unreachable) becomes orphaned. This can happen because the vat did `syscall.abandonExports`, or because the kernel's `processUpgradeVat()` abandoned the object on behalf of a vat being upgraded (#6696), or because the vat was terminated. The kernel should respond to this as if the previously-exporting vat did a `syscall.retireExport`: it should find all importing/recognizing vats and send them a `dispatch.retireImport`.

Since we're unlikely to implement this before deploying the kernel in the Vaults release, we must also be able to catch up on unretired unreachable orphans that were created and abandoned during the "deployed but not fixed" window. We'll create an upgrade handler that walks the kernel object table, looking for entries where
`owner = null` and `refCount.reachable = 0`. For each one, we should perform the `retireExport` chores. When complete, those entries will be deleted.

We don't know how many such orphaned objects there will be. If there might be a lot of them, we should consider building an upgrade handler which does a limited amount of work in each block (perhaps some number of computrons can be budgeted towards this chore, and once that budget is exceeded, we switch over to normal deliveries). It should start at
`ko0` and work lexicographically upwards until all entries have been examined. The handler would need some persistent state to remember its progress between blocks. The edge-triggered code would also be on the lookout for newly-orphaned or newly-unreachable objects, so if the little-at-a-time loop passed by an earlier object (orphaned but still reachable), then when that object becomes unreachable later, the GC actions should still fire.
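The little-at-a-time catch-up pass could look something like the sketch below. Everything here is hypothetical: `sweepOrphans`, the `table` and `state` shapes, and the `retire` callback are illustration-only names, and the budget counts examined entries rather than computrons to keep the example small.

```javascript
// Hypothetical model: `table` maps kref -> { owner, reachable }, and `state`
// stands in for the persistent record that survives between blocks.
function sweepOrphans(table, state, budget, retire) {
  if (state.done) return 0;
  const krefs = [...table.keys()].sort(); // default sort is lexicographic
  let examined = 0;
  for (const kref of krefs) {
    // Skip everything at or before the saved cursor from earlier blocks.
    if (state.cursor !== undefined && kref <= state.cursor) continue;
    if (examined >= budget) return examined; // out of budget, resume next block
    const { owner, reachable } = table.get(kref);
    if (owner === null && reachable === 0) {
      retire(kref); // stand-in for the retireExport chores
      table.delete(kref);
    }
    state.cursor = kref;
    examined += 1;
  }
  state.done = true; // every entry has been examined once
  return examined;
}

const table = new Map([
  ['ko0', { owner: 'v1', reachable: 1 }],
  ['ko1', { owner: null, reachable: 0 }],
  ['ko10', { owner: null, reachable: 1 }],
  ['ko2', { owner: null, reachable: 0 }],
]);
const state = { cursor: undefined, done: false };
const retired = [];
while (!state.done) {
  sweepOrphans(table, state, 2, kref => retired.push(kref)); // one call per block
}
console.log(retired); // ko1 and ko2 are retired; ko0 and ko10 stay
```

Note that `ko10` sorts before `ko2` lexicographically, and that it is passed over while still reachable; catching it later is left to the edge-triggered code, as described above.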