-
Notifications
You must be signed in to change notification settings - Fork 212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vat termination leaks promises #10261
Comments
I took a quick look at the mainnet DB. We have two vats which are already unused, which we'll terminate as soon as the demolition perimeter is secure: v45-auctioneer and v44-auctioneer.governor. These currently have 6 and 0 outstanding promises, respectively. We're also planning to cease using (and later terminate) v29-ATOM-USD_price_feed and v46-scaledPriceAuthority-ATOM. These are still in use, so they might have more outstanding promises than they will once we stop talking to them, but they currently have 6 and 12 promises outstanding. So, the overall size of the leak would not be huge. |
Oh, I think it's worse than that.. even externally-decided promises, known to the dead vat (and probably subscribed to), will be leaked, when we delete the c-list entry without decrementing the refcount. That means the second approach is definitely the better one. At termination time, the kernel walks all the promises decided by the vat, for rejection, but that's not a good place to walk the entire c-list. The slow-cleanup handler is a much better place for that kind of scan. |
Two tests are updated to exercise the cleanup of promise c-list entries during vat termination. `terminate.test.js` adds some promises to the c-list and then checks their refcounts after termination, to demonstrate that bug #10261 is leaking a refcount when it deletes the dead vat's c-list entry without also decrementing the refcount. `slow-termination.test.js` adds a number of promises to the c-list, and expectes the budget-limited cleanup to spend some time on promises. Both tests are marked as failing until the code fix is landed in the next commit.
Previously, when a vat was terminated, and we delete the promise c-list entries from its old state, the cleanup code was failing to decrement the kpid's refcount properly. This resulted in a leak: those promises could never be retired. This commit updates the vat cleanup code to add a new phase, named `promises`. This executes after `exports` and `imports`, but before `kv`, and is responsible for both deleting the c-list entries and also decrementing the refcounts of the corresponding promises. We do this slowly, like we do exports and imports, because we don't know how many there might be, and because those promise records might hold references to other objects (in the resolution data), which could trigger additional work. However, this work is unlikely to be significant: the run-queue is usually empty, so these outstanding promises are probably unresolved, and thus cannot beholding resolution data. All promises *decided* by the dead vat are rejected by the kernel immediately during vat termination, because those rejections are visible to userspace in other vats. In contrast, freeing the promise records is *not* visible to userspace, just like how freeing imports or exports are not visible to userspace, so this cleanup is safe to do at a leisurely pace, rate-limited by `runPolicy.allowCleanup`. The docs are updated to reflect the new `runPolicy` API: * `budget.promises` is new, and respected by slow cleanup * `work.promises` is reported to `runPolicy.didCleanup()` The 'test.failing' marker was removed from the previously updated tests. I don't intend to add any remediation code: it requires a full refcount audit to find such promises, and the mainnet kernel has only ever terminated one vat so far, so I believe there cannot be very many leaked promises, if any. Once this fix is applied, no new leaks will occur. fixes #10261
Two tests are updated to exercise the cleanup of promise c-list entries during vat termination. `terminate.test.js` adds some promises to the c-list and then checks their refcounts after termination, to demonstrate that bug #10261 is leaking a refcount when it deletes the dead vat's c-list entry without also decrementing the refcount. `slow-termination.test.js` adds a number of promises to the c-list, and expects the budget-limited cleanup to spend a phase on promises. Both tests are marked as failing until the code fix is landed in the next commit.
Previously, when a vat was terminated, and we delete the promise c-list entries from its old state, the cleanup code was failing to decrement the kpid's refcount properly. This resulted in a leak: those promises could never be retired. This commit updates the vat cleanup code to add a new phase, named `promises`. This executes after `exports` and `imports`, but before `kv`, and is responsible for both deleting the c-list entries and also decrementing the refcounts of the corresponding promises. We do this slowly, like we do exports and imports, because we don't know how many there might be, and because those promise records might hold references to other objects (in the resolution data), which could trigger additional work. However, this work is unlikely to be significant: the run-queue is usually empty, so these outstanding promises are probably unresolved, and thus cannot beholding resolution data. All promises *decided* by the dead vat are rejected by the kernel immediately during vat termination, because those rejections are visible to userspace in other vats. In contrast, freeing the promise records is *not* visible to userspace, just like how freeing imports or exports are not visible to userspace, so this cleanup is safe to do at a leisurely pace, rate-limited by `runPolicy.allowCleanup`. The docs are updated to reflect the new `runPolicy` API: * `budget.promises` is new, and respected by slow cleanup * `work.promises` is reported to `runPolicy.didCleanup()` The 'test.failing' marker was removed from the previously updated tests. I don't intend to add any remediation code: it requires a full refcount audit to find such promises, and the mainnet kernel has only ever terminated one vat so far, so I believe there cannot be very many leaked promises, if any. Once this fix is applied, no new leaks will occur. fixes #10261
…10268) Previously, when a vat was terminated, and we delete the promise c-list entries from its old state, the cleanup code was failing to decrement the kpid's refcount properly. This resulted in a leak: those promises could never be retired. This commit updates the vat cleanup code to add a new phase, named `promises`. This executes after `exports` and `imports`, but before `kv`, and is responsible for both deleting the c-list entries and also decrementing the refcounts of the corresponding promises. We do this slowly, like we do exports and imports, because we don't know how many there might be, and because those promise records might hold references to other objects (in the resolution data), which could trigger additional work. However, this work is unlikely to be significant: the run-queue is usually empty, so these outstanding promises are probably unresolved, and thus cannot beholding resolution data. All promises *decided* by the dead vat are rejected by the kernel immediately during vat termination, because those rejections are visible to userspace in other vats. In contrast, freeing the promise records is *not* visible to userspace, just like how freeing imports or exports are not visible to userspace, so this cleanup is safe to do at a leisurely pace, rate-limited by `runPolicy.allowCleanup`. The docs are updated to reflect the new `runPolicy` API: * `budget.promises` is new, and respected by slow cleanup * `work.promises` is reported to `runPolicy.didCleanup()` I don't intend to add any remediation code: it requires a full refcount audit to find such promises, and the mainnet kernel has only ever terminated one vat so far, so I believe there cannot be very many leaked promises, if any. Once this fix is applied, no new leaks will occur. fixes #10261
Describe the bug
While writing tests for remediation of #9039 , I noticed a second problem. When a vat is terminated, and the kernel rejects its remaining outstanding (vat-decided) promises, we delete the late vat's c-list entry without decrementing the corresponding refcount. That means these promises will be left with a final refcount of 1, not 0, causing the promise (and it's rejected value, e.g. the DisconnectionObject) to be retained forever.
To Reproduce
I'm modifying
SwingSet/test/vat-admin/terminate/terminate.test.js
to demonstrate the problem, by gleaning the kpid of a dead-vat -decided promise, and then examining the kvStore afterwards to see whether or not the kernel promise table entry is still present.Impact
On chain, we've only ever deleted one vat (a provisioning vat, during the initial bootstrap run at chain launch, back in 2023). So the impact is minimal so far. But we're intending to terminate some large vats in the near future, so we should get this fixed before we perform those terminations.
Cause
When the kernel terminates a vat, we run a function named
terminateVat()
in kernel.js . Originally, before we landed slow-termination, this built a list ofdeadPromises
by scanning for all promises decided by the vatID, then usedkernelKeeper.cleanupAfterTerminatedVat()
to delete all c-list entries (and vatstore), then submit each dead kpid toresolveToError()
. ThisresolveToError
callsdoResolve
with a rejection value, which is the same function that is used bysyscall.resolve
.The critical difference is that when a vat does
syscall.resolve()
, the vatTranslator (which converts the VatSyscallObject to a KernelSyscallObject, translating vrefs to krefs) would also retire the c-list entries immediately after translation, usingvatKeeper.deleteCListEntriesForKernelSlots()
. This function decrements the entry's refcount just before deleting the kvStore keys, allowing the usualmaybeFreeKrefs
processing to clean things up.But
terminateVat
goes directly toresolveToError
/doResolve
without going through a vref/kref translator, so nothing ever callsdeleteCListEntriesForKernelSlots
. It callscleanupAfterTerminatedVat
, and that function does decrement refcounts for kernel objects, but it does not do anything special for kernel promises, it merely deletes the forward (vNN.c.kpNN
) and reverse kvStore entries without touching refcounts.When we landed slow termination, we retained this pattern. All the dead vat's promises are rejected immediately, inside
terminateVat()
, but we leave c-list handling to the slow/backgroundkernelKeeper.cleanupAfterTerminatedVat()
, which behaves just like the old version, and doesn't perform refcount handling on promise c-list entries.Likely Fix
One path would be to properly retire the promise c-list entries during
terminateVat()
. We would change it to usevatKeeper.deleteCListEntriesForKernelSlots(deadPromises)
. I'd have to think about the right ordering: should be retire those entries before or after doingresolveToError
on all of them? The vat might have been self-subscribed to some of those promises, soresolveToError
will enqueue notifications to the dead vat. I'm pretty sure those will be safely ignored when they reach the front of the run-queue, but we'd need to test it.A second path would be to change
cleanupAfterTerminatedVat()
to have a'promise'
phase, probably just after the'import'
and'export'
phases, but before the'kv'
phase. Currently, thevNN.c.kpNN
promise c-list entries are bulk-deleted during thekv
phase, because I didn't think those deletions had side-effects (I thought the refcounts were decremented immediately, duringterminateVat()
, so it's safe to delete them at a faster rate. This would be in keeping with our slow-deletion patterns: all userspace-visible consequences of the termination happen immediately, duringterminateVat()
, but non-userspace-visible consequences (GC drops/retires) are allowed to happen slowly, afterwards.I'm inclined to take the second path, although it probably means more work: adding a new phase involves cosmic-swingset code to make that limit be configurable.
The text was updated successfully, but these errors were encountered: