XS isn't garbage-collecting objects when expected #3406

warner · 2021-06-24T18:44:21Z

We're seeing unexpected behavior from the XS garbage collector. It seems to be retaining objects used in eventual-sends, where the V8 collector lets them get collected.

I'm still characterizing this, but a simple example is the following vat:

export function buildRootObject() {
  return Far('root', {
    two(A, B) { // A=ko26 B=ko27
      return E(A).hello(B);
    },
  });
}

(A and B are simple Remotables exported by some other vat, with a dummy hello: () => {} method)

This results in two cranks on our vat: the first delivers two(A,B), and the second notifies the vat about the hello() result promise being resolved.

At the end of each crank, our liveSlots.js layer is waiting for the vat's code to become idle (with await new Promise(setImmediate)), then forcing a GC sweep, with several additional setImmediate stalls introduced that were experimentally determined to allow finalizers to run. When the finalizers run, they accumulate the vrefs in a set, and then perform syscall.dropImports().

When we run this under Node.js, the two objects are collected during the second crank, causing our vat to do a syscall.dropImports() for the vrefs that represent both A and B. When we run this under XS, neither object is collected (i.e. these finalizers were not called by the end of the second crank). Since no further deliveries are made to this vat, there are no further opportunities for the vat to inform the kernel about the objects being dropped.

The same happens if we replace the return with async/await:

export function buildRootObject() {
  return Far('root', {
    async two(A, B) { // A=ko26 B=ko27
      await E(A).hello(B);
    },
  });
}

But if we remove the return:

export function buildRootObject() {
  return Far('root', {
    two(A, B) { // A=ko26 B=ko27
      E(A).hello(B);
    },
  });
}

Then we still get the same two cranks (however nobody is watching the result promise, so no user-level code runs when the notify arrives). Node.js collects both A and B during the second crank as before. Under XS, A is collected at the end of the first crank, and B is never collected.

Ideally, both A and B would be collected at the end of the first crank: there are no references to either after the hello() message is sent.

Our standard coding style is to use async methods and await (although only at the top of the method scope, not inside loops or conditionals). So we're going to hit this case all of the time, and we need those objects to get released.

The text was updated successfully, but these errors were encountered:

warner · 2021-06-25T09:09:02Z

I tried to build up a reduced test case "from below", by starting with a simple xsnap runner, and adding code from swingset/eventual-send until it started to exhibit the failure. I wasn't able to make it fail.

I've started to try "from above", by removing things from swingset until the failure goes away. The one datapoint I have so far, which maybe will mean something to @michaelfig or @erights , is that removing the returned Promise p at the end of liveslots' queueMessage() allows both A and B to be colected in the first crank.

agoric-sdk/packages/SwingSet/src/kernel/liveSlots.js

Line 649 in 5b38615

return p;

refs Agoric/agoric-sdk#3406 Note from Patrick Soquet (at Moddable): The objects to be collected were referenced by the closures of a function, deeply nested within the reject handler of a promise. The promise was fulfilled and stored into your “pendingPromises” set. The issue seemed to be that promises retained references to handlers even when resolved. I fixed that. This patch will probably be added upstream soon, so this particular commit is temporary. Author: Patrick Soquet <ps@moddable.tech> Tested-by: Brian Warner <warner@lothar.com>

Update the packages/xsnap/moddable submodule to current public branch, plus an upcoming xsPromise.c fix. * current public branch includes a new WeakMap design, which drastically improves GC speed * the xsPromise.c fix now correctly drops rejection handlers for resolved promises, which was probably the cause of #3406 (unexpected retention of Presences used in `E()` calls, resulting in too few GC actions) refs #3406 (might even close it) refs #3118

warner · 2021-06-25T18:55:14Z

The amazing Moddable folks identified the problem and provided a fix. In Patrick's words:

"The objects to be collected were referenced by the closures of a function, deeply nested within the reject handler of a promise. The promise was fulfilled and stored into your “pendingPromises” set. The issue seemed to be that promises retained references to handlers even when resolved. I fixed that. "

This checks that our normal async / `await E()` pattern doesn't cause the target or its arguments to be retained longer than expected. refs #3406

warner · 2021-06-25T23:56:44Z

PR #3416 landed the fix for this (in commit 8eeec28). I still need to run a perf test and count the drops, hopefully there will be more of them.

This checks that our normal async / `await E()` pattern doesn't cause the target or its arguments to be retained longer than expected. refs #3406

warner · 2021-07-15T21:54:49Z

I have an example of this, however it's not XS-specific (V8 fails to drop the object too). It's basically just:

    makeInvitationTarget(zoe) {
      return E(zoe).makeInvitationZoe();
    },

(where zoe is a Presence pointing to some other vat)

The invitation object passes through liveslots (and a HandledPromise or two), but the user-level code never sees it. The vat imports the invitation from zoe, resolves the result promise for makeInvitationTarget, and is then supposed to drop it. But it doesn't drop it.

I tried commenting out the knownResolutions table update (that's a WeakMap from Promise to the resolution data.. if the result promise were somehow being kept alive, that would have also kept the invitation alive), but that didn't make the problem go away.

I'm currently suspecting some odd cycle in HandledPromise and/or the way that liveslots uses it.

The new test, 'forward to fake zoe', mimics an execution pathway in the tap-fungible-faucet load-generator task. In the loadgen task, the client asks a fungible-faucet contract (an instance of packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in the faucet contract immediately sends off a requests to Zoe (through the zcf facet) and returns the result Promise. In my analysis of the loadgen slogfile, the Invitation object (a Zoe Invitation payment) is imported into the faucet contract vat, sent back out again as the resolution of its result promise, but then never dropped. I see no good reason for the faucet contract to hold onto the object: the code doesn't even have a place to put it. I initially thought this was a problem with XS, but when I reproduced the issue in a unit test and changed it to use a Node.js worker, the problem remained. So I'm now suspicious of liveslots, or HandledPromise, or something in their interaction. The test first talks to a fake Zoe vat to export the simulated Invitation object and learn its kref. Then it instructs the bootstrap vat to ask vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once the kernel is done, and vat-target should have dropped the kref, the test examines the clists. The test would pass if the vat-target clist did not include the Invitation object's kref. Instead, vat-target still references the kref, so the test fails. I experimented with making a few changes in liveslots, without success: * Disable the unfulfilledHandler's applyMethod (by just throwing an error upon entry). The test case never sends a method to a promise, so this code path isn't exercised. * Disable the `knownResolutions.set` in `thenHandler`, which stashes the resolution of a promise for a little while (it's a WeakMap from Promise object to its resolution) so cycles in the resolution graph can be serialized properly. We don't have cross-promise references in our test case, so this is unnecessary. But removing it didn't help. refs #3406

warner · 2021-07-16T17:48:26Z

I've got a failing test in branch 3406-xs-not-gc. With some extra debug prints, I see the two deliveries in the vat-target which is incorrectly holding on to the just-passing-through Presence:

calling makeInvitation
 ls-v2 ==== deliver [o+0].makeInvitationTarget
 ls-v2 +hpid0 (makeImportedPresence)
 ls-v2 adding o-50
 ls-v2  hpid0 res/rej
 ls-v2 +hpid1 (makeImportedPromise)
 ls-v2 processDeadSet ---
 ls-v2 ==== notify p+5
 ls-v2 notifyOnePromise p+5
 ls-v2 +hpid2 (makeImportedPresence)
 ls-v2 adding o-51
 ls-v2 notifyOnePromise did unser p+5
 ls-v2 pRec.resolve p+5 Alleged: Zoe Invitation payment <[Object: null prototype] [Alleged: Zoe Invitation payment] {}> [Alleged: Zoe Invitation payment] {}
 ls-v2  hpid2 res/rej
 ls-v2  hpid1 res/rej
 ls-v2 knownResolutions.set p-60 Alleged: Zoe Invitation payment <[Object: null prototype] [Alleged: Zoe Invitation payment] {}> [Alleged: Zoe Invitation payment] {}
 ls-v2 -hpid0
 ls-v2 processDeadSet ---
 ls-v2 finalize o-50

The +hpid lines show the creation of each HandledPromise. In the first crank (the dispatch.deliver for makeInvitationTarget), it creates the first HP (hpid0) to use for the zoe argument, which gets resolved immediately. That HP itself is not released until the end of the second delivery, which seems longer than expected but still good enough. Then the second HP (hpid1) is created to hold the result promise of the outgoing syscall.send('makeInvitationZoe').

In the second crank (the notify which resolves that result promise), we build a third HP (hpid2) and resolve it promptly to build the invitation Presence. We resolve hpid0, which causes a syscall.resolve() to pass the Invitation back to the caller of makeInvitationTarget.

Neither hpid1 nor hpid2 are getting finalized at any point. This might be the root cause, if the resolution value of hpid2 (the invitation is being held too.

warner · 2021-07-16T18:56:02Z

Big thanks to @michaelfig , we walked through this and found a problem in liveslots. Details will be in #3482. I'll leave this ticket open for an actual XS-specific problems we find once #3482 is fixed.

The new test, 'forward to fake zoe', mimics an execution pathway in the tap-fungible-faucet load-generator task. In the loadgen task, the client asks a fungible-faucet contract (an instance of packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in the faucet contract immediately sends off a requests to Zoe (through the zcf facet) and returns the result Promise. In my analysis of the loadgen slogfile, the Invitation object (a Zoe Invitation payment) is imported into the faucet contract vat, sent back out again as the resolution of its result promise, but then never dropped. I see no good reason for the faucet contract to hold onto the object: the code doesn't even have a place to put it. I initially thought this was a problem with XS (#3406), but when I reproduced the issue in a unit test and changed it to use a Node.js worker, the problem remained. We traced it down to a problem in liveslots (#3482), which will be fixed by the upcoming commit. This test will fail until that commit. The test first talks to a fake Zoe vat to export the simulated Invitation object and learn its kref. Then it instructs the bootstrap vat to ask vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once the kernel is done, and vat-target should have dropped the kref, the test examines the clists. The test would pass if the vat-target clist did not include the Invitation object's kref. Instead, vat-target still references the kref, so the test fails.

dckc · 2021-07-22T19:18:52Z

Big thanks to @michaelfig , we walked through this and found a problem in liveslots. Details will be in #3482. I'll leave this ticket open for an actual XS-specific problems we find once #3482 is fixed.

Did you find any XS-specific problems?

warner · 2021-08-03T17:37:54Z

#3488 is about a leak we found in the "vault" loadgen task, however we don't know if it's XS-specific or not. We decided to fix an API usage issue in the treasury contract first, then reevaluate to see if the leak is still present.

I'm going to close this ticket, and open a new one if/when we find a new leak.

This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix for #3482, in which liveslots was mishandling an intermediate promise by retaining it forever, which made us retain objects that appear in eventual-send results forever. This problem was discovered while investigating an unrelated XS engine bug (#3406), so "is this specific to a single engine?" was on our mind, and I wasn't sure that we were dealing with two independent bugs until I wrote the test and showed that it failed on both V8 and XS. So the test was originally written with a commented-out `managerType:` option to make it easy to switch back and forth between `local` and `xs-worker`. That switch was left in the `local` state, probably because it's slightly faster. What we've learned is that V8 sometimes holds on to objects despite a forced GC pass (see #5575 and #3240), and somehow it only seems to fail in CI runs (and only for people other than me). Our usual response is to make the test use XS instead of V8, either by setting `creationOptions.managerType: 'xs-worker'` on the individual vat, or by setting `defaultManagerType: 'xs-worker'` to set it for all vats. This PR uses the first approach, changing just the one vat being exercised (which should be marginally cheaper than making all vats use XS). closes #9392

) This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix for #3482, in which liveslots was mishandling an intermediate promise by retaining it forever, which made us retain objects that appear in eventual-send results forever. This problem was discovered while investigating an unrelated XS engine bug (#3406), so "is this specific to a single engine?" was on our mind, and I wasn't sure that we were dealing with two independent bugs until I wrote the test and showed that it failed on both V8 and XS. So the test was originally written with a commented-out `managerType:` option to make it easy to switch back and forth between `local` and `xs-worker`. That switch was left in the `local` state, probably because it's slightly faster. What we've learned is that V8 sometimes holds on to objects despite a forced GC pass (see #5575 and #3240), and somehow it only seems to fail in CI runs (and only for people other than me). Our usual response is to make the test use XS instead of V8, either by setting `creationOptions.managerType: 'xs-worker'` on the individual vat, or by setting `defaultManagerType: 'xs-worker'` to set it for all vats. This PR uses the first approach, changing just the one vat being exercised (which should be marginally cheaper than making all vats use XS). closes #9392

warner added bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool labels Jun 24, 2021

This was referenced Jun 25, 2021

fix(xsnap): update XS: new WeakMap design, fixed Promise drops #3415

Merged

Track recognizable objects used by VOM so other objects can be GC'd #3409

Merged

warner added a commit that referenced this issue Jun 25, 2021

chore(swingset): add test of GC behavior in xs vat

18f28d7

This checks that our normal async / `await E()` pattern doesn't cause the target or its arguments to be retained longer than expected. refs #3406

warner mentioned this issue Jun 25, 2021

chore(swingset): add test of GC behavior in xs vat #3417

Merged

warner added a commit that referenced this issue Jun 26, 2021

chore(swingset): add test of GC behavior in xs vat

c0f7d65

This checks that our normal async / `await E()` pattern doesn't cause the target or its arguments to be retained longer than expected. refs #3406

warner added a commit that referenced this issue Jun 26, 2021

chore(swingset): add test of GC behavior in xs vat

d99d8e3

This checks that our normal async / `await E()` pattern doesn't cause the target or its arguments to be retained longer than expected. refs #3406

warner mentioned this issue Jul 16, 2021

pendingPromises retains interior result Promise by mistake #3482

Closed

dckc added this to the Testnet: Metering Phase milestone Aug 2, 2021

dckc self-assigned this Aug 2, 2021

warner closed this as completed Aug 3, 2021

warner mentioned this issue Jun 1, 2024

flake in SwingSet gc test #9392

Open

warner mentioned this issue Jun 1, 2024

test(swingset): fix GC test flake by forcing xsnap worker, not V8 #9442

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XS isn't garbage-collecting objects when expected #3406

XS isn't garbage-collecting objects when expected #3406

warner commented Jun 24, 2021

warner commented Jun 25, 2021

warner commented Jun 25, 2021

warner commented Jun 25, 2021

warner commented Jul 15, 2021

warner commented Jul 16, 2021

warner commented Jul 16, 2021

dckc commented Jul 22, 2021

warner commented Aug 3, 2021

XS isn't garbage-collecting objects when expected #3406

XS isn't garbage-collecting objects when expected #3406

Comments

warner commented Jun 24, 2021

warner commented Jun 25, 2021

warner commented Jun 25, 2021

warner commented Jun 25, 2021

warner commented Jul 15, 2021

warner commented Jul 16, 2021

warner commented Jul 16, 2021

dckc commented Jul 22, 2021

warner commented Aug 3, 2021