Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XS isn't garbage-collecting objects when expected #3406

Closed
warner opened this issue Jun 24, 2021 · 8 comments
Closed

XS isn't garbage-collecting objects when expected #3406

warner opened this issue Jun 24, 2021 · 8 comments
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool

Comments

@warner
Copy link
Member

warner commented Jun 24, 2021

We're seeing unexpected behavior from the XS garbage collector. It seems to be retaining objects used in eventual-sends, where the V8 collector lets them get collected.

I'm still characterizing this, but a simple example is the following vat:

export function buildRootObject() {
  return Far('root', {
    two(A, B) { // A=ko26 B=ko27
      return E(A).hello(B);
    },
  });
}

(A and B are simple Remotables exported by some other vat, with a dummy hello: () => {} method)

This results in two cranks on our vat: the first delivers two(A,B), and the second notifies the vat about the hello() result promise being resolved.

At the end of each crank, our liveSlots.js layer is waiting for the vat's code to become idle (with await new Promise(setImmediate)), then forcing a GC sweep, with several additional setImmediate stalls introduced that were experimentally determined to allow finalizers to run. When the finalizers run, they accumulate the vrefs in a set, and then perform syscall.dropImports().

When we run this under Node.js, the two objects are collected during the second crank, causing our vat to do a syscall.dropImports() for the vrefs that represent both A and B. When we run this under XS, neither object is collected (i.e. these finalizers were not called by the end of the second crank). Since no further deliveries are made to this vat, there are no further opportunities for the vat to inform the kernel about the objects being dropped.

The same happens if we replace the return with async/await:

export function buildRootObject() {
  return Far('root', {
    async two(A, B) { // A=ko26 B=ko27
      await E(A).hello(B);
    },
  });
}

But if we remove the return:

export function buildRootObject() {
  return Far('root', {
    two(A, B) { // A=ko26 B=ko27
      E(A).hello(B);
    },
  });
}

Then we still get the same two cranks (however nobody is watching the result promise, so no user-level code runs when the notify arrives). Node.js collects both A and B during the second crank as before. Under XS, A is collected at the end of the first crank, and B is never collected.

Ideally, both A and B would be collected at the end of the first crank: there are no references to either after the hello() message is sent.

Our standard coding style is to use async methods and await (although only at the top of the method scope, not inside loops or conditionals). So we're going to hit this case all of the time, and we need those objects to get released.

@warner warner added bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool labels Jun 24, 2021
@warner
Copy link
Member Author

warner commented Jun 25, 2021

I tried to build up a reduced test case "from below", by starting with a simple xsnap runner, and adding code from swingset/eventual-send until it started to exhibit the failure. I wasn't able to make it fail.

I've started to try "from above", by removing things from swingset until the failure goes away. The one datapoint I have so far, which maybe will mean something to @michaelfig or @erights , is that removing the returned Promise p at the end of liveslots' queueMessage() allows both A and B to be colected in the first crank.

warner added a commit to agoric-labs/moddable that referenced this issue Jun 25, 2021
refs Agoric/agoric-sdk#3406

Note from Patrick Soquet (at Moddable):

  The objects to be collected were referenced by the closures of a function,
  deeply nested within the reject handler of a promise. The promise was
  fulfilled and stored into your “pendingPromises” set. The issue seemed to
  be that promises retained references to handlers even when resolved. I
  fixed that.

This patch will probably be added upstream soon, so this particular commit is
temporary.

Author: Patrick Soquet <ps@moddable.tech>
Tested-by: Brian Warner <warner@lothar.com>
warner added a commit that referenced this issue Jun 25, 2021
Update the packages/xsnap/moddable submodule to current public branch, plus
an upcoming xsPromise.c fix.

* current public branch includes a new WeakMap design, which drastically
improves GC speed
* the xsPromise.c fix now correctly drops rejection handlers for resolved
promises, which was probably the cause of #3406 (unexpected retention of
Presences used in `E()` calls, resulting in too few GC actions)

refs #3406  (might even close it)
refs #3118
@warner
Copy link
Member Author

warner commented Jun 25, 2021

The amazing Moddable folks identified the problem and provided a fix. In Patrick's words:

"The objects to be collected were referenced by the closures of a function, deeply nested within the reject handler of a promise. The promise was fulfilled and stored into your “pendingPromises” set. The issue seemed to be that promises retained references to handlers even when resolved. I fixed that. "

warner added a commit that referenced this issue Jun 25, 2021
This checks that our normal async / `await E()` pattern doesn't cause the
target or its arguments to be retained longer than expected.

refs #3406
@warner
Copy link
Member Author

warner commented Jun 25, 2021

PR #3416 landed the fix for this (in commit 8eeec28). I still need to run a perf test and count the drops, hopefully there will be more of them.

warner added a commit that referenced this issue Jun 26, 2021
This checks that our normal async / `await E()` pattern doesn't cause the
target or its arguments to be retained longer than expected.

refs #3406
warner added a commit that referenced this issue Jun 26, 2021
This checks that our normal async / `await E()` pattern doesn't cause the
target or its arguments to be retained longer than expected.

refs #3406
@warner
Copy link
Member Author

warner commented Jul 15, 2021

I have an example of this, however it's not XS-specific (V8 fails to drop the object too). It's basically just:

    makeInvitationTarget(zoe) {
      return E(zoe).makeInvitationZoe();
    },

(where zoe is a Presence pointing to some other vat)

The invitation object passes through liveslots (and a HandledPromise or two), but the user-level code never sees it. The vat imports the invitation from zoe, resolves the result promise for makeInvitationTarget, and is then supposed to drop it. But it doesn't drop it.

I tried commenting out the knownResolutions table update (that's a WeakMap from Promise to the resolution data.. if the result promise were somehow being kept alive, that would have also kept the invitation alive), but that didn't make the problem go away.

I'm currently suspecting some odd cycle in HandledPromise and/or the way that liveslots uses it.

warner added a commit that referenced this issue Jul 16, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS, but when I reproduced the
issue in a unit test and changed it to use a Node.js worker, the problem
remained. So I'm now suspicious of liveslots, or HandledPromise, or something
in their interaction.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.

I experimented with making a few changes in liveslots, without success:

* Disable the unfulfilledHandler's applyMethod (by just throwing an error
upon entry). The test case never sends a method to a promise, so this code
path isn't exercised.
* Disable the `knownResolutions.set` in `thenHandler`, which stashes the
resolution of a promise for a little while (it's a WeakMap from Promise
object to its resolution) so cycles in the resolution graph can be serialized
properly. We don't have cross-promise references in our test case, so this is
unnecessary. But removing it didn't help.

refs #3406
@warner
Copy link
Member Author

warner commented Jul 16, 2021

I've got a failing test in branch 3406-xs-not-gc. With some extra debug prints, I see the two deliveries in the vat-target which is incorrectly holding on to the just-passing-through Presence:

calling makeInvitation
 ls-v2 ==== deliver [o+0].makeInvitationTarget
 ls-v2 +hpid0 (makeImportedPresence)
 ls-v2 adding o-50
 ls-v2  hpid0 res/rej
 ls-v2 +hpid1 (makeImportedPromise)
 ls-v2 processDeadSet ---
 ls-v2 ==== notify p+5
 ls-v2 notifyOnePromise p+5
 ls-v2 +hpid2 (makeImportedPresence)
 ls-v2 adding o-51
 ls-v2 notifyOnePromise did unser p+5
 ls-v2 pRec.resolve p+5 Alleged: Zoe Invitation payment <[Object: null prototype] [Alleged: Zoe Invitation payment] {}> [Alleged: Zoe Invitation payment] {}
 ls-v2  hpid2 res/rej
 ls-v2  hpid1 res/rej
 ls-v2 knownResolutions.set p-60 Alleged: Zoe Invitation payment <[Object: null prototype] [Alleged: Zoe Invitation payment] {}> [Alleged: Zoe Invitation payment] {}
 ls-v2 -hpid0
 ls-v2 processDeadSet ---
 ls-v2 finalize o-50

The +hpid lines show the creation of each HandledPromise. In the first crank (the dispatch.deliver for makeInvitationTarget), it creates the first HP (hpid0) to use for the zoe argument, which gets resolved immediately. That HP itself is not released until the end of the second delivery, which seems longer than expected but still good enough. Then the second HP (hpid1) is created to hold the result promise of the outgoing syscall.send('makeInvitationZoe').

In the second crank (the notify which resolves that result promise), we build a third HP (hpid2) and resolve it promptly to build the invitation Presence. We resolve hpid0, which causes a syscall.resolve() to pass the Invitation back to the caller of makeInvitationTarget.

Neither hpid1 nor hpid2 are getting finalized at any point. This might be the root cause, if the resolution value of hpid2 (the invitation is being held too.

@warner
Copy link
Member Author

warner commented Jul 16, 2021

Big thanks to @michaelfig , we walked through this and found a problem in liveslots. Details will be in #3482. I'll leave this ticket open for an actual XS-specific problems we find once #3482 is fixed.

warner added a commit that referenced this issue Jul 17, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 18, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
warner added a commit that referenced this issue Jul 22, 2021
The new test, 'forward to fake zoe', mimics an execution pathway in the
tap-fungible-faucet load-generator task. In the loadgen task, the client asks
a fungible-faucet contract (an instance of
packages/zoe/src/contracts/mintPayments.js) for an Invitation. The method in
the faucet contract immediately sends off a requests to Zoe (through the zcf
facet) and returns the result Promise.

In my analysis of the loadgen slogfile, the Invitation object (a Zoe
Invitation payment) is imported into the faucet contract vat, sent back out
again as the resolution of its result promise, but then never dropped. I see
no good reason for the faucet contract to hold onto the object: the code
doesn't even have a place to put it.

I initially thought this was a problem with XS (#3406), but when I reproduced
the issue in a unit test and changed it to use a Node.js worker, the problem
remained. We traced it down to a problem in liveslots (#3482), which will be
fixed by the upcoming commit. This test will fail until that commit.

The test first talks to a fake Zoe vat to export the simulated Invitation
object and learn its kref. Then it instructs the bootstrap vat to ask
vat-target for an invitation, and vat-target delegates to vat-fake-zoe. Once
the kernel is done, and vat-target should have dropped the kref, the test
examines the clists. The test would pass if the vat-target clist did not
include the Invitation object's kref. Instead, vat-target still references
the kref, so the test fails.
@dckc
Copy link
Member

dckc commented Jul 22, 2021

Big thanks to @michaelfig , we walked through this and found a problem in liveslots. Details will be in #3482. I'll leave this ticket open for an actual XS-specific problems we find once #3482 is fixed.

Did you find any XS-specific problems?

@dckc dckc added this to the Testnet: Metering Phase milestone Aug 2, 2021
@dckc dckc self-assigned this Aug 2, 2021
@warner
Copy link
Member Author

warner commented Aug 3, 2021

#3488 is about a leak we found in the "vault" loadgen task, however we don't know if it's XS-specific or not. We decided to fix an API usage issue in the treasury contract first, then reevaluate to see if the leak is still present.

I'm going to close this ticket, and open a new one if/when we find a new leak.

@warner warner closed this as completed Aug 3, 2021
warner added a commit that referenced this issue Jun 1, 2024
This "forward to fake zoe" in gc-vat.test was added to demonstrate a
fix for #3482, in which liveslots was mishandling an intermediate
promise by retaining it forever, which made us retain objects that
appear in eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine
bug (#3406), so "is this specific to a single engine?" was on our
mind, and I wasn't sure that we were dealing with two independent bugs
until I wrote the test and showed that it failed on both V8 and XS. So
the test was originally written with a commented-out `managerType:`
option to make it easy to switch back and forth between `local` and
`xs-worker`. That switch was left in the `local` state, probably
because it's slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a
forced GC pass (see #5575 and #3240), and somehow it only seems to
fail in CI runs (and only for people other than me). Our usual
response is to make the test use XS instead of V8, either by setting
`creationOptions.managerType: 'xs-worker'` on the individual vat, or
by setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being
exercised (which should be marginally cheaper than making all vats use
XS).

closes #9392
mergify bot added a commit that referenced this issue Jun 1, 2024
)

This "forward to fake zoe" in gc-vat.test was added to demonstrate a fix
for #3482, in which liveslots was mishandling an intermediate promise by
retaining it forever, which made us retain objects that appear in
eventual-send results forever.

This problem was discovered while investigating an unrelated XS engine
bug (#3406), so "is this specific to a single engine?" was on our mind,
and I wasn't sure that we were dealing with two independent bugs until I
wrote the test and showed that it failed on both V8 and XS. So the test
was originally written with a commented-out `managerType:` option to
make it easy to switch back and forth between `local` and `xs-worker`.
That switch was left in the `local` state, probably because it's
slightly faster.

What we've learned is that V8 sometimes holds on to objects despite a
forced GC pass (see #5575 and #3240), and somehow it only seems to fail
in CI runs (and only for people other than me). Our usual response is to
make the test use XS instead of V8, either by setting
`creationOptions.managerType: 'xs-worker'` on the individual vat, or by
setting `defaultManagerType: 'xs-worker'` to set it for all vats.

This PR uses the first approach, changing just the one vat being
exercised (which should be marginally cheaper than making all vats use
XS).

closes #9392
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet xsnap the XS execution tool
Projects
None yet
Development

No branches or pull requests

2 participants