Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zoe will be confused by vat-vat-admin upgrade cancelling the 'done' promise #8263

Closed
warner opened this issue Aug 28, 2023 · 5 comments · Fixed by #8453
Closed

zoe will be confused by vat-vat-admin upgrade cancelling the 'done' promise #8263

warner opened this issue Aug 28, 2023 · 5 comments · Fixed by #8453
Assignees
Labels
bug Something isn't working Zoe package: Zoe

Comments

@warner
Copy link
Member

warner commented Aug 28, 2023

What is the Problem Being Solved?

When a parent vat uses E(vatAdminService).createVat() to create a new vat, one thing it gets back is an adminNode: the "control facet" for the new vat. If you call E(adminNode).done(), the result Promise you get back will not generally fire until the new vat is upgraded.

However, vat-vat-admin could get upgraded, and all Promises decided by an upgraded vat are cancelled. This results in a distinctive rejection of the promise (the rejection value is an object, with { name: 'vatUpgraded' } and some other properties), which can be recognized by the isUpgradeDisconnection function from @agoric/internal.

While planning changes that would require vat-vat-admin to be upgraded, we realized that the currently-deployed Zoe will probably react to a vat-vat-admin upgrade as if the child vat had terminated, because the code that follows the .done() promise does not use isUpgradeDisconnection:

E.when(
E(adminNode).done(),
completion => {
instanceAdmin.exitAllSeats(completion);
},
reason => instanceAdmin.failAllSeats(reason),
);

Description of the Design

The zoe code needs to examine the rejection result using isUpgradeDisconnection(reason). If true, then vat-vat-admin was upgraded, and zoe needs to fetch a new done promise (by calling E(adminNode).done()) and attach the same sort of callback. The original uses E.when() so I'm not quite sure of the sequence, but if it were using .then, the fixed version might look like:

function reallyDone(adminNode) {
  return E(adminNode).done()
    .catch(err => isUpgradeDisconnection(err) ? reallyDone(adminNode) : throw(err));
 }

reallyDone(adminNode)
   .then(completion => instanceAdmin.exitAllSeats(completion),
         reason => instanceAdmin.failAllSeats(reason));

Security Considerations

I know we use E.when as a guard against malicious promise-like objects with a fake .then method, and I wouldn't want to reduce our defensiveness by adding a .then, please adapt the code above to the local norms.

Contract vats which call vatPowers.exitVatWithFailure(makeUpgradeDisconnection()) could confuse this code into an infinite loop, or maybe not, depending upon what adminNode.done() does on a terminated vat: I think the second call might trigger a different kind of error.

Scaling Considerations

Test Plan

We need Zoe unit tests that simulate the done() promise being abandoned, to ensure that Zoe will react properly. We also need Zoe upgrade tests to ensure that Zoe re-acquires the promise after an upgrade, or that its use of a DurablePromiseWatcher (see below) is doing the right thing.

Upgrade Considerations

The deployed Zoe has the old code, and will react the bad way unless/until Zoe is upgraded. So we must defer any upgrades of vat-vat-admin until after a zoe upgrade that fixes this issue.

The zoe code that calls E(adminNode).done() lives in startInstance, which would not be called again during a Zoe upgrade. Unless we add something that re-runs E(adminNode).done() on all existing contract instances during a Zoe upgrade, I think a Zoe upgrade will basically forget the old handler, so subsequent contract vat terminations will just be ignored. That's probably bad for cleanup.

It may be a good idea to have Zoe use globalThis.VatData.providePromiseWatcher() to establish a durable watcher on the done Promise. That would remove the need to re-fetch a done promise when zoe is upgraded, however we would still need the isUpgradedDisconnection() check to guard against vat-vat-admin being upgraded.

cc @Chris-Hibbert to think about the Zoe changes
cc @mhofman to help me remember that we cannot upgrade vat-vat-admin until a Zoe fix+upgrade is deployed

@warner warner added bug Something isn't working Zoe package: Zoe labels Aug 28, 2023
@Chris-Hibbert Chris-Hibbert self-assigned this Aug 28, 2023
@mhofman
Copy link
Member

mhofman commented Aug 28, 2023

How does it even work today across Zoe upgrades? E.when() registers a heap reaction, which would disappear as soon as Zoe upgrades.

@turadg turadg added this to the Upgrade 13 milestone Aug 28, 2023
@mhofman
Copy link
Member

mhofman commented Aug 28, 2023

And yeah I'm really concerned about simply tail recurring with reallyDone. AFAIK, there is a counter on that upgrade rejection object exactly for those reasons.

@warner
Copy link
Member Author

warner commented Aug 28, 2023

How does it even work today across Zoe upgrades? E.when() registers a heap reaction, which would disappear as soon as Zoe upgrades.

It does not, that's what I tried to capture in:

The zoe code that calls E(adminNode).done() lives in startInstance, which would not be called again during a Zoe upgrade. Unless we add something that re-runs E(adminNode).done() on all existing contract instances during a Zoe upgrade, I think a Zoe upgrade will basically forget the old handler, so subsequent contract vat terminations will just be ignored. That's probably bad for cleanup.

@mhofman
Copy link
Member

mhofman commented Aug 28, 2023

My bad, I hadn't gotten to that part yet. That feels like a separate but related issue. Effectively Zoe upgrades don't handle vat termination cleanup today already.

@warner
Copy link
Member Author

warner commented Sep 25, 2023

I tested this more directly with my "ghost-replay" tool (not yet landed). It shows that the upgrade11 version of zoe will react to a cancelled promise (like kp146) by setting state.acceptingOffers = false on some durable object (not sure which, but I imagine it's obvious in the code somewhere).

% node src/run-ghost.js swing-store cancel-promise kp146
...
% cat vatstore.out
## v9.vom.o+d33/2
 - state.acceptingOffers:
   old: body: "#true"   slots:
   new: body: "#false"   slots:

Looking at the slogfile from this test, I see this state.acceptingOffers change, and something iterates through a collection (finding nothing), and no other syscalls. So I think whatever vat kp146 represents (the iface of the public facet is "BLD issuer", in case that helps) does not have any open offers/seats/etc that need to be closed out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Zoe package: Zoe
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants