Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mark some vats as critical, panic the kernel if one dies #4279

Closed
warner opened this issue Jan 11, 2022 · 2 comments · Fixed by #5436
Closed

mark some vats as critical, panic the kernel if one dies #4279

warner opened this issue Jan 11, 2022 · 2 comments · Fixed by #5436
Assignees
Labels
audit-zestival Vulnerability assessment of ERTP + Zoe enhancement New feature or request SwingSet package: SwingSet
Milestone

Comments

@warner
Copy link
Member

warner commented Jan 11, 2022

What is the Problem Being Solved?

Currently, the only way for userspace code to panic the kernel is if the bootstrap() message (to the bootstrap vat) rejects its result promise. The kernel panics if a device throws an error, but devices are not really user code. The kernel also has a bunch of invariant checks that can cause a panic, but those are all internal and should not be provokable by userspace code.

However, in any given deployment, there are probably some userspace vats which are considered so critical that when they malfunction, we'd rather have the overall application crash than have the kernel kill off just the faulty vat. The comms and vat-vattp vats always fall into this category. In the Agoric chain, vat-zoe and all the vats that make up the token economy are chain-critical. It will be easier to recover a fault by halting the chain, than to have the kernel delete vat-zoe and reap all of the objects it has ever exported.

So the requirement is for some vats to be marked as "critical". If the kernel ever finds a reason to kill that vat (illegal syscall, metering fault, explicit syscall.terminate), the kernel should panic instead. This should cause controller.run() to reject its return promise instead of fulfilling it, and the host application should react by exiting immediately rather than committing the state changes.

Description of the Design

The vatAdminService~.createVat() call should accept a critical option. If true, the kernel should panic instead of killing this vat. The kernel config object should also accept this flag, for static vats.

This will go into the managerOptions. In kernel.js, when a vat is about to be killed, it should check the flag, and panic() if true (with some details about which vat triggered the panic).

For now, we'll allow any call to createVat to set this flag, meaning that any code which can create vats can also panic the kernel. vat-zoe is the only chain-side vat which gets access to vatAdminService, and the power to panic the kernel is comparable to the authority to create unmetered vats, which vatAdminService also provides. Later, if we find a use case for it, we can have vatAdminService produce less-powerful facets. Or, riffing on an approach @erights suggested for Meters, the service could provide a special criticality object, and the only way to mark a vat as critical is to pass a managerOptions of { critical: criticalityObject }, in a rights-amplification pattern that needs access to both vatAdminService and criticality. But for now, only zoe has access to vatAdminService, so we don't need to build anything too fancy.

Security Considerations

Building a deliberately-crashing vat and marking it critical is a kernel-killing power, so we should be conscious how it is disseminated.

We should think carefully about potential bugs in critical vats and decide whether halting the host application really is the best approach. If multiple services can be severed from one another, then allowing some portion of a chain to keep running even though others are dead might be better. We should war-game a bug in a core service somehow.

Test Plan

Unit tests

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Jan 11, 2022
@dckc dckc added the audit-zestival Vulnerability assessment of ERTP + Zoe label Jan 12, 2022
@warner warner added this to the Mainnet: Phase 1 - RUN Protocol milestone Jan 19, 2022
@Tartuffo Tartuffo added the MN-1 label Jan 20, 2022
@Tartuffo Tartuffo removed the MN-1 label Feb 7, 2022
@Tartuffo Tartuffo removed this from the Mainnet: Phase 1 - RUN Protocol milestone Feb 8, 2022
@warner
Copy link
Member Author

warner commented Feb 9, 2022

Assigning to @mhofman , I think this might overlap with pausing vats on meter underflow.

@warner
Copy link
Member Author

warner commented May 17, 2022

Moving this to @FUDCo , the crash-or-not aspect is independent of the ability to pause vats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit-zestival Vulnerability assessment of ERTP + Zoe enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
5 participants