mark some vats as critical, panic the kernel if one dies #4279

warner · 2022-01-11T22:40:20Z

What is the Problem Being Solved?

Currently, the only way for userspace code to panic the kernel is if the bootstrap() message (to the bootstrap vat) rejects its result promise. The kernel panics if a device throws an error, but devices are not really user code. The kernel also has a bunch of invariant checks that can cause a panic, but those are all internal and should not be provokable by userspace code.

However, in any given deployment, there are probably some userspace vats which are considered so critical that when they malfunction, we'd rather have the overall application crash than have the kernel kill off just the faulty vat. The comms and vat-vattp vats always fall into this category. In the Agoric chain, vat-zoe and all the vats that make up the token economy are chain-critical. It will be easier to recover a fault by halting the chain, than to have the kernel delete vat-zoe and reap all of the objects it has ever exported.

So the requirement is for some vats to be marked as "critical". If the kernel ever finds a reason to kill that vat (illegal syscall, metering fault, explicit syscall.terminate), the kernel should panic instead. This should cause controller.run() to reject its return promise instead of fulfilling it, and the host application should react by exiting immediately rather than committing the state changes.

Description of the Design

The vatAdminService~.createVat() call should accept a critical option. If true, the kernel should panic instead of killing this vat. The kernel config object should also accept this flag, for static vats.

This will go into the managerOptions. In kernel.js, when a vat is about to be killed, it should check the flag, and panic() if true (with some details about which vat triggered the panic).

For now, we'll allow any call to createVat to set this flag, meaning that any code which can create vats can also panic the kernel. vat-zoe is the only chain-side vat which gets access to vatAdminService, and the power to panic the kernel is comparable to the authority to create unmetered vats, which vatAdminService also provides. Later, if we find a use case for it, we can have vatAdminService produce less-powerful facets. Or, riffing on an approach @erights suggested for Meters, the service could provide a special criticality object, and the only way to mark a vat as critical is to pass a managerOptions of { critical: criticalityObject }, in a rights-amplification pattern that needs access to both vatAdminService and criticality. But for now, only zoe has access to vatAdminService, so we don't need to build anything too fancy.

Security Considerations

Building a deliberately-crashing vat and marking it critical is a kernel-killing power, so we should be conscious how it is disseminated.

We should think carefully about potential bugs in critical vats and decide whether halting the host application really is the best approach. If multiple services can be severed from one another, then allowing some portion of a chain to keep running even though others are dead might be better. We should war-game a bug in a core service somehow.

Test Plan

Unit tests

The text was updated successfully, but these errors were encountered:

warner · 2022-02-09T21:07:53Z

Assigning to @mhofman , I think this might overlap with pausing vats on meter underflow.

warner · 2022-05-17T22:12:27Z

Moving this to @FUDCo , the crash-or-not aspect is independent of the ability to pause vats.

…ical vat fails Fixes #4279

warner added enhancement New feature or request SwingSet package: SwingSet labels Jan 11, 2022

dckc added the audit-zestival Vulnerability assessment of ERTP + Zoe label Jan 12, 2022

warner added this to the Mainnet: Phase 1 - RUN Protocol milestone Jan 19, 2022

Tartuffo added the MN-1 label Jan 20, 2022

Tartuffo removed the MN-1 label Feb 7, 2022

Tartuffo removed this from the Mainnet: Phase 1 - RUN Protocol milestone Feb 8, 2022

warner assigned mhofman Feb 9, 2022

warner mentioned this issue Feb 9, 2022

tools to debug+resume after kernel panic #1677

Open

warner mentioned this issue Feb 18, 2022

Execute vat creation and root object initialization in a crank #4575

Merged

Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022

warner assigned FUDCo and unassigned mhofman May 17, 2022

FUDCo added a commit that referenced this issue May 26, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

0da195c

…ical vat fails Fixes #4279

FUDCo mentioned this issue May 26, 2022

feat: allow vats to be marked critical and panic the kernel if a critical vat fails #5436

Merged

FUDCo added a commit that referenced this issue May 26, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

019b654

…ical vat fails Fixes #4279

FUDCo added a commit that referenced this issue May 26, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

c99ef9e

…ical vat fails Fixes #4279

FUDCo added a commit that referenced this issue May 27, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

94e3235

…ical vat fails Fixes #4279

mhofman pushed a commit that referenced this issue May 27, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

41d1e96

…ical vat fails Fixes #4279

Tartuffo pushed a commit that referenced this issue May 28, 2022

feat: allow vats to be marked critical and panic the kernel if a crit…

9ef4941

…ical vat fails Fixes #4279

mergify bot closed this as completed in #5436 May 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mark some vats as critical, panic the kernel if one dies #4279

mark some vats as critical, panic the kernel if one dies #4279

warner commented Jan 11, 2022

warner commented Feb 9, 2022

warner commented May 17, 2022

mark some vats as critical, panic the kernel if one dies #4279

mark some vats as critical, panic the kernel if one dies #4279

Comments

warner commented Jan 11, 2022

What is the Problem Being Solved?

Description of the Design

Security Considerations

Test Plan

warner commented Feb 9, 2022

warner commented May 17, 2022