tools to debug+resume after kernel panic #1677
Assigning to Dean to let him explain this better.
@dtribble explained more to us, and there are multiple components.

Imagine you're a validator, you're churning away on the chain, and then suddenly your node halts with a big ugly "kernel panic" error message. You chat with other validators and discover they're all having the same problem. Now, how do you proceed?

The first component is how to identify what went wrong. You collectively look at the #3742 flight recorder (slog data) and see that crank 7890 caused delivery 1234 to vat56 to get started, but the kernel panicked before it finished. Vats aren't supposed to be able to make the kernel panic (modulo #4279), so this is by definition a kernel bug. Delivery 1234 was the immediate trigger, and it performed a syscall that provoked a kernel invariant failure.

The second component is how to debug this. You'd like to get the kernel under a debugger as it handles that delivery and poke around, so you'd like to be able to run a copy of the chain locally, with instructions to …

Now assume that the community has gone through this debugging process and understands the problem. At this point, they must decide on the best course of action. The committed history is currently all blocks up to (but excluding) the one that contained the fatal delivery. That history includes all the cranks and deliveries from those blocks, and the contents of the run-queue from the end of that block. It's unlikely that we would modify history/etc to resolve the problem, so there are a few likely approaches to pursue:
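For the identification step, one could imagine a small script that walks the slogfile looking for a delivery that started but never recorded a result. This is only a sketch: the `type`, `crankNum`, `vatID`, and `deliveryNum` field names are assumptions about the slog format, not a documented schema.

```js
// find-fatal-delivery.js -- sketch only; slog field names are assumptions
import fs from 'fs';
import readline from 'readline';

async function findUnfinishedDelivery(slogPath) {
  const rl = readline.createInterface({ input: fs.createReadStream(slogPath) });
  let inFlight = null; // the delivery that started but has no matching result
  for await (const line of rl) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    if (entry.type === 'deliver') {
      inFlight = { crankNum: entry.crankNum, vatID: entry.vatID, deliveryNum: entry.deliveryNum };
    } else if (entry.type === 'deliver-result') {
      inFlight = null; // that delivery finished cleanly
    }
  }
  return inFlight; // e.g. { crankNum: 7890, vatID: 'v56', deliveryNum: 1234 }
}

findUnfinishedDelivery(process.argv[2]).then(hit => {
  if (hit) {
    console.log(`panic during crank ${hit.crankNum}: delivery ${hit.deliveryNum} to ${hit.vatID}`);
  } else {
    console.log('no unfinished delivery found');
  }
});
```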
So the third component is: assuming the community has decided on one of these actions, how will the validators execute it? This is the most severe form of governance: validator software override. We can't really perform a governance vote because the chain has halted (although #4516 explores an alternative), so all validators are eagerly standing by to take recovery action. What do we tell them to type?

If the decision is to skip a particular delivery, or to kill a particular vat in lieu of processing a particular delivery, then we've got a pair of numbers to get into the kernel. We can add a …

Then we could introduce some sort of config file to cosmic-swingset that would read skip/terminate/replace-kernel directives from the file and submit them to swingset (a hypothetical format is sketched below). It would do this on each node restart, rather than being driven through transaction messages. If we had those pieces, then our instructions to validators would be:
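Purely as an illustration of what such a file might contain (the format, field names, and directive types here are invented for this sketch, not an existing cosmic-swingset feature), it could look something like:

```js
// recovery-actions.js -- hypothetical config module; nothing like this exists in cosmic-swingset today
export default {
  recoveryActions: [
    // skip one specific delivery entirely
    { type: 'skipDelivery', vatID: 'v56', crankNum: 7890, deliveryNum: 1234 },
    // or: terminate the vat instead of performing that delivery
    { type: 'terminateVat', vatID: 'v56', inLieuOfDeliveryNum: 1234 },
    // or: swap in a fixed kernel bundle before executing the block
    { type: 'replaceKernelBundle', bundlePath: './kernel-fixed-bundle.js' },
  ],
};
```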
Their node would start up, resume executing transactions from the beginning of the most recent block, and then swingset would get to the designated cranknum/deliverynum and perform the alternate action. If the action was to kill the vat, all validators would see the vat being killed (in consensus), and the kernel bug would not be triggered. If the action was to skip the delivery, all validators would skip the delivery, and the kernel bug would not be triggered. If the action was to replace the kernel, the validator would use the controller API to replace the kernel bundle before starting to execute the block, the triggering delivery would be allowed to go through, and the fixed kernel would not suffer the bug.

The code that skips a delivery based on a config file would look a lot like the code that calls … (a rough sketch of how the kernel might consult these directives appears below).

So the tasks are:
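As a rough sketch of how a kernel run-loop might consult those directives before performing a delivery (all of the names here are placeholders for illustration, not the real kernel API):

```js
// sketch: consult recovery directives before performing a delivery.
// `directives` is the parsed config from the sketch above; everything else
// is a placeholder name, not the real SwingSet kernel API.
function maybeOverrideDelivery(directives, vatID, deliveryNum) {
  for (const d of directives) {
    if (d.vatID !== vatID) continue;
    if (d.type === 'skipDelivery' && d.deliveryNum === deliveryNum) {
      return 'skip'; // drop the delivery entirely (in consensus)
    }
    if (d.type === 'terminateVat' && d.inLieuOfDeliveryNum === deliveryNum) {
      return 'terminate'; // kill the vat instead of delivering to it
    }
  }
  return 'deliver'; // no directive matched: proceed normally
}

// hypothetical usage inside the run-loop:
//   const action = maybeOverrideDelivery(directives, vatID, deliveryNum);
//   if (action === 'terminate') terminateVat(vatID);
//   else if (action === 'deliver') await deliverToVat(vatID, delivery);
//   // action === 'skip': fall through without delivering
```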
There are extensions to this which we're not going to pursue right now. The simplest would be: if the kernel crashes during a particular delivery, automatically configure a "terminate vat before delivery NNN" directive and restart the validator.
That approach would maximize the chances that the chain keeps moving forward, but I think it would increase the chances of divergence and confusion. Death before confusion.
I'm sizing this as a 3: 1 for the kernel code that implements the API and …
@warner What does Michael need to do for this issue? Something related to config files?
At the meeting, @dtribble suggested:
I'm still confused by that suggestion. It might mean we should use a `debugger()` statement to pop out to a debugger, but then we aren't really terminating the vat, we're just adding a breakpoint that fires under some particular condition (e.g. we're about to replay transcript entry N).

Maybe it implies a "pause vat" feature that we didn't talk about: instead of terminating the vat, we just want to not deliver messages to it for a while (but retain the option to resume delivering them again in the future). To implement this, I think we'd need to add a new "pause queue". Each time we pull a message off the run-queue and see that it's destined for a paused vat, we append it to the pause-queue instead of delivering it (a rough sketch is below).

I don't know what consequences these new message-ordering rules might have, nor where the authority to pause and resume a vat should be held. I'm pretty sure a paused vat should retain all its normal references, so paused vats are very different from terminated vats (which the kernel should be able to forget about utterly).
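To make that concrete, here is a minimal sketch of the pause-queue idea, assuming invented names (`pausedVats`, `pauseQueue`, `deliver`) and a much-simplified run-queue; it is not the actual SwingSet kernel code:

```js
// hypothetical pause-queue handling; not actual SwingSet kernel code
const deliver = (vatID, msg) => console.log(`deliver to ${vatID}`, msg); // stand-in for real delivery

function processNextMessage(kernelState) {
  const msg = kernelState.runQueue.shift();
  if (!msg) return; // nothing to run this crank

  if (kernelState.pausedVats.has(msg.vatID)) {
    // destined for a paused vat: defer it instead of delivering
    kernelState.pauseQueue.push(msg);
    return;
  }
  deliver(msg.vatID, msg); // normal path
}

// resuming moves the vat's deferred messages back onto the run-queue,
// preserving their original relative order
function resumeVat(kernelState, vatID) {
  kernelState.pausedVats.delete(vatID);
  const kept = [];
  for (const msg of kernelState.pauseQueue) {
    if (msg.vatID === vatID) {
      kernelState.runQueue.push(msg);
    } else {
      kept.push(msg);
    }
  }
  kernelState.pauseQueue = kept;
}
```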
Originally posted by @warner in #514 (comment)