Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch panics in Non-tx related code #1305

Open
1 of 3 tasks
ValarDragon opened this issue Apr 20, 2022 · 3 comments
Open
1 of 3 tasks

Catch panics in Non-tx related code #1305

ValarDragon opened this issue Apr 20, 2022 · 3 comments

Comments

@ValarDragon
Copy link
Member

ValarDragon commented Apr 20, 2022

At the moment, any panic in BeginBlock, or EndBlock will halt the chain. This by proxy means that any panic in epoch logic will halt the chain. Only panics inside of transactions get caught.

The causes of errors that will lead to panics in BeginBlock/EndBlock should not all be assumed to be chain halt worthy. In fact a lot may stem from edge cases as more automated chain logic is added, that should just be reported but not halt the chain.

There are some cases where the intent may explicitly be for a panic to halt the chain. In the future, I think there should be a sentinel prefix, which if a panic message starts with leads to chain halt.

I propose that we:

  • Catch panics in epoch hooks, and if a single epoch hook logic fails, revert its state change, and then proceed to the next hook
  • Catch panics in beginblock and endblock,
  • Make a 'signal code' that panics can use to indicate halting anyway.

One thing we need to be cognizant of, is that this doesn't slow down epoch processing due to: cosmos/cosmos-sdk#10310 . It is unclear to what extent that may be a concern here, after Roman's IAVL improvements, and should be empirically measured.

@p0mvn
Copy link
Member

p0mvn commented Apr 21, 2022

There are some cases where the intent may explicitly be for a panic to halt the chain. In the future, I think there should be a sentinel prefix, which if a panic message starts with leads to chain halt.

How are we planning to distinguish between what's panic worthy and what may be skipped before we get to the sentinel prefix?

@alexanderbez
Copy link
Contributor

Catch panics in epoch hooks, and if a single epoch hook logic fails, revert its state change, and then proceed to the next hook

What's the recovery mechanism for failed hook executions? I.e. what is the procedure to investigate, understand and potentially fix them? Or will they just fail silently?

@ValarDragon
Copy link
Member Author

How are we planning to distinguish between what's panic worthy and what may be skipped before we get to the sentinel prefix?

Since the majority of panics are unanticipated, I propose that the way this would work is a given module would decide that I want any panic within this module to be state machine halting. (And the default be for module panics to not cause state machine halts)

If thats deemed as fine, then a given module can wrap its panics if it wants to make them state machine halting, e.g. https://play.golang.com/p/ZUFo0KqevWJ

We'd expose a function like that in osmoutils that adds the sentinel prefix to a panic message

What's the recovery mechanism for failed hook executions? I.e. what is the procedure to investigate, understand and potentially fix them? Or will they just fail silently?

A failed hook execution will currently log as an error in a node. Hopefully theres a process for errors observed in full nodes in production to surface to github issues. Though this may not be the case due to noise from p2p logs. (Has there been any work Tendemrint side to combine these logs, or separate them per reactor into different files?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs Triage 🔍
Development

No branches or pull requests

3 participants