fix(repo): Added panic hooks and reworked graceful shutdown #278
Conversation
Force-pushed from 322efa4 to 4292875.
Force-pushed from 4292875 to 85e18c2.
I'm not sure if this is what we want. Flushing async streams is tricky and we'd be better off assuming the publisher may die in the middle of publishing a block.
I talked to @0xterminator about it yesterday, and I think that the best and simplest thing we can do is this:
- Ensure we publish the entire block N before we start publishing N+1.
- Re-publish the last block on startup.
Currently, we look up the last published block L and we start at L+1. I have a PR that changes that to restart at L, precisely because the publisher may die before publishing inputs, receipts, and so on.
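For illustration, here is a rough sketch of that startup rule. The function names are hypothetical stand-ins, not the publisher's actual API; the point is restarting at L rather than L + 1 and finishing block N before starting N + 1:

    // Illustrative stubs standing in for the real publisher APIs.
    async fn last_published_block_height() -> u64 { /* e.g. read from NATS */ 0 }
    async fn publish_entire_block(_height: u64) { /* inputs, receipts, outputs, ... */ }

    // Restart at L, not L + 1, because the publisher may have died mid-block.
    // Publishing is assumed to be idempotent, so re-publishing block L is harmless.
    async fn resume_publishing() {
        let mut height = last_published_block_height().await; // L
        loop {
            publish_entire_block(height).await; // publish all of block N...
            height += 1;                        // ...before starting N + 1
        }
    }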
@@ -301,20 +301,6 @@ impl<S: Streamable> Stream<S> {
        }
    }

    pub async fn flush_await(
👍
thx
@@ -189,6 +177,26 @@ impl Publisher {
        &self.streams
    }

    fn set_panic_hook(&mut self) {
This overrides the default panic hook, so if the publisher panics, we won't get the panic message.
We will, the first thing I am doing is printing the stack above!
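For reference, one common way to keep the default panic message as well is to capture the previous hook and delegate to it. This is only a minimal sketch of that pattern, not the PR's actual code:

    use std::panic;

    fn set_panic_hook() {
        // Capture the existing (default) hook so its output is preserved.
        let previous_hook = panic::take_hook();
        panic::set_hook(Box::new(move |panic_info| {
            // Run custom logging/cleanup first...
            eprintln!("publisher panicked: {panic_info}");
            // ...then delegate to the previous hook, which prints the
            // default message and backtrace.
            previous_hook(panic_info);
        }));
    }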
        let nats_client = nats_client.clone();
        let fuel_service = Arc::clone(&fuel_service);
        handle.spawn(async move {
            Publisher::flush_await_all_streams(&nats_client).await;
Flushing streams after a panic is unreliable; the async runtime may be in a compromised state.
It's usual practice to at least attempt some cleanup; if we die without it, we might end up in a really bad state. Also, note that neither flush_await_all_streams nor stop_fuel returns an error, i.e. they cannot fail.
Yes, you can do synchronous cleanup, but you can't rely on grabbing the Tokio runtime from inside a panic hook. Here's an example:
    use std::panic;
    use tokio::runtime::Handle;

    fn set_panic_hook() {
        panic::set_hook(Box::new(|panic_info| {
            println!("Panic occurred: {}", panic_info);
            // can panic: Handle::current() panics if no Tokio runtime is entered
            // on the panicking thread
            let handle = Handle::current();
            // can panic
            handle.spawn(async {
                println!("Performing async cleanup in panic hook...");
                // can panic
                tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
                println!("Async cleanup completed.");
            });
        }));
    }

    #[tokio::main]
    async fn main() {
        set_panic_hook();
        panic!("This is a test panic!");
    }
Example output:

    Panic occurred: panicked at src/main.rs:27:5:
    This is a test panic!
    Performing async cleanup in panic hook...
    <EOF>

Run it several times and you'll see that you may not even make it to "Performing async cleanup in panic hook...".
@lostman that happens because the owner process, main, dies, right? At that point we're leveraging the auto-destructor from Tokio, so graceful shutdown of tasks still happens and we simply need our operations to be recoverable from panics.
To make it more fine-grained, we could use CancellationToken & TaskTracker and similar APIs (like tokio-graceful-shutdown), but since our operations are supposed to be idempotent, I think this solution is quite ergonomic. wdyt?
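For context, a minimal sketch of the CancellationToken/TaskTracker approach mentioned above, assuming the tokio-util crate; the worker loop and Ctrl-C trigger are illustrative, not the publisher's actual code:

    use tokio_util::sync::CancellationToken;
    use tokio_util::task::TaskTracker;

    #[tokio::main]
    async fn main() {
        let tracker = TaskTracker::new();
        let token = CancellationToken::new();

        // Hypothetical worker: does idempotent work until shutdown is requested.
        let worker_token = token.clone();
        tracker.spawn(async move {
            loop {
                tokio::select! {
                    _ = worker_token.cancelled() => break,
                    _ = tokio::time::sleep(std::time::Duration::from_millis(100)) => {
                        // publish_next_block().await; // idempotent work goes here
                    }
                }
            }
        });

        // Trigger shutdown on Ctrl-C, then wait for all tracked tasks to finish.
        tokio::signal::ctrl_c().await.expect("failed to listen for ctrl-c");
        token.cancel();
        tracker.close();
        tracker.wait().await;
    }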
LGTM 👍🏾
No description provided.