
fix(repo): Added panic hooks and reworked graceful shutdown #278

Merged: 1 commit merged from feat/eugene/elastic into main on Oct 24, 2024

Conversation

0xterminator (Contributor)

No description provided.

lostman (Contributor) left a comment

I'm not sure if this is what we want. Flushing async streams is tricky and we'd be better off assuming the publisher may die in the middle of publishing a block.

I talked to @0xterminator about it yesterday, and I think that the best and simplest thing we can do is this:

  1. Ensure we publish the entire block N before we start publishing N+1.
  2. Re-publish the last block on startup.

Currently, we look up the last published block L and we start at L+1. I have a PR that changes that to restart at L, precisely because the publisher may die before publishing inputs, receipts, and so on.
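
As an illustration of the restart-at-L approach described above, here is a minimal sketch (not code from this PR); last_published_height and publish_block are hypothetical stand-ins for the publisher's real lookup and publish calls:

async fn last_published_height() -> u64 {
    // e.g. look up the last block height that was fully published
    42
}

async fn publish_block(height: u64) {
    // publish the whole block (transactions, inputs, receipts, ...);
    // this must be idempotent so re-publishing block L on restart is safe
    println!("publishing block {height}");
}

#[tokio::main]
async fn main() {
    // Restart at L (re-publish the last block) instead of L + 1, because
    // the process may have died in the middle of publishing block L.
    let mut height = last_published_height().await;
    loop {
        publish_block(height).await;
        height += 1;
        if height > 45 {
            break; // bounded for this example
        }
    }
}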

@@ -301,20 +301,6 @@ impl<S: Streamable> Stream<S> {
}
}

pub async fn flush_await(

Contributor:

👍

Contributor Author:

thx

@@ -189,6 +177,26 @@ impl Publisher {
&self.streams
}

fn set_panic_hook(&mut self) {

Contributor:

This overrides the default panic hook, so if the publisher panics, we won't get the panic message.

Contributor Author:

We will; the first thing I am doing is printing the stack above!
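
As an aside on the default-hook concern above, here is a minimal sketch (not the PR's actual code) of preserving the default panic output by chaining the previously installed hook via std::panic::take_hook:

use std::panic;

fn set_panic_hook() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |panic_info| {
        // Run the default hook first so the panic message and backtrace are not lost.
        default_hook(panic_info);
        // Then do any additional synchronous logging or cleanup.
        eprintln!("custom panic handling after the default hook");
    }));
}

fn main() {
    set_panic_hook();
    panic!("demo panic");
}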

let nats_client = nats_client.clone();
let fuel_service = Arc::clone(&fuel_service);
handle.spawn(async move {
Publisher::flush_await_all_streams(&nats_client).await;

Contributor:

Flushing streams after a panic is unreliable; the async runtime may be in a compromised state.

Contributor Author:

It's usual practice to at least attempt some cleanup. Usually, if we die, we might just end up in a really bad state anyway. Also, note that both methods, flush_await_all_streams and stop_fuel, return no errors, i.e. they cannot fail whatsoever.

Contributor:

Yes, you can do sync cleanup, but you can't grab the Tokio runtime. Here's an example:

use tokio::runtime::Handle;
use std::panic;

fn set_panic_hook() {
    panic::set_hook(Box::new(|panic_info| {
        println!("Panic occurred: {}", panic_info);
        // can panic
        let handle = Handle::current();
        // can panic
        handle.spawn(async {
            println!("Performing async cleanup in panic hook...");
            // can panic
            tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
            println!("Async cleanup completed.");
        });
    }));
}

#[tokio::main]
async fn main() {
    set_panic_hook();
    panic!("This is a test panic!");
}

Output:

Panic occurred: panicked at src/main.rs:27:5:
This is a test panic!
Performing async cleanup in panic hook...
<EOF>

Run it several times and you'll see that you may not even make it to

Performing async cleanup in panic hook...

Member:

@lostman that happens because the owner process, main, dies, right? At that point, we're leveraging the auto-destructor from Tokio, so graceful shutdown of processes still happens and we simply need our operations to be recoverable from panics.

Jurshsmith (Member), Oct 24, 2024:

To make it more fine-grained, we could use CancellationToken & TaskTracker and similar APIs (like tokio-graceful-shutdown) but since our operations are supposed to be idempotent, I think this solution is quite ergonomic. wdyt?
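
For reference, a minimal sketch (not code from this repository) of the CancellationToken/TaskTracker pattern mentioned above, using tokio-util; the loop body is a hypothetical placeholder for an idempotent publishing step:

use std::time::Duration;
use tokio_util::sync::CancellationToken;
use tokio_util::task::TaskTracker;

#[tokio::main]
async fn main() {
    let token = CancellationToken::new();
    let tracker = TaskTracker::new();

    let worker_token = token.clone();
    tracker.spawn(async move {
        loop {
            tokio::select! {
                // Exit the loop once shutdown has been requested.
                _ = worker_token.cancelled() => break,
                // Hypothetical unit of work; in the publisher this would be
                // an idempotent "publish the next block" step.
                _ = tokio::time::sleep(Duration::from_millis(100)) => {
                    println!("published a (pretend) block");
                }
            }
        }
    });

    // In a real service this would be triggered by Ctrl-C or a fatal error;
    // here we simply request shutdown after a short delay.
    tokio::time::sleep(Duration::from_millis(350)).await;
    token.cancel();
    tracker.close();
    tracker.wait().await;
    println!("all tasks finished, shutting down");
}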

Jurshsmith (Member) left a comment

LGTM 👍🏾

pedronauck merged commit 1817350 into main on Oct 24, 2024; 15 checks passed.
pedronauck deleted the feat/eugene/elastic branch on October 24, 2024 at 18:10.
lostman added a commit that referenced this pull request on Oct 24, 2024.