
cargo-shuttle: fix address in use error when service panicked in a previous run #805

Merged (14 commits) on Apr 27, 2023

Conversation

@iulianbarbu (Contributor) commented Apr 19, 2023

Description of change

Running a user service locally through `cargo shuttle run` starts a runtime that loads and starts the service as a tokio task. If the task panics, shuttle-runtime lives on because of tokio's behavior: tokio-rs/tokio#2002.
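
As a quick illustration of the linked tokio behaviour (a minimal sketch, not shuttle code): a panic inside a spawned task is caught by the runtime and only surfaced through the JoinHandle, so the process keeps running and keeps holding resources such as a bound address.

```rust
// Minimal sketch: a panic in a spawned tokio task does not take down the
// process; it is reported as a JoinError on the task's JoinHandle.
#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        panic!("service panicked");
    });

    // The panic does not propagate; we only learn about it through the handle.
    let result = handle.await;
    assert!(result.is_err());

    println!("runtime is still alive after the task panicked");
}
```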

This is not the case anymore. I haven't seen it make any difference for our use cases, so I'm removing it so that the bind_panic test succeeds.
Extra: set the kill_on_drop flag to true for started shuttle-runtimes, to make sure we either .wait() or .kill() them explicitly, or, if they are dropped, the kill is issued automatically: https://docs.rs/tokio/latest/tokio/process/struct.Command.html#method.kill_on_drop.
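
A minimal sketch of the flag in question; the executable name here is hypothetical and only stands in for the real runtime binary, and this must run inside a tokio runtime:

```rust
use tokio::process::{Child, Command};

// With kill_on_drop(true), tokio sends a kill signal to the child process if
// its Child handle is dropped before we explicitly .wait() or .kill() it.
fn spawn_runtime() -> std::io::Result<Child> {
    Command::new("shuttle-runtime") // hypothetical binary name, for illustration only
        .kill_on_drop(true)
        .spawn()
}
```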

How Has This Been Tested (if applicable)?

Ran the hello-world-rocket-app example locally with a panic in it. Without the fix, subsequent local runs hit the "address in use" error. The fix ensures the runtime and provisioner are killed when the runtime response is not successful.

Also, if the cargo-shuttle process receives SIGINT or SIGTERM, it now successfully stops the existing runtimes.
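
For context, here is a minimal, unix-only sketch (not the PR's exact code) of one way SIGINT and SIGTERM can be observed with tokio; once a signal is seen, the run loop can stop the runtimes instead of leaving them holding the bound address:

```rust
use tokio::signal::unix::{signal, SignalKind};

// Wait until either SIGINT or SIGTERM is delivered to the process.
async fn wait_for_stop_signal() -> std::io::Result<()> {
    let mut sigint = signal(SignalKind::interrupt())?;
    let mut sigterm = signal(SignalKind::terminate())?;

    tokio::select! {
        _ = sigint.recv() => { /* SIGINT (e.g. Ctrl+C) received */ }
        _ = sigterm.recv() => { /* SIGTERM received */ }
    }

    Ok(())
}
```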

@iulianbarbu force-pushed the fix/catch_unwind_local_run_errors branch from 64275c4 to b04ff51 on April 19, 2023 15:45
@iulianbarbu requested review from oddgrd and chesedo and removed the review request for oddgrd on April 19, 2023 15:46
@iulianbarbu self-assigned this on Apr 19, 2023
@oddgrd (Contributor) left a comment

Thanks @iulianbarbu! I'm wondering if this test failure in runtime is spurious or due to the change in proto. Did you see it fail locally too?

cargo-shuttle/src/lib.rs (resolved review comment, outdated)
@iulianbarbu (Contributor Author)

> Thanks @iulianbarbu! I'm wondering if this test failure in runtime is spurious or due to the change in proto. Did you see it fail locally too?

I'm curious whether they will fail again. I'm running them locally now as well.

@iulianbarbu (Contributor Author) commented Apr 20, 2023

It seems like the cargo-shuttle and shuttle-runtime tests fail again on the CI.

On my machine, however, the cargo-shuttle related ones pass. It's worth mentioning that running the tests the way the CI does ends with the message below for cargo-shuttle, even though I don't see any failed tests in the output.

error: test failed, to rerun pass `-p cargo-shuttle --test integration`

Caused by:
  process didn't exit successfully: `/Users/iulianbarbu/repos/shuttle/target/debug/deps/integration-9fe6044d5f6f7b77 --nocapture` (exit status: 1)

The shuttle-runtime bind_panic test seems to fail because of the kill_on_drop flag I added. I'll investigate whether removing it would regress the fix; we might not need it.

@iulianbarbu (Contributor Author) commented Apr 25, 2023

@oddgrd @chesedo, it seems the previously failing tests pass now, but this time the shuttle-deployer tests failed. I think this failure is spurious, because the previous commit was successful and the last commit only touches comments. I think this is mergeable, aside from improving a bit how the code looks.

@chesedo (Contributor) left a comment

Seems about good to go.

Comment on lines 781 to 796
if signal_received {
    let stop_request = StopRequest {};
    trace!(
        ?stop_request,
        "stopping service because it received a signal"
    );
    let _response =
        rt.1.stop(tonic::Request::new(stop_request))
            .or_else(|err| async {
                provisioner_server.abort();
                rt.0.kill().await?;
                Err(err)
            })
            .await?
            .into_inner();
    continue;
}
Contributor:

Have we not already handled this case above? I.e., if signal_received was false above, then it will still be false here.

@iulianbarbu (Contributor Author) commented Apr 25, 2023:

The previous signal_received handling refers to receiving a signal during the runtimes' initialization. If signal_received is true there, it should stop all the existing runtimes and return.

Otherwise, it moves on to waiting on each runtime in a for loop, which is the case we're discussing here. While waiting on these runtimes we must handle signals as well, hence we check signal_received before actually waiting on a runtime in a new iteration. If it's true, we received a signal in a previous iteration and must stop all the runtimes we haven't waited on yet, instead of continuing to wait on them.

runtime/src/alpha/mod.rs (resolved review comment, outdated)
@iulianbarbu force-pushed the fix/catch_unwind_local_run_errors branch 3 times, most recently from 6b5baa5 to a501e6c on April 25, 2023 10:35
@iulianbarbu changed the title from "cargo-shuttle(local_run): fix address in use error when service panicked in a previous run" to "cargo-shuttle: fix address in use error when service panicked in a previous run" on Apr 25, 2023
@oddgrd (Contributor) left a comment

Thanks for figuring this out, Iulian! I understand why they have a warning about this being tricky in the tokio::signal docs 😄 I left a few comments, mostly nits, and a note about considering doing this for Windows too in the future.

We might also want to test this on Windows to check that it doesn't cause any regressions; IIRC, importing something that's only available on unix may be an issue. But I can check this quickly tomorrow. Edit: it looks like all uses of unix signals are inside cfg(unix), so this should be fine.
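
For reference, a minimal sketch (not the PR's exact code) of what gating the unix-only signal imports behind cfg(unix) looks like, so the same crate still builds on Windows:

```rust
// Unix targets: wait for SIGTERM via tokio's unix signal support.
#[cfg(unix)]
async fn wait_for_termination() -> std::io::Result<()> {
    use tokio::signal::unix::{signal, SignalKind};

    let mut sigterm = signal(SignalKind::terminate())?;
    sigterm.recv().await;
    Ok(())
}

// Non-unix targets: there is no SIGTERM, so fall back to Ctrl+C only.
#[cfg(not(unix))]
async fn wait_for_termination() -> std::io::Result<()> {
    tokio::signal::ctrl_c().await
}
```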

cargo-shuttle/src/lib.rs (resolved review comment, outdated)
cargo-shuttle/src/lib.rs (resolved review comment)
cargo-shuttle/src/lib.rs (resolved review comment, outdated)
if !response.success {
    error!(error = response.message, "failed to load your service");
    exit(1);
// If prior signal received is set to true we must stop all the existing runtimes and
Contributor:

Doesn't this loop stop the services (unless the stop request fails), and then the runtimes are stopped on return (drop)? I may be misreading this 😅

@iulianbarbu (Contributor Author):

Hmm, yes, in a way, except that the runtimes being dropped doesn't necessarily mean the underlying processes are killed (https://docs.rs/tokio/latest/tokio/process/struct.Child.html#caveats). What I wanted to say is that if a signal is received during the previous loop, the one initializing the runtimes, then we should stop the existing runtimes (gracefully through the StopRequest, or kill them only if that fails) and then return from local_run as an abrupt termination.
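
To illustrate the linked caveat (a minimal sketch assuming a unix-like system with a `sleep` binary available): without kill_on_drop(true), dropping a tokio Child handle does not terminate the underlying process.

```rust
use tokio::process::Command;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let child = Command::new("sleep").arg("60").spawn()?;
    let pid = child.id();

    // Dropping the handle only releases our side; `sleep` keeps running.
    drop(child);

    println!("dropped the handle for pid {:?}; the process may still be alive", pid);
    Ok(())
}
```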

@iulianbarbu (Contributor Author) commented Apr 26, 2023

An addendum to the above comment: I had to adjust this logic a bit. Now, if a service panics during initialization, all the workspace runtimes must be killed and cargo-shuttle exits with code 1. Because we forcefully exit, the destructors for the runtimes are not called, so we must forcefully kill the runtime in addition to asking it to stop gracefully. The rest still applies.
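
A small illustration of why the explicit kill is needed before exiting with code 1: std::process::exit terminates the process immediately without running Drop implementations, so any cleanup that relies on destructors (including kill_on_drop firing when a Child handle is dropped) never gets a chance to run.

```rust
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        println!("drop ran"); // never printed: exit(1) below skips destructors
    }
}

fn main() {
    let _guard = Guard;
    // Stop or kill child processes explicitly here, before exiting...
    std::process::exit(1); // ...because this call does not unwind the stack.
}
```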

@iulianbarbu force-pushed the fix/catch_unwind_local_run_errors branch from a501e6c to 3a7d277 on April 25, 2023 19:38
@iulianbarbu force-pushed the fix/catch_unwind_local_run_errors branch from e6eb1fb to 3ee158d on April 26, 2023 06:13
@iulianbarbu force-pushed the fix/catch_unwind_local_run_errors branch from 4351ebb to 3bf6a0d on April 26, 2023 13:05
@@ -657,6 +657,7 @@ impl Shuttle {
    .stop(tonic::Request::new(stop_request))
    .or_else(|err| async {
        runtime.kill().await?;
        error!(status = ?err, "killed the runtime by force because stopping it errored out");
Contributor:

What if we use trace! here? Then the user won't see an error, which wouldn't be relevant to them in the event of Ctrl+C anyway, since as far as they are concerned it was a success, not an error, right?

@oddgrd merged commit 458cf25 into shuttle-hq:main on Apr 27, 2023
oddgrd pushed a commit that referenced this pull request on Apr 27, 2023

cargo-shuttle: fix address in use error when service panicked in a previous run (#805)

* runtime & proto: teardown idle runtime in case of service panics.
* cargo-shuttle: abort all runtimes when one fails during a local run.
* cargo-shuttle: added sigint and sigterm handlers
* proto: remove kill_on_drop flag
* cargo-shuttle: fix comments
* runtime: remove leftovers
* cargo-shuttle: extract stop_runtime logic in a function
* proto: add kill_on_drop flag again
* cargo-shuttle: fix again one service panic case
* runtime(tests): fix bind_panic
* cargo-shuttle: handle stop requests error gracefully
* cargo-shuttle: handle stopping error gracefully for service that panics
* cargo-shuttle: clarified logging in case runtime stopping errors
* cargo-shuttle: change log level