-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make post-run async and add retry logic #6705
Conversation
@@ -363,6 +366,7 @@ impl Service { | |||
pub fn stop(&mut self, | |||
shutdown_config: ShutdownConfig) | |||
-> impl Future<Item = (), Error = Error> { | |||
self.stop_post_run(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth adding a comment about why we're doing this here. Stopping health checks is (hopefully!) self-explanatory, since they're running all the time, but post-run
is a tad different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name stop_post_run
makes it clear what is happening, so I think the comment here should be focused on why that's necessary.
// TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a | ||
// hook. For example, a configurable backoff, timeout, or failure tracking. It would also be | ||
// nice to merge this with the healthcheck hook retry logic. | ||
pub fn loop_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might call this something like retriable_future
to capture why it's looping.
match exit_value.0 { | ||
0 => false, | ||
1 => true, | ||
_ => false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just reading this code, I'd expect this case to be true
... however, reading the documentation below indicates that any non-zero, non-one exit code means "fail without retrying"... having a comment about that at this code will be useful for anyone that comes through to work on this code in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what should be done here from a product perspective. I can think of three potential, reasonable options:
Option 1
- 0 no retry
- else retry
Option 2
- 0 or 1 no retry
- else retry
Option 3
- 1 retry
- else no retry
There are obviously infinite other possibilities. Is there currently a way for the user to access the exit code of the post-run
hook and use it? If not, and the exit code is solely used to determine retry behaviour, I now think I would change this to the simpler Option 1. Are there preferences for a particular set of logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't currently any mechanism by which a user can access the exit code of a hook and influence what is done as a result. Instead, each hook defines what effect the exit status will have in Hook::handle_exit
. That in turn converts the status into the ExitValue
type for the hook... in most cases, I don't think anything is really done with that value, but some hooks like health-check
and suitability
do make use of it.
I think option 1 above is fine.
As an aside, now that I'm looking a bit more closely, I'm not entirely sure what the benefit of using our own ExitCode
for an ExitValue
is over std::process::ExitStatus
. 🤔
If retrying hooks is a general thing we will end up wanting to do, we might consider not having a Hook::retry
function, but instead modify the handle_exit
function of retriable hooks to convert an ExitStatus
into a new RetryDisposition
(naming is hard) type that could be used as an ExitValue
. This could then be used by what's calling it to figure out what to do next. All that could come later though (if at all).
@@ -131,6 +131,8 @@ The post run hook will get executed after initial startup. | |||
|
|||
For many data services creation of specific users / roles or datastores is required. This needs to happen once the service has already started. | |||
|
|||
The retry behavior of this hook is determined by its exit code. Exit code `0` indicates success, and the hook will not be run again. Exit code `1` indicates a failure with a retry. The `post-run` hook will immediately be executed again. Continually exit with `1` to keep retrying the `post-run` hook. Any other exit code indicates failure without a retry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth mentioning that the service will continue running the post-run hook exits with a "fail but don't restart" code. This is what it did before (i.e., if the post-run hook failed, that'd be it and the service would go on its merry way), but it would be good to have that explicitly documented.
Long-term we'll want to think about whether or not this is the "correct" behavior, but that brings to mind Erlang-style restart directives and bigger questions that I think would be best to address deliberately and separately.
/// Consumes the handle to that future in the process. | ||
fn stop_post_run(&mut self) { | ||
if let Some(h) = self.post_run_handle.take() { | ||
h.terminate(); | ||
} | ||
} | ||
|
||
// This hook method looks different from all the others because | ||
// it's the only one that runs async right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably remove this particular comment, since it's no longer the only async hook 😂
hook_runner.clone() | ||
.into_future() | ||
.map(move |(exit_value, _duration)| { | ||
if hook_runner.hook.retry(exit_value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how you added a retry
function to the Hook
trait to allow for the (temporal) future where we can start to generalize this kind of hook behavior, rather than tightly-coupling this to the post-run
hook specifically. We don't lose anything, but we have a nice foundation from which to build.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to add some logging here (probably debug!
-level) indicating that we're rerunning the offending hook.
@@ -285,6 +285,8 @@ pub trait Hook: fmt::Debug + Sized + Send { | |||
status: ExitStatus) | |||
-> Self::ExitValue; | |||
|
|||
fn retry(&self, _exit_value: Self::ExitValue) -> bool { false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always good to add some documentation comments 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. All the comments are minor tweaks that you can take or leave. With the exception of the TODO, which should be addressed before merge and is why this review requests changes.
/// Consumes the handle to that future in the process. | ||
fn stop_post_run(&mut self) { | ||
if let Some(h) = self.post_run_handle.take() { | ||
h.terminate(); | ||
} | ||
} | ||
|
||
// This hook method looks different from all the others because | ||
// it's the only one that runs async right now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment isn't true any longer, is it?
/// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This | ||
/// will stop this retry loop regardless of `post-run`'s exit code. | ||
/// | ||
/// Consumes the handle to that future in the process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to indicate that in the name of the method rather than the comment? Or maybe a more interesting question is: does the caller of the method need to know this?
} | ||
|
||
/// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This | ||
/// will stop this retry loop regardless of `post-run`'s exit code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd hesitate to comment about the behavior of other code in comments, as they're extra likely to become incorrect. See the comment on post_stop
for instance. With the second sentence removed, I think this comment is great.
|
||
// TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a | ||
// hook. For example, a configurable backoff, timeout, or failure tracking. It would also be | ||
// nice to merge this with the healthcheck hook retry logic. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note to make sure this TODO is resolved before merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, this sounds a bit like YAGNI. Do we have a use case that demands this today? If not, I'd leave it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I will remove this comment. I wanted a way to signify to reviewers that this is where we could add this functionality and get their feedback. In that way the comment has served its purpose 😄 . Maybe a better method would have been to just leave a comment on the PR?
I do think it would be useful to merge this logic with the health check retry logic, and there is an issue about adding an initial delay before starting a service and many of those other features are currently TODOs in other parts of the code. Ultimatly, I think we will end up implementing some of this logic, but it is outside the scope of this PR.
0 => false, | ||
1 => true, | ||
_ => false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not
exit_value.0 == 1
?
Even better, define a constant (at the fn
or type level, either seems good) and you get:
exit_value == SHOULD_RETRY
You'll have to derive PartialEq
for ExitCode
, but that seems worth it to me.
Also, unrelated, if feels like this method would more accurately be named should_retry
.
27439bb
to
43aa4a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, but I think we need to ensure that the post-run hook is definitely stopped when we detach the Supervisor (otherwise, we stand to be back in the same situation that required #6717).
That raises an interesting question, though... if a post-run hook is failing (and thus re-running itself) when the Supervisor goes down for an update, do we restart it when the Supervisor comes back up, as we do with health checks? If so, then we'd need to somehow know which post-runs were failing and only bring them back up (otherwise, we could restart all post-run hooks, even ones that had successfully completed).
If we expect post-run hooks to be idempotent, that's probably OK, but since post-run hooks have only ever been run once until this point, I'm not sure how valid that is in practice (though I would hope it would be).
Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service. That would certainly be the most straightforward to implement (we wouldn't need to persist some kind of state across Supervisor restarts). Perhaps that's what we do for now, with the potential to follow up in the future if necessary. Any other thoughts? @davidMcneil @baumanj @raskchanky
At the very least, we should ensure that the post-run future is stopped when we detach.
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
I'm generally in favor of doing the simplest thing/YAGNI. The only thing to consider is whether it would be hard to change later. What's the impact on folks who get accustomed to it not restarting |
@@ -363,6 +366,7 @@ impl Service { | |||
pub fn stop(&mut self, | |||
shutdown_config: ShutdownConfig) | |||
-> impl Future<Item = (), Error = Error> { | |||
self.stop_post_run(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name stop_post_run
makes it clear what is happening, so I think the comment here should be focused on why that's necessary.
@@ -40,7 +57,26 @@ impl<H> HookRunner<H> where H: Hook + Sync | |||
pkg, | |||
passwd } | |||
} | |||
|
|||
pub fn retriable_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"retriable" is a totally valid spelling, though it might be worth considering using "retryable", so it turns up in a search for "retry"
hook_runner.clone() | ||
.into_future() | ||
.map(move |(exit_value, _duration)| { | ||
if <H as Hook>::should_retry(&exit_value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
if <H as Hook>::should_retry(&exit_value) { | |
if H::should_retry(&exit_value) { |
? It seems to work. Similarly with file_name
. If you think H
is unclear, it's fine to be explicit and call it HOOK
. There's no need for template variables to be limited to single characters.
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
43aa4a8
to
bf295ea
Compare
This should add the logic to detach the I snuck two refactoring commits on top of this change that are related but not core to the issue. I will happily revert them before merging if they are unhelpful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks also for adding the follow-up issue for #6739.
|
||
/// Return a future that will shut down a service, performing any | ||
/// necessary cleanup, and run its post-stop hook, if any. | ||
pub fn stop(&mut self, | ||
shutdown_config: ShutdownConfig) | ||
-> impl Future<Item = (), Error = Error> { | ||
self.stop_health_checks(); | ||
self.detach(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
if self.check_process() { | ||
// If the service is not initialized and the process is still running, the Supervisor | ||
// was restarted and we just have to reattach to the process. | ||
if up { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a welcome refactoring 👍
Resolves #2364 #5957
Signed-off-by: David McNeil mcneil.david2@gmail.com