Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make post-run async and add retry logic #6705

Merged
merged 6 commits into from
Jul 29, 2019

Conversation

davidMcneil
Copy link
Contributor

Resolves #2364 #5957

Signed-off-by: David McNeil mcneil.david2@gmail.com

@@ -363,6 +366,7 @@ impl Service {
pub fn stop(&mut self,
shutdown_config: ShutdownConfig)
-> impl Future<Item = (), Error = Error> {
self.stop_post_run();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a comment about why we're doing this here. Stopping health checks is (hopefully!) self-explanatory, since they're running all the time, but post-run is a tad different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name stop_post_run makes it clear what is happening, so I think the comment here should be focused on why that's necessary.

components/sup/src/manager/service.rs Show resolved Hide resolved
// TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a
// hook. For example, a configurable backoff, timeout, or failure tracking. It would also be
// nice to merge this with the healthcheck hook retry logic.
pub fn loop_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might call this something like retriable_future to capture why it's looping.

match exit_value.0 {
0 => false,
1 => true,
_ => false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just reading this code, I'd expect this case to be true... however, reading the documentation below indicates that any non-zero, non-one exit code means "fail without retrying"... having a comment about that at this code will be useful for anyone that comes through to work on this code in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what should be done here from a product perspective. I can think of three potential, reasonable options:

Option 1

  • 0 no retry
  • else retry

Option 2

  • 0 or 1 no retry
  • else retry

Option 3

  • 1 retry
  • else no retry

There are obviously infinite other possibilities. Is there currently a way for the user to access the exit code of the post-run hook and use it? If not, and the exit code is solely used to determine retry behaviour, I now think I would change this to the simpler Option 1. Are there preferences for a particular set of logic here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't currently any mechanism by which a user can access the exit code of a hook and influence what is done as a result. Instead, each hook defines what effect the exit status will have in Hook::handle_exit. That in turn converts the status into the ExitValue type for the hook... in most cases, I don't think anything is really done with that value, but some hooks like health-check and suitability do make use of it.

I think option 1 above is fine.

As an aside, now that I'm looking a bit more closely, I'm not entirely sure what the benefit of using our own ExitCode for an ExitValue is over std::process::ExitStatus. 🤔

If retrying hooks is a general thing we will end up wanting to do, we might consider not having a Hook::retry function, but instead modify the handle_exit function of retriable hooks to convert an ExitStatus into a new RetryDisposition (naming is hard) type that could be used as an ExitValue. This could then be used by what's calling it to figure out what to do next. All that could come later though (if at all).

@@ -131,6 +131,8 @@ The post run hook will get executed after initial startup.

For many data services creation of specific users / roles or datastores is required. This needs to happen once the service has already started.

The retry behavior of this hook is determined by its exit code. Exit code `0` indicates success, and the hook will not be run again. Exit code `1` indicates a failure with a retry. The `post-run` hook will immediately be executed again. Continually exit with `1` to keep retrying the `post-run` hook. Any other exit code indicates failure without a retry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth mentioning that the service will continue running the post-run hook exits with a "fail but don't restart" code. This is what it did before (i.e., if the post-run hook failed, that'd be it and the service would go on its merry way), but it would be good to have that explicitly documented.

Long-term we'll want to think about whether or not this is the "correct" behavior, but that brings to mind Erlang-style restart directives and bigger questions that I think would be best to address deliberately and separately.

/// Consumes the handle to that future in the process.
fn stop_post_run(&mut self) {
if let Some(h) = self.post_run_handle.take() {
h.terminate();
}
}

// This hook method looks different from all the others because
// it's the only one that runs async right now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably remove this particular comment, since it's no longer the only async hook 😂

hook_runner.clone()
.into_future()
.map(move |(exit_value, _duration)| {
if hook_runner.hook.retry(exit_value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you added a retry function to the Hook trait to allow for the (temporal) future where we can start to generalize this kind of hook behavior, rather than tightly-coupling this to the post-run hook specifically. We don't lose anything, but we have a nice foundation from which to build.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be helpful to add some logging here (probably debug!-level) indicating that we're rerunning the offending hook.

@@ -285,6 +285,8 @@ pub trait Hook: fmt::Debug + Sized + Send {
status: ExitStatus)
-> Self::ExitValue;

fn retry(&self, _exit_value: Self::ExitValue) -> bool { false }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always good to add some documentation comments 😄

Copy link
Contributor

@baumanj baumanj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. All the comments are minor tweaks that you can take or leave. With the exception of the TODO, which should be addressed before merge and is why this review requests changes.

/// Consumes the handle to that future in the process.
fn stop_post_run(&mut self) {
if let Some(h) = self.post_run_handle.take() {
h.terminate();
}
}

// This hook method looks different from all the others because
// it's the only one that runs async right now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment isn't true any longer, is it?

/// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This
/// will stop this retry loop regardless of `post-run`'s exit code.
///
/// Consumes the handle to that future in the process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to indicate that in the name of the method rather than the comment? Or maybe a more interesting question is: does the caller of the method need to know this?

}

/// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This
/// will stop this retry loop regardless of `post-run`'s exit code.
Copy link
Contributor

@baumanj baumanj Jul 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd hesitate to comment about the behavior of other code in comments, as they're extra likely to become incorrect. See the comment on post_stop for instance. With the second sentence removed, I think this comment is great.


// TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a
// hook. For example, a configurable backoff, timeout, or failure tracking. It would also be
// nice to merge this with the healthcheck hook retry logic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note to make sure this TODO is resolved before merge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, this sounds a bit like YAGNI. Do we have a use case that demands this today? If not, I'd leave it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I will remove this comment. I wanted a way to signify to reviewers that this is where we could add this functionality and get their feedback. In that way the comment has served its purpose 😄 . Maybe a better method would have been to just leave a comment on the PR?

I do think it would be useful to merge this logic with the health check retry logic, and there is an issue about adding an initial delay before starting a service and many of those other features are currently TODOs in other parts of the code. Ultimatly, I think we will end up implementing some of this logic, but it is outside the scope of this PR.

0 => false,
1 => true,
_ => false,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not

        exit_value.0 == 1

?

Even better, define a constant (at the fn or type level, either seems good) and you get:

        exit_value == SHOULD_RETRY

You'll have to derive PartialEq for ExitCode, but that seems worth it to me.

Also, unrelated, if feels like this method would more accurately be named should_retry.

@davidMcneil davidMcneil force-pushed the dmcneil/retry-post-run-hook branch from 27439bb to 43aa4a8 Compare July 8, 2019 17:57
Copy link
Contributor

@christophermaier christophermaier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I think we need to ensure that the post-run hook is definitely stopped when we detach the Supervisor (otherwise, we stand to be back in the same situation that required #6717).

That raises an interesting question, though... if a post-run hook is failing (and thus re-running itself) when the Supervisor goes down for an update, do we restart it when the Supervisor comes back up, as we do with health checks? If so, then we'd need to somehow know which post-runs were failing and only bring them back up (otherwise, we could restart all post-run hooks, even ones that had successfully completed).

If we expect post-run hooks to be idempotent, that's probably OK, but since post-run hooks have only ever been run once until this point, I'm not sure how valid that is in practice (though I would hope it would be).

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service. That would certainly be the most straightforward to implement (we wouldn't need to persist some kind of state across Supervisor restarts). Perhaps that's what we do for now, with the potential to follow up in the future if necessary. Any other thoughts? @davidMcneil @baumanj @raskchanky

At the very least, we should ensure that the post-run future is stopped when we detach.

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@baumanj
Copy link
Contributor

baumanj commented Jul 17, 2019

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service.

I'm generally in favor of doing the simplest thing/YAGNI. The only thing to consider is whether it would be hard to change later. What's the impact on folks who get accustomed to it not restarting post-stop after reattach? My guess is nothing, but that's the thing I would think more about.

@@ -363,6 +366,7 @@ impl Service {
pub fn stop(&mut self,
shutdown_config: ShutdownConfig)
-> impl Future<Item = (), Error = Error> {
self.stop_post_run();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name stop_post_run makes it clear what is happening, so I think the comment here should be focused on why that's necessary.

@@ -40,7 +57,26 @@ impl<H> HookRunner<H> where H: Hook + Sync
pkg,
passwd }
}

pub fn retriable_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"retriable" is a totally valid spelling, though it might be worth considering using "retryable", so it turns up in a search for "retry"

hook_runner.clone()
.into_future()
.map(move |(exit_value, _duration)| {
if <H as Hook>::should_retry(&exit_value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

Suggested change
if <H as Hook>::should_retry(&exit_value) {
if H::should_retry(&exit_value) {

? It seems to work. Similarly with file_name. If you think H is unclear, it's fine to be explicit and call it HOOK. There's no need for template variables to be limited to single characters.

Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
Signed-off-by: David McNeil <mcneil.david2@gmail.com>
@davidMcneil davidMcneil force-pushed the dmcneil/retry-post-run-hook branch from 43aa4a8 to bf295ea Compare July 17, 2019 19:25
@davidMcneil
Copy link
Contributor Author

davidMcneil commented Jul 17, 2019

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service.

This should add the logic to detach the post-run hook retry future. It seems like storing the state of the post-run hook is the right solution, but is probably more work than can be justified without a specific request for it. I opened an issue here just so we are aware of it.

I snuck two refactoring commits on top of this change that are related but not core to the issue. I will happily revert them before merging if they are unhelpful.

Copy link
Contributor

@christophermaier christophermaier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks also for adding the follow-up issue for #6739.


/// Return a future that will shut down a service, performing any
/// necessary cleanup, and run its post-stop hook, if any.
pub fn stop(&mut self,
shutdown_config: ShutdownConfig)
-> impl Future<Item = (), Error = Error> {
self.stop_health_checks();
self.detach();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

if self.check_process() {
// If the service is not initialized and the process is still running, the Supervisor
// was restarted and we just have to reattach to the process.
if up {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a welcome refactoring 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

post-run hook doesn't re-run on exit 1
3 participants