Make post-run async and add retry logic #6705

davidMcneil · 2019-07-02T16:36:08Z

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

christophermaier · 2019-07-02T17:40:11Z

components/sup/src/manager/service.rs

@@ -363,6 +366,7 @@ impl Service {
    pub fn stop(&mut self,
                shutdown_config: ShutdownConfig)
                -> impl Future<Item = (), Error = Error> {
+        self.stop_post_run();


Might be worth adding a comment about why we're doing this here. Stopping health checks is (hopefully!) self-explanatory, since they're running all the time, but post-run is a tad different.

The name stop_post_run makes it clear what is happening, so I think the comment here should be focused on why that's necessary.

christophermaier · 2019-07-02T19:41:06Z

components/sup/src/manager/service/hook_runner.rs

+    // TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a
+    // hook. For example, a configurable backoff, timeout, or failure tracking. It would also be
+    // nice to merge this with the healthcheck hook retry logic.
+    pub fn loop_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) {


Might call this something like retriable_future to capture why it's looping.

christophermaier · 2019-07-02T19:42:29Z

components/sup/src/manager/service/hooks.rs

+        match exit_value.0 {
+            0 => false,
+            1 => true,
+            _ => false,


Just reading this code, I'd expect this case to be true... however, reading the documentation below indicates that any non-zero, non-one exit code means "fail without retrying"... having a comment about that at this code will be useful for anyone that comes through to work on this code in the future.

I am not sure what should be done here from a product perspective. I can think of three potential, reasonable options:

Option 1: exit code 0 means no retry; anything else retries.

Option 2: exit code 0 or 1 means no retry; anything else retries.

Option 3: exit code 1 retries; anything else does not retry.

There are obviously infinite other possibilities. Is there currently a way for the user to access the exit code of the post-run hook and use it? If not, and the exit code is solely used to determine retry behaviour, I now think I would change this to the simpler Option 1. Are there preferences for a particular set of logic here?

There isn't currently any mechanism by which a user can access the exit code of a hook and influence what is done as a result. Instead, each hook defines what effect the exit status will have in Hook::handle_exit. That in turn converts the status into the ExitValue type for the hook... in most cases, I don't think anything is really done with that value, but some hooks like health-check and suitability do make use of it.

I think option 1 above is fine.
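For comparison, the three options can be written side by side as predicates over the raw exit code. This is only a sketch: the function names are hypothetical, and it uses a plain `i32` rather than the `ExitCode` type from the diff.

```rust
/// Option 1: retry on any non-zero exit code.
fn retry_on_any_failure(code: i32) -> bool {
    code != 0
}

/// Option 2: 0 and 1 both mean "stop"; everything else retries.
fn retry_unless_zero_or_one(code: i32) -> bool {
    !matches!(code, 0 | 1)
}

/// Option 3: only exit code 1 retries (the behavior in the diff under review).
fn retry_only_on_one(code: i32) -> bool {
    code == 1
}
```

Option 1 is the only one where a hook author does not need to remember a special code: any failure retries.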

As an aside, now that I'm looking a bit more closely, I'm not entirely sure what the benefit of using our own ExitCode for an ExitValue is over std::process::ExitStatus. 🤔

If retrying hooks is a general thing we will end up wanting to do, we might consider not having a Hook::retry function, but instead modify the handle_exit function of retriable hooks to convert an ExitStatus into a new RetryDisposition (naming is hard) type that could be used as an ExitValue. This could then be used by what's calling it to figure out what to do next. All that could come later though (if at all).
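A rough sketch of what such a `RetryDisposition` type might look like, assuming the Option 1 semantics discussed above. The variant names and the `from_code`/`from_status` helpers are hypothetical, and treating a missing exit code (for example, a process killed by a signal) as retryable is an assumption made only for this illustration.

```rust
use std::process::ExitStatus;

/// Hypothetical outcome type a retryable hook could use as its `ExitValue`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RetryDisposition {
    /// The hook succeeded and should not be run again.
    Done,
    /// The hook failed and should be run again.
    Retry,
}

impl RetryDisposition {
    /// Option 1 semantics: exit code 0 is done, anything else retries.
    /// `None` (no exit code available) is treated as a retry; that is an
    /// assumption of this sketch, not something decided in the thread.
    fn from_code(code: Option<i32>) -> Self {
        match code {
            Some(0) => RetryDisposition::Done,
            _ => RetryDisposition::Retry,
        }
    }

    /// Convenience wrapper for a real process exit status.
    fn from_status(status: ExitStatus) -> Self {
        Self::from_code(status.code())
    }
}
```

The caller would then match on the disposition instead of asking the hook a separate yes/no question.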

christophermaier · 2019-07-02T19:46:21Z

www/source/partials/docs/_reference-hooks.html.md.erb

@@ -131,6 +131,8 @@ The post run hook will get executed after initial startup.

 For many data services creation of specific users / roles or datastores is required. This needs to happen once the service has already started.

+The retry behavior of this hook is determined by its exit code. Exit code `0` indicates success, and the hook will not be run again. Exit code `1` indicates a failure with a retry. The `post-run` hook will immediately be executed again. Continually exit with `1` to keep retrying the `post-run` hook. Any other exit code indicates failure without a retry.


Worth mentioning that the service will continue running if the post-run hook exits with a "fail but don't restart" code. This is what it did before (i.e., if the post-run hook failed, that'd be it and the service would go on its merry way), but it would be good to have that explicitly documented.

Long-term we'll want to think about whether or not this is the "correct" behavior, but that brings to mind Erlang-style restart directives and bigger questions that I think would be best to address deliberately and separately.

christophermaier · 2019-07-02T20:12:15Z

components/sup/src/manager/service.rs

+    /// Consumes the handle to that future in the process.
+    fn stop_post_run(&mut self) {
+        if let Some(h) = self.post_run_handle.take() {
+            h.terminate();
        }
    }

    // This hook method looks different from all the others because
    // it's the only one that runs async right now.


We can probably remove this particular comment, since it's no longer the only async hook 😂

christophermaier · 2019-07-02T20:19:00Z

components/sup/src/manager/service/hook_runner.rs

+            hook_runner.clone()
+                       .into_future()
+                       .map(move |(exit_value, _duration)| {
+                           if hook_runner.hook.retry(exit_value) {


I like how you added a retry function to the Hook trait to allow for the (temporal) future where we can start to generalize this kind of hook behavior, rather than tightly-coupling this to the post-run hook specifically. We don't lose anything, but we have a nice foundation from which to build.

It would be helpful to add some logging here (probably debug!-level) indicating that we're rerunning the offending hook.
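To show where such a log line would sit, here is a simplified, synchronous sketch of the retry loop. It is not the futures-based code in the diff: `run_with_retry` and its parameters are hypothetical, and `eprintln!` stands in for a `debug!` call.

```rust
/// Runs `run_hook` until `should_retry` says to stop, logging each rerun.
/// Returns how many times the hook was executed.
fn run_with_retry<R, S>(name: &str, mut run_hook: R, should_retry: S) -> u32
where
    R: FnMut() -> i32,
    S: Fn(i32) -> bool,
{
    let mut runs = 0;
    loop {
        let code = run_hook();
        runs += 1;
        if !should_retry(code) {
            return runs;
        }
        // Stand-in for a debug-level log message.
        eprintln!("{} hook exited with code {}; rerunning", name, code);
    }
}
```

The message is emitted only on the retry path, so a hook that succeeds on its first run produces no extra log noise.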

christophermaier · 2019-07-02T20:43:13Z

components/common/src/templating/hooks.rs

@@ -285,6 +285,8 @@ pub trait Hook: fmt::Debug + Sized + Send {
                       status: ExitStatus)
                       -> Self::ExitValue;

+    fn retry(&self, _exit_value: Self::ExitValue) -> bool { false }


Always good to add some documentation comments 😄
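Something along these lines, perhaps. The trait below is a cut-down stand-in for the real `Hook` trait (which has more bounds and methods), and the two implementors are hypothetical; only the doc comment on `retry` is the point.

```rust
/// Cut-down stand-in for the `Hook` trait in the diff.
trait Hook {
    /// The value a hook's exit status is converted into.
    type ExitValue;

    /// Returns `true` if the hook should be run again after finishing
    /// with `exit_value`.
    ///
    /// Defaults to `false`, so a hook runs once unless it opts in to retries.
    fn retry(&self, _exit_value: Self::ExitValue) -> bool {
        false
    }
}

/// Hypothetical hook that keeps the default and never retries.
struct RunOnce;

impl Hook for RunOnce {
    type ExitValue = i32;
}

/// Hypothetical hook that retries only on exit code 1.
struct RetryOnOne;

impl Hook for RetryOnOne {
    type ExitValue = i32;

    fn retry(&self, exit_value: i32) -> bool {
        exit_value == 1
    }
}
```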

baumanj

This looks great. All the comments are minor tweaks that you can take or leave, with the exception of the TODO, which should be addressed before merge and is why this review requests changes.

baumanj · 2019-07-02T20:32:50Z

components/sup/src/manager/service.rs

+    /// Consumes the handle to that future in the process.
+    fn stop_post_run(&mut self) {
+        if let Some(h) = self.post_run_handle.take() {
+            h.terminate();
        }
    }

    // This hook method looks different from all the others because
    // it's the only one that runs async right now.


This comment isn't true any longer, is it?

baumanj · 2019-07-02T20:34:27Z

components/sup/src/manager/service.rs

+    /// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This
+    /// will stop this retry loop regardless of `post-run`'s exit code.
+    ///
+    /// Consumes the handle to that future in the process.


Is there a way to indicate that in the name of the method rather than the comment? Or maybe a more interesting question is: does the caller of the method need to know this?

baumanj · 2019-07-02T20:36:10Z

components/sup/src/manager/service.rs

+    }
+
+    /// Stop the `post-run` future. The `post-run` hook has retry logic based on its exit code. This
+    /// will stop this retry loop regardless of `post-run`'s exit code.


I'd hesitate to comment about the behavior of other code in comments, as they're extra likely to become incorrect. See the comment on post_stop for instance. With the second sentence removed, I think this comment is great.

baumanj · 2019-07-02T20:41:37Z

components/sup/src/manager/service/hook_runner.rs

+
+    // TODO (DM): Make `HookRunner` and this method allow arbirary configuration of how to run a
+    // hook. For example, a configurable backoff, timeout, or failure tracking. It would also be
+    // nice to merge this with the healthcheck hook retry logic.


Just a note to make sure this TODO is resolved before merge

Personally, this sounds a bit like YAGNI. Do we have a use case that demands this today? If not, I'd leave it out.

I agree. I will remove this comment. I wanted a way to signify to reviewers that this is where we could add this functionality and get their feedback. In that way the comment has served its purpose 😄. Maybe a better method would have been to just leave a comment on the PR?

I do think it would be useful to merge this logic with the health check retry logic. There is also an issue about adding an initial delay before starting a service, and many of those other features are currently TODOs in other parts of the code. Ultimately, I think we will end up implementing some of this logic, but it is outside the scope of this PR.

baumanj · 2019-07-02T20:50:15Z

components/sup/src/manager/service/hooks.rs

+            0 => false,
+            1 => true,
+            _ => false,
+        }


Why not

exit_value.0 == 1

?

Even better, define a constant (at the fn or type level, either seems good) and you get:

exit_value == SHOULD_RETRY

You'll have to derive PartialEq for ExitCode, but that seems worth it to me.

Also, unrelated, it feels like this method would more accurately be named should_retry.
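Put together, the suggestion might look roughly like this. It is a sketch, not the PR's code: `ExitCode` is modeled as a tuple struct around `i32` (the diff only shows that it has a `.0` field), and `should_retry` is written as a free function for brevity.

```rust
/// Stand-in for the Supervisor's `ExitCode`; the inner `i32` is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ExitCode(i32);

/// The one exit code that asks for the hook to be run again.
const SHOULD_RETRY: ExitCode = ExitCode(1);

/// Replaces the three-arm `match` with a single comparison.
fn should_retry(exit_value: &ExitCode) -> bool {
    *exit_value == SHOULD_RETRY
}
```

Deriving `PartialEq` is what makes the `==` comparison against the constant possible.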

christophermaier

This looks good, but I think we need to ensure that the post-run hook is definitely stopped when we detach the Supervisor (otherwise, we stand to be back in the same situation that required #6717).

That raises an interesting question, though... if a post-run hook is failing (and thus re-running itself) when the Supervisor goes down for an update, do we restart it when the Supervisor comes back up, as we do with health checks? If so, then we'd need to somehow know which post-runs were failing and only bring them back up (otherwise, we could restart all post-run hooks, even ones that had successfully completed).

If we expect post-run hooks to be idempotent, that's probably OK, but since post-run hooks have only ever been run once until this point, I'm not sure how valid that is in practice (though I would hope it would be).

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service. That would certainly be the most straightforward to implement (we wouldn't need to persist some kind of state across Supervisor restarts). Perhaps that's what we do for now, with the potential to follow up in the future if necessary. Any other thoughts? @davidMcneil @baumanj @raskchanky

At the very least, we should ensure that the post-run future is stopped when we detach.
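As an illustration of that last point, a `detach` method could stop both background futures in one place. Everything here is a simplified stand-in: the real `FutureHandle` and `Service` are more involved, the `health_check_handle` field name is hypothetical, and the atomic flag exists only so the sketch can observe termination.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the Supervisor's `FutureHandle`; it only records termination.
struct FutureHandle {
    terminated: Arc<AtomicBool>,
}

impl FutureHandle {
    /// Consumes the handle and signals the associated future to stop.
    fn terminate(self) {
        self.terminated.store(true, Ordering::SeqCst);
    }
}

/// Stand-in for `Service`, reduced to the two handles of interest.
struct Service {
    health_check_handle: Option<FutureHandle>,
    post_run_handle: Option<FutureHandle>,
}

impl Service {
    fn stop_health_checks(&mut self) {
        // `take()` leaves `None` behind, so a second call is a no-op.
        if let Some(h) = self.health_check_handle.take() {
            h.terminate();
        }
    }

    fn stop_post_run(&mut self) {
        if let Some(h) = self.post_run_handle.take() {
            h.terminate();
        }
    }

    /// Stops every background future tied to this service.
    fn detach(&mut self) {
        self.stop_health_checks();
        self.stop_post_run();
    }
}
```

Because each handle is taken out of its `Option`, calling `detach` more than once is harmless.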

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

baumanj · 2019-07-17T16:48:33Z

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service.

I'm generally in favor of doing the simplest thing/YAGNI. The only thing to consider is whether it would be hard to change later. What's the impact on folks who get accustomed to it not restarting post-run after reattach? My guess is nothing, but that's the thing I would think more about.

baumanj · 2019-07-17T16:29:51Z

components/sup/src/manager/service.rs

@@ -363,6 +366,7 @@ impl Service {
    pub fn stop(&mut self,
                shutdown_config: ShutdownConfig)
                -> impl Future<Item = (), Error = Error> {
+        self.stop_post_run();


The name stop_post_run makes it clear what is happening, so I think the comment here should be focused on why that's necessary.

baumanj · 2019-07-17T16:38:27Z

components/sup/src/manager/service/hook_runner.rs

@@ -40,7 +57,26 @@ impl<H> HookRunner<H> where H: Hook + Sync
                     pkg,
                     passwd }
    }
+
+    pub fn retriable_future(&self) -> (FutureHandle, impl Future<Item = (), Error = ()>) {


"retriable" is a totally valid spelling, though it might be worth considering using "retryable", so it turns up in a search for "retry"

baumanj · 2019-07-17T16:42:39Z

components/sup/src/manager/service/hook_runner.rs

+            hook_runner.clone()
+                       .into_future()
+                       .map(move |(exit_value, _duration)| {
+                           if <H as Hook>::should_retry(&exit_value) {


Why not just

Suggested change

if <H as Hook>::should_retry(&exit_value) {

if H::should_retry(&exit_value) {

? It seems to work. Similarly with file_name. If you think H is unclear, it's fine to be explicit and call it HOOK. There's no need for template variables to be limited to single characters.
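A small self-contained check that the two spellings resolve to the same associated function. The trait is a cut-down stand-in for the real `Hook`, and `PostRun`, `fully_qualified`, and `shorthand` are names made up for this sketch.

```rust
/// Cut-down stand-in for the `Hook` trait.
trait Hook {
    type ExitValue;

    /// Associated function (no `self`), as in the revised diff.
    fn should_retry(_exit_value: &Self::ExitValue) -> bool {
        false
    }
}

/// Hypothetical hook that retries on exit code 1.
struct PostRun;

impl Hook for PostRun {
    type ExitValue = i32;

    fn should_retry(exit_value: &i32) -> bool {
        *exit_value == 1
    }
}

/// Fully qualified form, as written in the diff.
fn fully_qualified<H: Hook>(exit_value: &H::ExitValue) -> bool {
    <H as Hook>::should_retry(exit_value)
}

/// Shorthand form from the suggestion.
fn shorthand<H: Hook>(exit_value: &H::ExitValue) -> bool {
    H::should_retry(exit_value)
}
```

The fully qualified form is only needed when the call would otherwise be ambiguous, for example when two traits in scope provide a function with the same name.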

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

davidMcneil · 2019-07-17T19:28:23Z

Alternatively, we could just not restart post-run hooks when the Supervisor reattaches to a service.

This should add the logic to detach the post-run hook retry future. It seems like storing the state of the post-run hook is the right solution, but is probably more work than can be justified without a specific request for it. I opened an issue here just so we are aware of it.

I snuck two refactoring commits on top of this change that are related but not core to the issue. I will happily revert them before merging if they are unhelpful.

christophermaier

Looks good! Thanks also for adding the follow-up issue, #6739.

christophermaier · 2019-07-29T14:51:49Z

components/sup/src/manager/service.rs


    /// Return a future that will shut down a service, performing any
    /// necessary cleanup, and run its post-stop hook, if any.
    pub fn stop(&mut self,
                shutdown_config: ShutdownConfig)
                -> impl Future<Item = (), Error = Error> {
-        self.stop_health_checks();
+        self.detach();


christophermaier · 2019-07-29T14:56:51Z

components/sup/src/manager/service.rs

-            if self.check_process() {
+            // If the service is not initialized and the process is still running, the Supervisor
+            // was restarted and we just have to reattach to the process.
+            if up {


This is a welcome refactoring 👍

davidMcneil requested review from baumanj, christophermaier and raskchanky as code owners July 2, 2019 16:36

davidMcneil self-assigned this Jul 2, 2019

christophermaier reviewed Jul 2, 2019

baumanj suggested changes Jul 2, 2019

davidMcneil force-pushed the dmcneil/retry-post-run-hook branch from 27439bb to 43aa4a8 Compare July 8, 2019 17:57

davidMcneil requested review from baumanj and christophermaier July 17, 2019 12:42

christophermaier suggested changes Jul 17, 2019

davidMcneil added 2 commits July 17, 2019 12:48

Make post-run async and add retry logic

bc0e28d

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

PR feedback

5d72598

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

baumanj approved these changes Jul 17, 2019

PR feedback

2a5db8c

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

davidMcneil mentioned this pull request Jul 17, 2019

Restart post-run retries following a supervisor restart #6739

Closed

davidMcneil added 3 commits July 17, 2019 14:36

Stop post-run future when detaching

82838cc

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

Use detach when stopping a service

17c35bb

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

Refactor and add comments to execute_hooks

bf295ea

Signed-off-by: David McNeil <mcneil.david2@gmail.com>

davidMcneil force-pushed the dmcneil/retry-post-run-hook branch from 43aa4a8 to bf295ea Compare July 17, 2019 19:25

davidMcneil closed this Jul 17, 2019

davidMcneil reopened this Jul 17, 2019

davidMcneil requested a review from christophermaier July 17, 2019 19:39

christophermaier approved these changes Jul 29, 2019

davidMcneil merged commit 7330768 into master Jul 29, 2019

davidMcneil deleted the dmcneil/retry-post-run-hook branch July 29, 2019 16:43

davidMcneil mentioned this pull request Jul 29, 2019

Run post-run hook asynchronously #5957

Closed

jaym mentioned this pull request Feb 14, 2020

[elasticsearch] configure the default replicas to 0 chef/automate#2871

Merged

MindNumbing mentioned this pull request Jun 3, 2020

[zerotier] Add zerotier plan habitat-sh/core-plans#1674

Closed

2 tasks
