
post-run hook doesn't re-run on exit 1 #2364

Closed
jamessewell opened this issue May 15, 2017 · 7 comments · Fixed by #6705
Labels
Focus:Supervisor ProcessManagement (Related to how the Supervisor manages service processes), Focus:Supervisor (Related to the Habitat Supervisor (core/hab-sup) component), Type: Bug (Issues that describe broken functionality)

Comments

@jamessewell
Contributor

jamessewell commented May 15, 2017

Running hab from hab-0.22.1-20170509234454-x86_64-linux on CentOS 7.2, deploying into Docker.

It looks like when an unsuccessful exit code is returned by the post-run hook script, the script is not run again the way we see with init, run, etc.

The use case for this would be:

  1. Start the application in run
  2. Perform some one-off config in post-run, which requires the application to be up

The two options I can see here are to block in run until my service becomes available for config, or to have an exit 1 from the post-run script be treated as requiring another post-run pass.

If this is the intended functionality, ideas are welcome!

@srenatus
Contributor

I, too, had expected it to be re-run, and was surprised it didn't. So that's another teeny-tiny data point 😉

@jamessewell
Contributor Author

jamessewell commented Aug 14, 2017 via email

@rsertelon
Contributor

@jamessewell any news on that PR? Is this still relevant today? Thanks!

@themightychris
Contributor

Relevant case: habitat-sh/core-plans#1674 (comment)

@jsirex
Contributor

jsirex commented Jan 23, 2019

This is really a must-have.

Use case:

  1. I'm starting a cluster.
  2. Habitat discovery hasn't happened yet (no members; they are still starting).
  3. The post-run hook gets executed, but to no avail: the cluster isn't ready.
  4. Discovery happens.
  5. The service is restarted.
  6. We are now ready for the post-run hook, but it will not be executed again.

Also, it's not clear how long to wait in step 3: 1 second? 15 seconds? 8 minutes? Will the service actually restart while the post-run hook is running? Or does the hook get killed? Or will the service wait? Too many questions.

Exiting immediately with a non-zero code is better.

@christophermaier
Contributor

Taking some questions posed by @davidMcneil and putting them here (along with some thoughts) for posterity and broader discussion. As usual, all of this is open to discussion with anyone who has thoughts to contribute!

Should post-run only run after a successful health check?

I'd say "no" at this point, since it's conceivable (even likely?) that what is done in post-run could be required for a successful health check.

How many times should post-run be tried?

We don't currently have a way of restricting restarts of services, which makes me want something like Erlang's supervisor restart intensity configuration. Implementing something like that in a holistic way seems like a feature on its own, and adding ad-hoc partial throttling right now seems like it would complicate things unnecessarily.
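
For illustration only, here is a minimal sketch (in Rust, the Supervisor's language) of the kind of sliding-window restart-intensity check that Erlang supervisors offer. All names and limits here are hypothetical, not anything in the Habitat codebase:

```rust
// Hypothetical sketch: track restarts in a sliding time window and report
// whether the configured intensity has been exceeded.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct RestartIntensity {
    max_restarts: usize,
    window: Duration,
    history: VecDeque<Instant>,
}

impl RestartIntensity {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, history: VecDeque::new() }
    }

    /// Record a restart and return whether it is still within the limit.
    fn record_restart(&mut self, now: Instant) -> bool {
        // Drop restarts that have fallen out of the sliding window.
        while let Some(&oldest) = self.history.front() {
            if now.duration_since(oldest) > self.window {
                self.history.pop_front();
            } else {
                break;
            }
        }
        self.history.push_back(now);
        self.history.len() <= self.max_restarts
    }
}

fn main() {
    // Illustrative limits: at most 3 restarts per minute.
    let mut intensity = RestartIntensity::new(3, Duration::from_secs(60));
    for _ in 0..4 {
        println!("allowed: {}", intensity.record_restart(Instant::now()));
    }
}
```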

Should we use status codes to indicate if it should be rerun?

0: succeeded, do not retry
1: failed, retry
2: failed, do not retry

That sounds like a great idea, with the observation that the "failed do not retry" case seems linked to the "how many times should we retry" question above.
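
As a rough sketch of what that exit-code convention could look like on the Supervisor side (names are hypothetical and not part of the Habitat codebase):

```rust
// Hypothetical sketch: map a post-run hook's exit status to a retry
// decision, following the 0 / 1 / 2 convention proposed above.
#[derive(Debug, PartialEq)]
enum HookOutcome {
    Succeeded,     // exit 0: done, do not retry
    FailedRetry,   // exit 1: failed, schedule another run
    FailedNoRetry, // exit 2 (or anything else): failed, give up
}

fn classify_exit_code(code: i32) -> HookOutcome {
    match code {
        0 => HookOutcome::Succeeded,
        1 => HookOutcome::FailedRetry,
        _ => HookOutcome::FailedNoRetry,
    }
}

fn main() {
    assert_eq!(classify_exit_code(0), HookOutcome::Succeeded);
    assert_eq!(classify_exit_code(1), HookOutcome::FailedRetry);
    assert_eq!(classify_exit_code(2), HookOutcome::FailedNoRetry);
}
```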

Should there be a delay between tries or a backoff strategy?

Could be a good idea. It seems like we could ultimately model post-run execution the same way we model the health check (i.e., run repeatedly, with a potentially configurable duration between runs, until we get the outcome we want). That provides a nice place to localize delay and backoff logic. If implemented properly, it would be easy to drop this in later, pending further product investigation (i.e., do we want to allow these parameters to be customized on a per-service basis, or are they just hard-coded aspects of how the Supervisor operates?). @jsirex's comment above suggests that per-service customization could indeed be useful.
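
A minimal sketch of such a retry loop, assuming a hypothetical hook path and a hard-coded delay; this is not the actual Supervisor implementation:

```rust
// Hypothetical sketch: run a post-run hook repeatedly, with a fixed delay
// between attempts, until it reports success or asks not to be retried.
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

fn run_post_run_until_settled(hook_path: &str, delay: Duration) {
    loop {
        let status = Command::new(hook_path)
            .status()
            .expect("failed to spawn post-run hook");

        match status.code() {
            Some(0) => break,        // succeeded: stop retrying
            Some(1) => sleep(delay), // failed: retry after the delay
            _ => break,              // failed (or killed by a signal): do not retry
        }
    }
}

fn main() {
    // The hook path and delay are illustrative; the delay is exactly the
    // parameter that could be made configurable per service.
    run_post_run_until_settled("hooks/post-run", Duration::from_secs(15));
}
```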

Should failure to run a lifecycle hook influence the result of a health check?

Possibly, though I think an argument can be made that if a given hook fails to do something important for the health of a service, then the health check hook should be explicitly checking for that.

Given that, I'm inclined to leave out any explicit connection between other hooks and health checking for right now. If we get user feedback to the contrary, we can certainly revisit it.

Should similar rules apply to other lifecycle hooks?

Probably so, though it depends on the specifics of each hook. It would be interesting to see whether we could eventually model all hooks as customized implementations of some general "hook prototype" that has customization points for the behaviors you've mentioned.
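
One possible shape for such a "hook prototype", sketched as a Rust trait; all names and defaults are hypothetical and do not exist in core/hab-sup:

```rust
// Hypothetical sketch: a lifecycle-hook trait with customization points
// for the retry behaviors discussed above.
use std::time::Duration;

trait LifecycleHook {
    /// Name of the hook, e.g. "post-run" or "init".
    fn name(&self) -> &str;

    /// Whether a given exit code should trigger another attempt.
    fn should_retry(&self, exit_code: i32) -> bool;

    /// How long to wait before the next attempt (a place for backoff logic).
    fn retry_delay(&self, attempt: u32) -> Duration;
}

struct PostRun;

impl LifecycleHook for PostRun {
    fn name(&self) -> &str {
        "post-run"
    }

    fn should_retry(&self, exit_code: i32) -> bool {
        exit_code == 1
    }

    fn retry_delay(&self, attempt: u32) -> Duration {
        // Simple linear backoff, capped at one minute (illustrative only).
        Duration::from_secs((attempt as u64 * 5).min(60))
    }
}

fn main() {
    let hook = PostRun;
    println!("{} retries on exit 1: {}", hook.name(), hook.should_retry(1));
    println!("delay after 3 attempts: {:?}", hook.retry_delay(3));
}
```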

@christophermaier
Contributor

Also, @jsirex's comments above:

Will the service actually restart while the post-run hook is running? Or does the hook get killed? Or will the service wait? Too many questions.

suggest that some kind of overall state machine for a service will be important in order to properly track these kinds of transitions.
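
For discussion purposes, a very rough sketch of what such a state machine might look like; the states and the single transition shown are purely illustrative, not taken from the Habitat codebase:

```rust
// Hypothetical sketch: a coarse per-service state machine that would make
// the interaction between restarts and an in-flight post-run hook explicit.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServiceState {
    Down,
    Starting,       // init / run hooks in progress
    Up,             // main process running
    PostRunPending, // post-run scheduled or retrying
    Restarting,     // restart requested
}

/// Returns the next state, plus whether a post-run pass should be re-queued
/// once the service is back up.
fn on_restart_requested(current: ServiceState) -> (ServiceState, bool) {
    match current {
        // A restart while post-run is pending gives us one place to decide:
        // kill the hook, wait for it, or re-queue it after the restart.
        ServiceState::PostRunPending => (ServiceState::Restarting, true),
        _ => (ServiceState::Restarting, false),
    }
}

fn main() {
    let (next, requeue_post_run) = on_restart_requested(ServiceState::PostRunPending);
    println!("next state: {:?}, re-queue post-run: {}", next, requeue_post_run);
}
```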

@christophermaier added the Focus:Supervisor and Type: Bug labels and removed the A-supervisor label on Jul 24, 2020