Update to allow postrun to only run after a successful HealthCheck #5331

jamessewell · 2018-07-13T02:44:56Z

Three changes:

postrun removed from run on init to run exactly once after HealthCheck is OK
re-run postrun after each service Start
changed self.health_check so it gets assigned (it didn't before)

This does somewhat change the behaviour of postrun, which used to only
run at init - this part can be backed out if needed.

Three changes:
- postrun removed from run on init to run exactly once after HealthCheck is OK
- re-run postrun after each service Start
- changed self.health_check so it gets assigned (it didn't before)

This does slightly change the behaviour of post-run - it will now re-run every time the service starts. This can be backed off if needed.

Signed-off-by: James Sewell <james.sewell@gmail.com>
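A minimal sketch of the gating described above, written in Python rather than taken from the Supervisor's code. The `Service` class, its method names, and the string statuses are hypothetical; they only illustrate the intended ordering, where post-run is held back until the first OK health check after each start.

```python
class Service:
    """Toy model of the post-run gating described in this PR (hypothetical names)."""

    def __init__(self):
        self.health_check = "UNKNOWN"   # last recorded health check result
        self.post_run_pending = False   # True between a start and its post-run
        self.post_run_count = 0         # how many times post-run has "run"

    def start(self):
        # Each start re-arms post-run instead of running it immediately.
        self.health_check = "UNKNOWN"
        self.post_run_pending = True

    def record_health_check(self, status):
        # Store the result, then fire post-run on the first OK after a start.
        self.health_check = status
        if status == "OK" and self.post_run_pending:
            self.post_run_pending = False
            self.post_run_count += 1    # stand-in for executing the post-run hook


svc = Service()
svc.start()
for status in ["FAILING", "OK", "OK"]:
    svc.record_health_check(status)
runs_after_first_start = svc.post_run_count

svc.start()                             # e.g. the service is started again
svc.record_health_check("OK")
runs_after_restart = svc.post_run_count
```

In this model a failing check never triggers post-run, repeated OK results do not re-trigger it, and a new start arms it again.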

thesentinels · 2018-07-13T02:44:58Z

Thanks for the pull request! Here is what will happen next:

Your PR will be reviewed by the maintainers
If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

themightychris · 2018-07-13T18:39:01Z

@jamessewell I just got up and running with this, looks good!

So far I've verified that:

post-run doesn't execute until after the first successful health_check
if health_check fails, post-run doesn't run
if health_check fails initially, and succeeds later, post-run does run

Some issues I see though:

It takes quite a while for the first health_check to run
- With post-run deferred until then this means it takes quite a while for the service to be finished starting up, while the supervisor reports it as up all along
- Perhaps an initial health_check could be run immediately after run like post-run used to?
- When a post-run hook is configured, could the supervisor defer considering the service up until it has run?
post-run does not get re-run when a config change causes it to be recompiled and the service reloaded

Neither of these issues is a show-stopper for me, though. I'd be happy with this PR getting merged as-is and would consider it an improvement in post-run behavior, with those issues addressed later.

christophermaier · 2018-07-13T21:25:45Z

Still need to review this, but #5327 and #5326 are of tangential interest here.

(Not saying merging this would be blocked by those; merely spreading the word to people that would be interested).

jamessewell · 2018-07-14T03:08:43Z

Hi Chris,

I think it’s actually doing a first health check quickly, which fails, and then it’s a 30-second standoff until the next one. If you monitor the health check state it goes UNKNOWN, then FAILING, then OK.

I’ll have a bit more of a poke, but the proposed changes below are much better than this solution!

Cheers,
James Sewell
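To put a number on that delay, here is a rough sketch. It assumes the first check runs immediately and later checks are spaced a fixed 30 seconds apart, which is only what this comment suggests; the function name and parameters are hypothetical and not part of the Supervisor.

```python
def seconds_until_post_run(results, interval=30.0):
    """Return the elapsed seconds at the first OK result, or None if none is OK.

    Assumes the first check happens at t=0 and each later check follows
    `interval` seconds after the previous one (assumptions, not measured).
    """
    for attempt, status in enumerate(results):
        if status == "OK":
            return attempt * interval
    return None


# The sequence described above: one quick failing check, then an OK check.
delay = seconds_until_post_run(["FAILING", "OK"])
never = seconds_until_post_run(["FAILING", "FAILING"])
```

Under these assumptions, a single early failure pushes post-run back by a full interval, even if the service became ready a second after the first check.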


baumanj · 2018-07-25T15:50:09Z

@jamessewell, are you still planning additional changes to this PR, or should it be considered final for review at this point?

baumanj · 2018-08-01T13:35:48Z

I'm going to close this for now, so that our PR reminders don't think we're ignoring it. Whenever you're ready, feel free to reopen @jamessewell.

jamessewell · 2018-08-01T13:58:17Z

I’m not really sure what to do about this one - I was hoping Chris would chime back in. It works, but the other proposed (larger) solution is better.


baumanj · 2018-08-01T14:18:02Z

@christophermaier, can you weigh in?

Sorry if I was premature, @jamessewell, I was reading

I’ll have a bit more of a poke

as an indication I should hold off on review until you had done more.

christophermaier · 2018-08-01T14:42:32Z

Sorry @jamessewell ... I'll get around to reviewing this soon.

themightychris · 2018-08-02T14:38:00Z

I was just wondering, would it make sense to leave the current post-run behavior alone and introduce this as a new hook, post-up/post-available/post-online/post-healthy?

This would have the benefit of definitely not breaking any existing services, and allows post-run code to play a role in getting a service into the healthy state which might be important

jamessewell · 2018-08-02T22:34:15Z

That makes sense - although I do wonder what the point of the old post run would be apart from backwards compatibility?


christophermaier · 2018-08-10T14:53:51Z

@jamessewell Sorry it's taken a while to get to this!

I don't want to change the existing contract of post-run only running after the initial startup of the service, even though I'm not 100% convinced that its current behavior is quite correct. There are a handful of non-trivial post-run hooks in the wild that this could negatively affect. There very well may be room for additional lifecycle hooks, though, as suggested above by @themightychris; I'd be very interested in your thoughts on some of those.

Also, given the relatively long time until a health-check initially fires, this could cause services to be in a potentially incomplete state for a long time. This wouldn't be a problem after #5327, and possibly #5326, are implemented, though, since services wouldn't be available to the rest of the network until they're healthy (and, presumably, after their post-run hook has been successful).

There's a lot of work currently planned around all the lifecycle hooks (#5318) (and I'm currently starting work on them), and I think this PR points out some additional real issues. As is, though, I think the potential for introducing additional instability and breakage is high, so I'm going to close this for now.

I appreciate the work and effort you've put in thus far, and apologize for taking as long as I have to give you some feedback on this.

themightychris · 2018-08-10T15:58:13Z

@christophermaier deferring services being available to the rest of the network doesn't solve the use case here. The PRs you linked to are all great and related, but they are far broader in scope than the use case at hand here:

CI tests currently flag sleep in post-run
There is a common need for packages to run hook code after a service comes up to complete configuration. These are things you might otherwise want to do in init, but they need to issue commands to a running service rather than template config files
The current ways to do that, by example/precedent/documentation, are extremely gnarly. I dug through habitat history and tried both on the way to [zerotier] Add zerotier plan core-plans#1674:
- init hooks that run a service once, background it and store its pid, wait for it to come up, do their init, then kill it by pid and start it a second time
- After post-run became a thing people started doing sleep loops there for this
Both are blocked by CI policy now, and they probably should be
Both tend towards replicating or stripping down the code in health_check
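For illustration, the sleep-loop workaround has roughly this shape. Real hooks are shell scripts; this is a Python rendering with hypothetical names (`wait_then_configure`, `is_up`, `configure`), not code from any existing plan.

```python
import time


def wait_then_configure(is_up, configure, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll until `is_up()` returns True, then call `configure()` once.

    `is_up` is the part that ends up duplicating whatever the package's
    health_check hook already does. Returns False if the service never
    came up within `attempts` polls.
    """
    for _ in range(attempts):
        if is_up():
            configure()
            return True
        sleep(delay)
    return False


# Simulated run: the service reports up on the third poll; sleeping is stubbed out.
probes = iter([False, False, True])
calls = []
ok = wait_then_configure(
    lambda: next(probes),
    lambda: calls.append("configured"),
    sleep=lambda seconds: None,
)
```

A hook that fires after the first passing health_check would make the polling loop and the duplicated probe unnecessary, leaving only the `configure` step.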

I agree it's probably best not to change post-run, there's definitely a need for it and it would be bad to break all the existing ones, but a lot of those existing ones don't pass current standards anymore because they're working around not having a post-healthy hook

@jamessewell I think a very simple derivation of this PR would stop this gnarliness from spreading and let packages that need this start passing CI again:

Call a new post-healthy hook after the first health_check pass
Extra awesome sauce: when a post-healthy hook is present, accelerate initial health_check frequency to every 2 seconds up to a limit of like 5 tries
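One way to picture the accelerated schedule is the sketch below. The 2-second interval and 5 tries come from the suggestion above; the 30-second normal interval is an assumption based on the earlier comment in this thread, and `health_check_delays` is a hypothetical name, not an existing Supervisor function.

```python
from itertools import islice


def health_check_delays(has_post_healthy, fast_interval=2.0, fast_tries=5,
                        normal_interval=30.0):
    """Yield the delay in seconds before each successive health check.

    With a post-healthy hook present, the first `fast_tries` checks are
    spaced `fast_interval` apart; after that (or with no hook at all) the
    normal interval applies indefinitely.
    """
    if has_post_healthy:
        for _ in range(fast_tries):
            yield fast_interval
    while True:
        yield normal_interval


with_hook = list(islice(health_check_delays(True), 7))
without_hook = list(islice(health_check_delays(False), 3))
```

A caller would presumably stop consuming the fast delays as soon as a check passes and post-healthy has run; that part is left out here to keep the schedule itself visible.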

christophermaier · 2018-08-28T13:37:00Z

Posted in Slack, but copied here for visibility / posterity:

I think it may be the right way to go. At this point, though, I think it might be a better idea to have some kind of RFC / broader discussion around exactly what the ideal set of lifecycle hooks should be, and really nail down their semantics with a lot of input from the broader community. I'd like to avoid a situation where we keep adding hooks to get past the problem-of-the-day (which may be due to current Supervisor implementation details) and end up with an ultimately confused and incoherent global picture of things. If that sounds useful, I'll start some discussions today to see what we can do to get that started.

jamessewell requested review from baumanj, christophermaier, fnichol and reset as code owners July 13, 2018 02:44

themightychris mentioned this pull request Jul 13, 2018

[zerotier] Add zerotier plan habitat-sh/core-plans#1674

Closed

2 tasks

baumanj mentioned this pull request Jul 16, 2018

Update to allow postrun to only run after a successful HealthCheck #5330

Closed

baumanj added the X-change label Jul 16, 2018

baumanj closed this Aug 1, 2018

baumanj reopened this Aug 1, 2018

christophermaier closed this Aug 10, 2018

christophermaier added the Type:Breaking Change label (PRs that are classified as a change to existing behavior) and removed the X-change label Jul 24, 2020
