Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to allow postrun to only run after a successful HealthCheck #5331

Closed

Conversation

jamessewell
Copy link
Contributor

Three changes:

  • postrun removed from run on init to run exactly once after HealthCheck is OK
  • re-run postrun after each service Start
  • changed self.health_check so it gets assigned (it didn't before)

This does somewhat change the behaviour of postrun, which used to only
run at init - this part can be backed out if needed.

Three changes:

- postrun removed from run on init to run exactly once after HealthCheck is OK
- re-run postrun after each service Start
- changed self.health_check so it gets assigned (it didn't before)

This does slightly change the behaviour of post-run - it will now re-run every
time the service starts. This can be backed off if needed.

Signed-off-by: James Sewell <james.sewell@gmail.com>
@thesentinels
Copy link
Contributor

Thanks for the pull request! Here is what will happen next:

  1. Your PR will be reviewed by the maintainers
  2. If everything looks good, one of them will approve it, and your PR will be merged.

Thank you for contributing!

@themightychris
Copy link
Contributor

themightychris commented Jul 13, 2018

@jamessewell I just got up and running with this, looks good!

So far I've verified that:

  • post-run doesn't execute until after the first successful health_check
  • if heatlh_check fails, post-run doesn't run
  • if health_check fails initially, and succeeds later, post-run does run

Some issues I see though:

  1. It takes quite a while for the first health_check to run
    • With post-run deferred until then this means it takes quite a while for the service to be finished starting up, while the supervisor reports it as up all along
    • Perhaps an initial health_check could be run immediately after run like post-run used to?
    • When a post-run hook is configured, could the supervisor defer considering the service up until it has run?
  2. post-run does not get re-run when a config change causes it to be recompiled and the service reloaded

Neither of these issues are show-stoppers for me though, I'd be happy with this PR getting merged as-is and consider it an improvement in post-run behavior with those issues being addressed later

@christophermaier
Copy link
Contributor

Still need to review this, but #5327 and #5326 are of tangential interest here.

(Not saying merging this would be blocked by those; merely spreading the word to people that would be interested).

@jamessewell
Copy link
Contributor Author

jamessewell commented Jul 14, 2018 via email

@baumanj
Copy link
Contributor

baumanj commented Jul 25, 2018

@jamessewell, are you still planning additional changes to this PR, or should it be considered final for review at this point?

@baumanj
Copy link
Contributor

baumanj commented Aug 1, 2018

I'm going to close this for now, so that our PR reminders don't think we're ignoring it. Whenever you're ready, feel free to reopen @jamessewell.

@baumanj baumanj closed this Aug 1, 2018
@jamessewell
Copy link
Contributor Author

jamessewell commented Aug 1, 2018 via email

@baumanj
Copy link
Contributor

baumanj commented Aug 1, 2018

@christophermaier, can you weigh in?

Sorry if I was premature, @jamessewell, I was reading

I’ll have a bit more of a poke

as an indication I should hold off on review until you had done more.

@baumanj baumanj reopened this Aug 1, 2018
@christophermaier
Copy link
Contributor

Sorry @jamessewell ... I'll get around to reviewing this soon.

@themightychris
Copy link
Contributor

I was just wondering, would it make sense to leave the current post-run behavior alone and introduce this as a new hook, post-up/post-available/post-online/post-healthy?

This would have the benefit of definitely not breaking any existing services, and allows post-run code to play a role in getting a service into the healthy state which might be important

@jamessewell
Copy link
Contributor Author

jamessewell commented Aug 2, 2018 via email

@christophermaier
Copy link
Contributor

@jamessewell Sorry it's taken a while to get to this!

I don't want to change the existing contract of post-run only running after the initial startup of the service, even though I'm not 100% convinced that it's current behavior is quite correct. There are a handful of non-trivial post-run hooks in the wild that this could negatively affect. There very well may be room for additional lifecycle hooks, though, as suggested above by @themightychris; I'd be very interested in your thoughts on some of those.

Also, given the relatively long time until a health-check initially fires, this could cause services to be in a potentially incomplete state for a long time. This wouldn't be a problem after #5327, and possibly #5326, are implemented, though, since services wouldn't be available to the rest of the network until they're healthy (and, presumably, after their post-run hook has been successful).

There's a lot of work currently planned around all the lifecycle hooks (#5318) (and I'm currently starting work on them), and I think this PR points out some additional real issues. As is, though, I think the potential for introducing additional instability and breakage is high, so I'm going to close this for now.

I appreciate the work and effort you've put in thus far, and apologize for taking as long as I have to give you some feedback on this.

@themightychris
Copy link
Contributor

@christophermaier deferring services being available to the rest of the network doesn't solve the use case here. All the related PRs you linked to are great and related, but they are far broader in scope than the use case at hand here:

I agree it's probably best not to change post-run, there's definitely a need for it and it would be bad to break all the existing ones, but a lot of those existing ones don't pass current standards anymore because they're working around not having a post-healthy hook

@jamessewell I think a very simple derivation of this PR would stop this gnarly from spreading and let packages that need this start passing CI again:

  • Call a new post-healthy hook after the first health_check pass
  • Extra awesome sauce: when a post-healthy hook is present, accelerate initial health_check frequency to every 2 seconds up to a limit of like 5 tries

@christophermaier
Copy link
Contributor

Posted in Slack, but copied here for visibility / posterity:

I think it may be the right way to go. At this point, though, I think it might be a better idea to have some kind of RFC / broader discussion around exactly what the ideal set of lifecycle hooks should be, and really nail down their semantics with a lot of input from the broader community. I'd like to avoid a situation where we keep adding hooks to get past the problem-of-the-day (which may be due to current Supervisor implementation details) and end up with an ultimately confused and incoherent global picture of things. If that sounds useful, I'll start some discussions today to see what we can do to get that started.

@christophermaier christophermaier added Type:Breaking Change PRs that are classified as a change to existing behavior and removed X-change labels Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type:Breaking Change PRs that are classified as a change to existing behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants