Incorporate health into gossip #5326

Open
christophermaier opened this issue Jul 12, 2018 · 6 comments
Labels
Focus:Supervisor Related to the Habitat Supervisor (core/hab-sup) component Stale Type: Bug Issues that describe broken functionality

Comments

@christophermaier
Contributor

christophermaier commented Jul 12, 2018

Health checks currently do not participate in the Habitat network; they're just run on a timer on the side of the main service loop, and report information only via the HTTP gateway. Other services in the Habitat network that depend on an unhealthy service have no way of knowing whether that service is actually healthy.

We'll need to think about how to best do this; too eagerly broadcasting that a service is unhealthy could have cascading effects, particularly if the failing health check is only a transient issue; we should have some kind of threshold like "the last X health checks failed; it's officially Unhealthy". This is basically the Service-level analog of the SWIM suspicion mechanism.
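A minimal sketch of the "last X health checks failed" threshold described above, under stated assumptions: `AdvertisedHealth`, `HealthDebouncer`, and `record` are hypothetical names, not existing Supervisor types, and recovering on the first passing check is just one possible policy.

```rust
/// Health as it might be advertised to peers (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AdvertisedHealth {
    Healthy,
    Unhealthy,
}

/// Debounces raw health check results so that a single transient
/// failure is not broadcast as "Unhealthy".
pub struct HealthDebouncer {
    unhealthy_after: u32,
    consecutive_failures: u32,
    advertised: AdvertisedHealth,
}

impl HealthDebouncer {
    /// `unhealthy_after` is the number of consecutive failures required
    /// before the service is considered Unhealthy (clamped to at least 1).
    pub fn new(unhealthy_after: u32) -> Self {
        HealthDebouncer {
            unhealthy_after: unhealthy_after.max(1),
            consecutive_failures: 0,
            advertised: AdvertisedHealth::Healthy,
        }
    }

    /// Records one health check result. Returns `Some(state)` only when
    /// the advertised state changes, i.e. when there is something new
    /// worth gossiping; otherwise returns `None`.
    pub fn record(&mut self, check_passed: bool) -> Option<AdvertisedHealth> {
        let next = if check_passed {
            // Assumption: one passing check is enough to recover.
            self.consecutive_failures = 0;
            AdvertisedHealth::Healthy
        } else {
            self.consecutive_failures = self.consecutive_failures.saturating_add(1);
            if self.consecutive_failures >= self.unhealthy_after {
                AdvertisedHealth::Unhealthy
            } else {
                self.advertised
            }
        };
        if next != self.advertised {
            self.advertised = next;
            Some(next)
        } else {
            None
        }
    }

    /// The state currently advertised to peers.
    pub fn advertised(&self) -> AdvertisedHealth {
        self.advertised
    }
}
```

Returning a value only on a state change keeps the amount of gossip proportional to transitions rather than to the number of checks run. A symmetric "healthy after N passes" threshold could be added the same way if flapping turns out to be a problem.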

We'll also need to think of how to best expose this in templating data so dependent services can take advantage of it. As stated in #5325, we currently conflate presence of a Supervisor with the presence/health of the services running on that Supervisor. In our templating data, we currently only present service group members that are either "alive" or "suspect" (these are, of course, Supervisor-level states, and not Service-level states); we can probably just flip this over to using presence (#5325) and / or health (this issue) and preserve the desired semantics (and actually be correct about it 😄 )
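To make the templating idea concrete, here is a rough sketch of selecting service group members by service-level presence and health instead of Supervisor-level "alive"/"suspect" state. Every name here (`Presence`, `Health`, `Member`, `members_for_templates`) is hypothetical, and treating `Unknown` health as usable is an assumption that would need discussion.

```rust
/// Service-level presence, as proposed in #5325 (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Presence {
    Present,
    Departed,
}

/// Service-level health, as proposed in this issue (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Unhealthy,
    Unknown,
}

/// A service group member as it might appear in templating data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Member {
    pub id: String,
    pub presence: Presence,
    pub health: Health,
}

/// Returns the members that dependent services should see in templates:
/// those that are present and not known to be unhealthy.
/// Assumption: `Unknown` (no check has run yet) is treated as usable.
pub fn members_for_templates(members: &[Member]) -> Vec<&Member> {
    members
        .iter()
        .filter(|m| m.presence == Presence::Present && m.health != Health::Unhealthy)
        .collect()
}
```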

We may also want to expose some service-level runtime configuration options to control the frequency and threshold of checking: hab svc load foo/bar --check-every=20s --unhealthy-after=3, or similar. Right now, health checks occur with a hard-coded 30-second period. We could add additional metadata to packages, but that may be too constraining; we'd likely still want to be able to override at runtime.
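As a sketch of what those runtime options could look like once parsed, assuming the proposed flags above (neither `--check-every` nor `--unhealthy-after` exists in `hab svc load` today): `HealthCheckOptions`, `parse_duration`, and `parse_options` are hypothetical names, the 30-second default mirrors the current hard-coded period, and a default threshold of 1 is an assumption.

```rust
use std::time::Duration;

/// Per-service health check settings (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealthCheckOptions {
    pub check_every: Duration,
    pub unhealthy_after: u32,
}

impl Default for HealthCheckOptions {
    fn default() -> Self {
        HealthCheckOptions {
            // Matches the current hard-coded 30-second period.
            check_every: Duration::from_secs(30),
            // Assumption: without an override, one failure is enough.
            unhealthy_after: 1,
        }
    }
}

/// Parses values such as "20s" or "2m" into a non-zero `Duration`.
fn parse_duration(s: &str) -> Result<Duration, String> {
    let (digits, multiplier): (&str, u64) = if let Some(d) = s.strip_suffix('s') {
        (d, 1)
    } else if let Some(d) = s.strip_suffix('m') {
        (d, 60)
    } else {
        return Err(format!("duration '{}' needs an 's' or 'm' suffix", s));
    };
    let n: u64 = digits
        .parse()
        .map_err(|_| format!("invalid duration '{}'", s))?;
    if n == 0 {
        return Err("duration must be greater than zero".to_string());
    }
    Ok(Duration::from_secs(n.saturating_mul(multiplier)))
}

/// Reads the two proposed flags from an argument list; unrelated
/// arguments are ignored and missing flags keep their defaults.
pub fn parse_options<'a, I>(args: I) -> Result<HealthCheckOptions, String>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut opts = HealthCheckOptions::default();
    for arg in args {
        if let Some(v) = arg.strip_prefix("--check-every=") {
            opts.check_every = parse_duration(v)?;
        } else if let Some(v) = arg.strip_prefix("--unhealthy-after=") {
            let n: u32 = v
                .parse()
                .map_err(|_| format!("invalid threshold '{}'", v))?;
            if n == 0 {
                return Err("--unhealthy-after must be at least 1".to_string());
            }
            opts.unhealthy_after = n;
        }
    }
    Ok(opts)
}
```

Keeping the defaults in one place would let package metadata supply a baseline later while runtime flags still override it.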

@jamessewell
Contributor

This is a great idea - I think if Habitat wants health checks to feel like less of a bolt-on it needs something like this.

Could implementing this also help lay the groundwork for using health checks to trigger a Leader/Follower topology failover (#3249)?

@christophermaier
Contributor Author

@jamessewell Yes, for sure this would be fundamental for #3249.

@stale

stale bot commented Apr 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Apr 2, 2020
@christophermaier
Contributor Author

Still needed.

@stale stale bot removed the Stale label May 5, 2020
@christophermaier christophermaier added Focus:Supervisor Related to the Habitat Supervisor (core/hab-sup) component Type: Bug Issues that describe broken functionality and removed A-supervisor labels Jul 24, 2020
@stale

stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale

stale bot commented Oct 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

@stale stale bot added the Stale label Oct 15, 2023