Incorporate health into gossip #5326
Comments
This is a great idea - I think if Habitat wants health checks to feel like less of a bolt-on, it needs something like this. Could implementing this help lay the groundwork for allowing health checks to be used to trigger a Leader/Follower topology failover as well (#3249)?
@jamessewell Yes, for sure this would be fundamental for #3249.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Still needed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Health checks currently do not participate in the Habitat network; they're just run on a timer on the side of the main service loop, and report information only via the HTTP gateway. Other services in the Habitat network that depend on an unhealthy service have no way of knowing whether that service is actually healthy.
We'll need to think about how to best do this; too eagerly broadcasting that a service is unhealthy could have cascading effects, particularly if the failing health check is only a transient issue; we should have some kind of threshold like "the last X health checks failed; it's officially Unhealthy". This is basically the Service-level analog of the SWIM suspicion mechanism.
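As a rough sketch of that threshold idea (illustrative names only, not actual Supervisor code), a service's health would only flip to Unhealthy, and thus be gossiped, after N consecutive failed checks, so a single transient failure never cascades:

```rust
// Hedged sketch: `ServiceHealth` and `HealthTracker` are hypothetical names.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ServiceHealth {
    Healthy,
    Suspect,   // recent failure(s), but below the threshold; not yet gossiped
    Unhealthy, // threshold reached; safe to broadcast to dependents
}

struct HealthTracker {
    consecutive_failures: u32,
    unhealthy_after: u32, // e.g. the proposed --unhealthy-after=3
}

impl HealthTracker {
    fn new(unhealthy_after: u32) -> Self {
        HealthTracker { consecutive_failures: 0, unhealthy_after }
    }

    /// Record one health check run and return the resulting service state.
    fn record(&mut self, check_passed: bool) -> ServiceHealth {
        if check_passed {
            self.consecutive_failures = 0;
            ServiceHealth::Healthy
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.unhealthy_after {
                ServiceHealth::Unhealthy
            } else {
                ServiceHealth::Suspect
            }
        }
    }
}
```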
We'll also need to think of how to best expose this in templating data so dependent services can take advantage of it. As stated in #5325, we currently conflate presence of a Supervisor with the presence/health of the services running on that Supervisor. In our templating data, we currently only present service group members that are either "alive" or "suspect" (these are, of course, Supervisor-level states, and not Service-level states); we can probably just flip this over to using presence (#5325) and / or health (this issue) and preserve the desired semantics (and actually be correct about it 😄 )
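To illustrate what the templating-data change might look like, a dependent service's view of a service group could be filtered on presence plus gossiped health rather than the Supervisor-level alive/suspect states. The types and field names below are hypothetical, not the Supervisor's real census structures:

```rust
// Illustrative only; not the actual census types.
#[derive(PartialEq)]
enum Health {
    Ok,
    Warning,
    Critical,
    Unknown,
}

struct CensusMember {
    member_id: String,
    present: bool,  // service presence, per #5325
    health: Health, // gossiped service health, per this issue
}

/// Members a dependent service would see in its templating data:
/// present and not failing their health checks.
fn renderable_members(members: &[CensusMember]) -> Vec<&CensusMember> {
    members
        .iter()
        .filter(|m| m.present && m.health != Health::Critical)
        .collect()
}
```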
We may also want to expose some service-level runtime configuration options to control the frequency and threshold of checking:
hab svc load foo/bar --check-every=20s --unhealthy-after=3
, or similar. Right now, health checks run on a hard-coded 30-second period. We could add additional metadata to packages, but that may be too constraining; we'd likely still want to be able to override it at runtime.
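For illustration only, those options might map onto a per-service config whose defaults preserve today's hard-coded 30-second period; the names follow the example command above and are not an existing Supervisor API:

```rust
use std::time::Duration;

// Sketch of the proposed per-service health check settings.
struct HealthCheckConfig {
    check_every: Duration, // --check-every; defaults to the current 30s period
    unhealthy_after: u32,  // --unhealthy-after; consecutive failures before Unhealthy
}

impl Default for HealthCheckConfig {
    fn default() -> Self {
        HealthCheckConfig {
            check_every: Duration::from_secs(30),
            unhealthy_after: 1, // today's effective behavior: one failure flips the state
        }
    }
}
```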