
Provide Health Checks for external Systems #2687

Open
ColinSullivan1 opened this issue Nov 10, 2021 · 4 comments
@ColinSullivan1
Member

Feature Request

A number of external systems, such as Kubernetes (K8s), could make use of introspection into the readiness and liveness of the NATS server (see #1903). This would provide a much better UX for K8s users and reduce errors arising from startup, resource issues, and loss of quorum.

Suggestions for Discussion

| Check | State | Suggested Endpoint |
| --- | --- | --- |
| Startup (Core NATS) | Servers are Ready | `/healthz?current-cluster-size=N` |
| Startup (JetStream) | Servers are Ready | `/healthz?current-cluster-size=N&quorum=true` |
| Readiness (Core NATS) | Accepting Client Connections | `/healthz` |
| Readiness (JetStream) | Accepting Client Connections & is a caught-up leader or follower* | `/healthz?isCandidate=false` |
| Liveness (Core NATS) | N/A (Server will stop on its own) | `/healthz` |
| Liveness (JetStream) | JetStream Subsystem is Running | `/healthz?js-enabled=true` |

*Not sure if readiness failures would prevent cluster traffic (TBD).

Liveness (JetStream) would fail if the JetStream subsystem has been shut down due to lack of resources, an unavailable PVC, etc.

The endpoints would return 200 if successful.
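As a sketch of how an external system might consume these endpoints (note: the query parameters above are this proposal's suggestions, not an implemented API; 8222 is the NATS server's default monitoring port):

```python
import urllib.parse
import urllib.request


def healthz_url(base: str, **params: str) -> str:
    """Build a /healthz URL with the suggested query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{base}/healthz" + (f"?{query}" if query else "")


def is_healthy(base: str, timeout: float = 2.0, **params: str) -> bool:
    """Return True if the monitoring endpoint answers HTTP 200, else False."""
    try:
        with urllib.request.urlopen(healthz_url(base, **params), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or non-2xx status
        return False
```

A Kubernetes probe (or any watchdog) would then map `is_healthy("http://localhost:8222", **{"js-enabled": "true"})` onto success/failure.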

Startup, Liveness, and Readiness probes would significantly help at startup and could reduce time to problem resolution (especially the Liveness probe when resources are constrained in K8s).

This may not be correct but I hope to spur discussion and am looking for community feedback in this area.

CC @nats-io/core @wallyqs @ripienaar

@c16a commented Dec 6, 2021

While the current probes simply check whether 8222:/ responds with a 200 OK or not, granular health statuses are always recommended.

It would be great if this is done. Can I help?

@ColinSullivan1
Member Author

We feel this will be resolved by #2815; additional testing will determine if that PR covers everything we need.

@c16a , thank you so much for the offer to help - much appreciated! I think we have this covered. There are plenty of issues open for contributors; don't hesitate to reach out if you find one that interests you.

@Himani2000

Do we have a health check endpoint which gives us the cluster health? Currently I have a NATS cluster with, say, n servers in it. My understanding is that /healthz gives the health of an individual NATS server and not of the entire cluster.

@derekcollison
Member

I would suggest using the NATS CLI. You need to have system account access.

```
nats server check meta --expect=9 --lag-critical=5 --seen-critical=1s
```
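Wired into Kubernetes, that check could serve as an exec-style liveness probe. The fragment below is an illustrative sketch only: it assumes the `nats` CLI binary and system-account credentials are available inside the container, and the thresholds are copied from the command above rather than tuned values.

```yaml
# Illustrative Kubernetes probe; not part of any published NATS chart.
livenessProbe:
  exec:
    command:
      - nats
      - server
      - check
      - meta
      - --expect=9
      - --lag-critical=5
      - --seen-critical=1s
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```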

@tbeets tbeets removed their assignment Dec 26, 2023