
Provide Health Checks for external Systems #2687

Open
ColinSullivan1 opened this issue Nov 10, 2021 · 4 comments
@ColinSullivan1
Member

Feature Request

A number of external systems, such as Kubernetes (K8s), could make use of introspection into the readiness and liveness of the NATS server (see #1903). This would provide a much better UX for K8s users and reduce errors arising from startup, resource issues, and loss of quorum.

Suggestions for Discussion

| Check | State | Suggested Endpoint |
| --- | --- | --- |
| Startup (Core NATS) | Servers are Ready | `/healthz?current-cluster-size=N` |
| Startup (JetStream) | Servers are Ready | `/healthz?current-cluster-size=N&quorum=true` |
| Readiness (Core NATS) | Accepting Client Connections | `/healthz` |
| Readiness (JetStream) | Accepting Client Connections & is a caught-up leader or follower* | `/healthz?isCandidate=false` |
| Liveness (Core NATS) | N/A (Server will stop on its own) | `/healthz` |
| Liveness (JetStream) | JetStream Subsystem is Running | `/healthz?js-enabled=true` |

*Not sure if readiness failures would prevent cluster traffic (TBD).

Liveness (JetStream) would fail if the JetStream subsystem has been shut down due to lack of resources, an unavailable PVC, etc.

The endpoints would return 200 if successful.
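As a sketch of how an external system might consume these endpoints (note: the query parameters above are this proposal's suggestions, not an implemented API; 8222 is the NATS server's default monitoring port):

```python
import urllib.parse
import urllib.request


def healthz_url(base: str, **params: str) -> str:
    """Build a /healthz URL with the suggested query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{base}/healthz" + (f"?{query}" if query else "")


def is_healthy(base: str, timeout: float = 2.0, **params: str) -> bool:
    """Return True if the monitoring endpoint answers HTTP 200, else False."""
    try:
        with urllib.request.urlopen(healthz_url(base, **params), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or non-2xx status
        return False
```

A Kubernetes probe (or any watchdog) would then map `is_healthy("http://localhost:8222", **{"js-enabled": "true"})` onto success/failure.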

Startup, Liveness, and Readiness probes would significantly help at startup and could reduce time to problem resolution (especially the Liveness probe when resources are constrained in K8s).

This may not be correct but I hope to spur discussion and am looking for community feedback in this area.

CC @nats-io/core @wallyqs @ripienaar

@c16a commented Dec 6, 2021

While the current probes simply check whether 8222:/ responds with a 200 OK or not, granular health statuses are always recommended.

It would be great if this is done. Can I help?

@ColinSullivan1
Member Author

We feel this will be resolved by #2815; additional testing will determine if that PR covers everything we need.

@c16a , thank you so much for the offer to help - much appreciated! I think we have this covered. There are plenty of issues open for contributors; don't hesitate to reach out if you find one that interests you.

@Himani2000

Do we have a health check endpoint which gives us the cluster health? Currently I have a NATS cluster with, say, n servers in it. My understanding is that /healthz gives the health of an individual NATS server and not of the entire cluster.

@derekcollison
Member

I would suggest using the NATS CLI. You need to have system account access.

```
nats server check meta --expect=9 --lag-critical=5 --seen-critical=1s
```
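Wired into Kubernetes, that check could serve as an exec-style liveness probe. The fragment below is an illustrative sketch only: it assumes the `nats` CLI binary and system-account credentials are available inside the container, and the thresholds are copied from the command above rather than tuned values.

```yaml
# Illustrative Kubernetes probe; not part of any published NATS chart.
livenessProbe:
  exec:
    command:
      - nats
      - server
      - check
      - meta
      - --expect=9
      - --lag-critical=5
      - --seen-critical=1s
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```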

@tbeets tbeets removed their assignment Dec 26, 2023