Verify NATS health checks #14964

Closed
3 tasks done
k15r opened this issue Jul 29, 2022 · 9 comments


k15r commented Jul 29, 2022

Description

In a recent incident, the NATS server reported itself as healthy (via its health checks). Unfortunately, only the core NATS control plane was working properly; JetStream itself could not be accessed.

Acceptance

  • Investigate why the server's health check does not fail if JetStream is inaccessible.
  • Investigate whether this behaviour can be changed.
  • Investigate (with NATS-io) whether automatically killing nodes in such a scenario could have any negative effects on the cluster.
k15r added the area/eventing label (issues or PRs related to eventing) on Jul 29, 2022

k15r commented Jul 29, 2022

TB: 3d

mfaizanse self-assigned this Jul 29, 2022
mfaizanse commented Aug 1, 2022

  • Deployed Kyma with the production profile.

Action 1

  • Deleted the /data/jetstream directory in the eventing-nats-0 Pod.
  • Published an event.
  • JetStream was then disabled in the eventing-nats-0 Pod:
[30] 2022/08/01 14:27:46.616717 [ERR] JetStream out of resources, will be DISABLED
[30] 2022/08/01 14:27:46.616814 [WRN] JetStream initiating meta leader transfer
[30] 2022/08/01 14:27:46.616941 [INF] JetStream cluster no metadata leader
[30] 2022/08/01 14:27:48.617328 [WRN] JetStream timeout waiting for meta leader transfer
[30] 2022/08/01 14:27:48.618313 [INF] Initiating JetStream Shutdown...
[30] 2022/08/01 14:27:48.618487 [INF] JetStream Shutdown
  • The health probe http://localhost:8222/healthz still returns 200 OK.
  • The JetStream info endpoint http://localhost:8222/jsz returns data.disabled: true:
{
   "server_id":"NBKAYBIGPZ73E4LYEY64FU2BWTCGO53TTEYLJ5NVCP3DTM4H2DNQZFL6",
   "now":"2022-08-01T14:57:56.528474816Z",
   "disabled":true,
   "config":{
      "max_memory":0,
      "max_storage":0
   },
   "memory":0,
   "storage":0,
   "reserved_memory":0,
   "reserved_storage":0,
   "accounts":0,
   "ha_assets":0,
   "api":{
      "total":0,
      "errors":0
   },
   "streams":0,
   "consumers":0,
   "messages":0,
   "bytes":0
}
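
The /jsz check above can be scripted. The following is a minimal Python sketch, not part of the original investigation. The function name jetstream_disabled is hypothetical, and the sketch assumes only what the outputs in this thread show: /jsz carries a top-level boolean "disabled" field when JetStream is disabled, and the field is absent otherwise.

```python
import json


def jetstream_disabled(jsz_body: str) -> bool:
    """Return True if a /jsz response body reports JetStream as disabled."""
    payload = json.loads(jsz_body)
    # Assumption: a missing "disabled" key means JetStream is enabled,
    # because the key only shows up in the disabled response in this thread.
    return bool(payload.get("disabled", False))


# Trimmed version of the /jsz response from Action 1.
sample = '{"disabled": true, "streams": 0, "consumers": 0, "messages": 0}'
print(jetstream_disabled(sample))  # True
```

Fetching the body from http://localhost:8222/jsz is left out on purpose so that the check itself stays easy to test.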

Action 2

  • Repeated the same steps on the eventing-nats-1 Pod.
  • The JetStream system became temporarily unavailable:
$ nats account info
Connection Information:

               Client ID: 80
               Client IP: 127.0.0.1
                     RTT: 44.96502ms
       Headers Supported: true
         Maximum Payload: 1.0 MiB
       Connected Cluster: eventing-nats
           Connected URL: nats://127.0.0.1:4222
       Connected Address: 127.0.0.1:4222
     Connected Server ID: NBKAYBIGPZ73E4LYEY64FU2BWTCGO53TTEYLJ5NVCP3DTM4H2DNQZFL6
   Connected Server Name: eventing-nats-0

JetStream Account Information:

   Could not obtain account information: JetStream system temporarily unavailable (10008)
  • Only the last JetStream-enabled Pod now reports an unhealthy status at http://localhost:8222/healthz:
{
   "status":"unavailable",
   "error":"JetStream has not established contact with a meta leader"
}
  • The other two NATS Pods return the following from http://localhost:8222/healthz (because JetStream is now disabled on them):
{
"status": "ok"
}
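
The mismatch observed in Action 2 can be written down as a small table and queried. This is only an illustrative Python sketch: PodState and falsely_healthy are hypothetical names, and the third Pod name, eventing-nats-2, is an assumption, since the thread only names eventing-nats-0 and eventing-nats-1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodState:
    name: str
    healthz_status: str       # "status" field of the /healthz body
    jetstream_disabled: bool  # "disabled" field of the /jsz body


def falsely_healthy(pods: List[PodState]) -> List[str]:
    """Names of Pods whose /healthz says "ok" although JetStream is disabled."""
    return [p.name for p in pods if p.healthz_status == "ok" and p.jetstream_disabled]


# State after Action 2, as described above.
observed = [
    PodState("eventing-nats-0", "ok", True),
    PodState("eventing-nats-1", "ok", True),
    PodState("eventing-nats-2", "unavailable", False),  # Pod name assumed
]
print(falsely_healthy(observed))  # ['eventing-nats-0', 'eventing-nats-1']
```

The two Pods that lost JetStream pass a /healthz-based probe, while the only Pod that still has JetStream enabled is the one reported as unavailable.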


http://localhost:8222/jsz returns:

{
   "server_id":"NDQSBNWV5E54T53H2YGMNFZINVOOOG4YO5XXUJVKYKNPZF4ZJ7DPF33I",
   "now":"2022-08-02T07:17:44.752285558Z",
   "config":{
      "max_memory":1073741824,
      "max_storage":1073741824,
      "store_dir":"/data/jetstream"
   },
   "memory":0,
   "storage":0,
   "reserved_memory":0,
   "reserved_storage":0,
   "accounts":1,
   "ha_assets":3,
   "api":{
      "total":2,
      "errors":0
   },
   "streams":1,
   "consumers":1,
   "messages":0,
   "bytes":0,
   "meta_cluster":{
      "name":"eventing-nats",
      "leader":"eventing-nats-1",
      "cluster_size":3
   }
}


JetStream-related metrics exported by the prometheus-nats-exporter at http://localhost:7777/metrics:

image

image

image


Opened a PR to export the nats_server_jetstream_disabled metric:

# HELP nats_server_jetstream_disabled JetStream disabled or not
# TYPE nats_server_jetstream_disabled gauge
nats_server_jetstream_disabled{cluster="xxx-nats",domain="",is_meta_leader="false",meta_leader="xxx-nats-1",server_id="NBUCPIJXXPLYEEKPF6C7EBZDFX3QCDHLEO2II4YMU5JEJXEAMOJ7AHJZ",server_name="xxx-nats-2"} 0
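
A consumer of this metric could look for non-zero samples in the exporter output. The following Python sketch is an illustration under assumptions, not the exporter's or Prometheus' own tooling: disabled_servers is a hypothetical helper, and the label parser is deliberately simple (it does not handle escaped quotes inside label values).

```python
import re
from typing import List

METRIC = "nats_server_jetstream_disabled"
# Matches a sample line of the form: metric_name{labels} value
_SAMPLE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def disabled_servers(metrics_text: str) -> List[str]:
    """Return the server_name label of every sample whose gauge is non-zero."""
    names = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        if float(match.group("value")) != 0:
            names.append(labels.get("server_name", "<unknown>"))
    return names


# The sample from the PR description above (gauge value 0, JetStream enabled).
sample = '''# HELP nats_server_jetstream_disabled JetStream disabled or not
# TYPE nats_server_jetstream_disabled gauge
nats_server_jetstream_disabled{cluster="xxx-nats",domain="",is_meta_leader="false",meta_leader="xxx-nats-1",server_id="NBUCPIJXXPLYEEKPF6C7EBZDFX3QCDHLEO2II4YMU5JEJXEAMOJ7AHJZ",server_name="xxx-nats-2"} 0
'''
print(disabled_servers(sample))  # []

# Hypothetical variant with the gauge flipped to 1.
print(disabled_servers(sample.replace("} 0", "} 1")))  # ['xxx-nats-2']
```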


mfaizanse commented Aug 4, 2022

Findings:
The JetStream health checks are currently not reliable. If an I/O error occurs, the NATS server simply disables JetStream on that instance and then reports 200 OK from its health check, even though JetStream is no longer usable there, which is inconsistent behaviour. The /jsz endpoint does reflect the state by returning response.disabled: true. In addition, the health check sometimes returns OK even when JetStream is out of sync or lagging.

Proposed follow-ups:

  • Upgrade NATS to the latest version.
  • [Suggestion 2] Change the liveness check from the / endpoint to /healthz, because /healthz also runs some internal health checks for the JetStream server, streams, and consumers.
    • I opened a PR that would let us configure the behaviour of /healthz:
      • /healthz?js-enabled=true returns a non-healthy status if JetStream is disabled on that instance.
      • /healthz?js-enabled=true&js-server-only=true checks only the JetStream server, not the streams and consumers.
  • Create an alert that fires when nats_server_jetstream_disabled changes from false to true. This has to wait until the PR that exports the metric is included in a new release.
  • [Alternative to Suggestion 2, if /healthz is still not reliable] Add a sidecar health-check container to the NATS Pods that continuously queries /jsz and /healthz, checks in depth whether the NATS instance is healthy, and decides whether it should be restarted. The liveness check can then target this container.
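
The decision logic of such a sidecar could look roughly like the Python sketch below. It is a sketch under stated assumptions, not an existing component: healthz_url and looks_unhealthy are hypothetical names, the query parameters are the ones proposed in the PR mentioned above (they may not exist in a given NATS release), and the HTTP polling itself is left out. Whether a failed check should actually restart the Pod is the open question from the acceptance criteria.

```python
import json
from urllib.parse import urlencode


def healthz_url(base: str, js_enabled: bool = False, js_server_only: bool = False) -> str:
    """Build a /healthz URL with the query parameters proposed in this thread."""
    params = {}
    if js_enabled:
        params["js-enabled"] = "true"
    if js_server_only:
        params["js-server-only"] = "true"
    query = urlencode(params)
    return f"{base}/healthz" + (f"?{query}" if query else "")


def looks_unhealthy(healthz_code: int, healthz_body: str, jsz_body: str) -> bool:
    """Combine /healthz and /jsz so that a disabled JetStream is not reported as healthy."""
    # /jsz exposes "disabled": true even when /healthz still answers 200 OK.
    if json.loads(jsz_body).get("disabled", False):
        return True
    if healthz_code != 200:
        return True
    return json.loads(healthz_body).get("status") != "ok"


print(healthz_url("http://localhost:8222", js_enabled=True, js_server_only=True))
# http://localhost:8222/healthz?js-enabled=true&js-server-only=true

# The Action 1 situation: /healthz is fine, but /jsz reports JetStream as disabled.
print(looks_unhealthy(200, '{"status": "ok"}', '{"disabled": true}'))  # True
```

Keeping the decision in a pure function separates it from the polling loop, so the cases seen in Action 1 and Action 2 can be replayed as plain test inputs.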


mfaizanse commented Aug 9, 2022
