Verify NATS health checks #14964

k15r · 2022-07-29T05:56:13Z

Description

In a recent incident, the NATS server reported itself as healthy (via its health checks). However, only the core NATS control plane was working properly; JetStream itself could not be accessed.

Acceptance

Investigate why the server's health check does not fail if JetStream is inaccessible.
Investigate whether it could be changed.
Investigate (with NATS-io) whether automatically killing nodes in such a scenario could have any negative effects on the cluster.

k15r · 2022-07-29T13:33:32Z

TB: 3d

mfaizanse · 2022-08-01T13:38:33Z

There are some improvements related to RAFT and heartbeats for JetStream in v2.8.3
Relevant issue on NATS: Provide Health Checks for external Systems
Relevant comment: https://github.com/nats-io/nats-server/pull/2815/files#r792026737
/healthz endpoint source: https://github.com/nats-io/nats-server/blob/903a06a5b4ee512b7c5231222d298ca7e3710c4f/server/monitor.go#L2945

mfaizanse · 2022-08-01T14:58:40Z

Deployed Kyma with production profile

Action 1

Deleted the /data/jetstream directory in eventing-nats-0 Pod.
Published an event.
JetStream got disabled in eventing-nats-0 Pod.

[30] 2022/08/01 14:27:46.616717 [ERR] JetStream out of resources, will be DISABLED
[30] 2022/08/01 14:27:46.616814 [WRN] JetStream initiating meta leader transfer
[30] 2022/08/01 14:27:46.616941 [INF] JetStream cluster no metadata leader
[30] 2022/08/01 14:27:48.617328 [WRN] JetStream timeout waiting for meta leader transfer
[30] 2022/08/01 14:27:48.618313 [INF] Initiating JetStream Shutdown...
[30] 2022/08/01 14:27:48.618487 [INF] JetStream Shutdown

Health probe http://localhost:8222/healthz returns 200 OK.
JetStream Info http://localhost:8222/jsz returns data.disabled: true:

{
   "server_id":"NBKAYBIGPZ73E4LYEY64FU2BWTCGO53TTEYLJ5NVCP3DTM4H2DNQZFL6",
   "now":"2022-08-01T14:57:56.528474816Z",
   "disabled":true,
   "config":{
      "max_memory":0,
      "max_storage":0
   },
   "memory":0,
   "storage":0,
   "reserved_memory":0,
   "reserved_storage":0,
   "accounts":0,
   "ha_assets":0,
   "api":{
      "total":0,
      "errors":0
   },
   "streams":0,
   "consumers":0,
   "messages":0,
   "bytes":0
}
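The mismatch above (200 OK from /healthz while /jsz reports disabled: true) can be caught by inspecting the /jsz payload directly. The following is a minimal sketch, not part of the investigation itself. It assumes that the `disabled` field is simply absent when JetStream is enabled, which matches the healthy /jsz response shown later in this thread; the function name `jetstream_disabled` is hypothetical.

```python
import json


def jetstream_disabled(jsz_body: str) -> bool:
    """Return True if a /jsz response body reports JetStream as disabled."""
    payload = json.loads(jsz_body)
    # Assumption: a missing "disabled" key means JetStream is enabled.
    return bool(payload.get("disabled", False))


# Trimmed-down payloads modelled on the responses in this thread.
disabled_sample = '{"disabled": true, "streams": 0, "consumers": 0}'
enabled_sample = '{"config": {"store_dir": "/data/jetstream"}, "streams": 1}'

print(jetstream_disabled(disabled_sample))  # True
print(jetstream_disabled(enabled_sample))   # False
```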

Action 2

Repeated the same actions on the eventing-nats-1 Pod.
The JetStream system became temporarily unavailable.

╰─ nats account info
Connection Information:

               Client ID: 80
               Client IP: 127.0.0.1
                     RTT: 44.96502ms
       Headers Supported: true
         Maximum Payload: 1.0 MiB
       Connected Cluster: eventing-nats
           Connected URL: nats://127.0.0.1:4222
       Connected Address: 127.0.0.1:4222
     Connected Server ID: NBKAYBIGPZ73E4LYEY64FU2BWTCGO53TTEYLJ5NVCP3DTM4H2DNQZFL6
   Connected Server Name: eventing-nats-0

JetStream Account Information:

   Could not obtain account information: JetStream system temporarily unavailable (10008)

Only the last JetStream-enabled Pod now returns an error from http://localhost:8222/healthz:

{
   "status":"unavailable",
   "error":"JetStream has not established contact with a meta leader"
}

The other two NATS Pods return the following from http://localhost:8222/healthz (because JetStream is now disabled on them):

{
   "status":"ok"
}
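One plausible reading of the "has not established contact with a meta leader" error is that the JetStream meta group, being RAFT-based, needs a majority of its members to elect a leader. This interpretation is not confirmed in the thread. The sketch below only does the majority arithmetic for the three-node cluster used here (cluster_size is 3 in the /jsz output); the function names are hypothetical.

```python
def raft_quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def meta_leader_possible(cluster_size: int, jetstream_enabled_nodes: int) -> bool:
    """True if enough JetStream-enabled nodes remain to form a majority."""
    return jetstream_enabled_nodes >= raft_quorum(cluster_size)


# After Action 1: two of three nodes still have JetStream enabled.
print(meta_leader_possible(3, 2))  # True
# After Action 2: only one node is left, which is below the majority of 2.
print(meta_leader_possible(3, 1))  # False
```

Under this reading, the last remaining Pod is the only one that reports the problem honestly, while the two Pods with JetStream disabled report "ok".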

mfaizanse · 2022-08-02T12:02:22Z

http://localhost:8222/jsz

{
   "server_id":"NDQSBNWV5E54T53H2YGMNFZINVOOOG4YO5XXUJVKYKNPZF4ZJ7DPF33I",
   "now":"2022-08-02T07:17:44.752285558Z",
   "config":{
      "max_memory":1073741824,
      "max_storage":1073741824,
      "store_dir":"/data/jetstream"
   },
   "memory":0,
   "storage":0,
   "reserved_memory":0,
   "reserved_storage":0,
   "accounts":1,
   "ha_assets":3,
   "api":{
      "total":2,
      "errors":0
   },
   "streams":1,
   "consumers":1,
   "messages":0,
   "bytes":0,
   "meta_cluster":{
      "name":"eventing-nats",
      "leader":"eventing-nats-1",
      "cluster_size":3
   }
}

mfaizanse · 2022-08-02T12:48:30Z

JetStream-related metrics exported by the nats-prometheus-exporter at http://localhost:7777/metrics:

mfaizanse · 2022-08-02T13:13:09Z

mfaizanse · 2022-08-03T08:51:29Z

Added a PR to export nats_server_jetstream_disabled.

# HELP nats_server_jetstream_disabled JetStream disabled or not
# TYPE nats_server_jetstream_disabled gauge
nats_server_jetstream_disabled{cluster="xxx-nats",domain="",is_meta_leader="false",meta_leader="xxx-nats-1",server_id="NBUCPIJXXPLYEEKPF6C7EBZDFX3QCDHLEO2II4YMU5JEJXEAMOJ7AHJZ",server_name="xxx-nats-2"} 0
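To illustrate how this gauge could be consumed, here is a small sketch that scans the exporter's text output and lists the servers reporting JetStream as disabled. It assumes that a value of 1 means "disabled" (the sample above shows 0 for an enabled server). The parser is deliberately simplified and does not handle escaped quotes inside label values; the function name `disabled_servers` is hypothetical.

```python
import re

METRIC = "nats_server_jetstream_disabled"
# Simplified matchers for the Prometheus text format: name{labels} value
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def disabled_servers(metrics_text: str) -> list:
    """Return the server_name label of every sample whose value is 1."""
    names = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        # Assumption: 1 means JetStream is disabled on that server.
        if float(match.group("value")) == 1.0:
            names.append(labels.get("server_name", "<unknown>"))
    return names


sample = """\
# HELP nats_server_jetstream_disabled JetStream disabled or not
# TYPE nats_server_jetstream_disabled gauge
nats_server_jetstream_disabled{cluster="xxx-nats",domain="",is_meta_leader="false",meta_leader="xxx-nats-1",server_id="NBUCPIJXXPLYEEKPF6C7EBZDFX3QCDHLEO2II4YMU5JEJXEAMOJ7AHJZ",server_name="xxx-nats-2"} 0
"""
print(disabled_servers(sample))  # []
```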

mfaizanse · 2022-08-04T07:59:50Z

Findings:
Currently, the JetStream health checks are not reliable. If any I/O error occurs, the NATS server simply disables JetStream on that instance and starts reporting 200 OK in the health checks, which is inconsistent behaviour, although the /jsz endpoint does return disabled: true in its response. In addition, the health check sometimes returns OK even when JetStream is out of sync or lagging.

Proposed follow-ups:

Upgrade NATS to the latest version.
[Suggestion 2] Change the liveness check from the / endpoint to /healthz, because /healthz internally also performs some health checks for the JetStream server, streams, and consumers.
- I opened a PR that would allow us to configure the behaviour of /healthz.
  - /healthz?js-enabled=true will return a non-healthy status if JetStream is disabled on that instance.
  - /healthz?js-enabled=true&js-server-only=true will only check the JetStream server, not the streams and consumers.
Create an alert if nats_server_jetstream_disabled changes from false to true. This has to wait until this PR is included in a new release.
[Alternative to Suggestion 2, if /healthz is still not reliable] Add a sidecar health check container to the NATS Pods that continuously queries /jsz and /healthz and checks in depth whether the NATS instance is healthy or should be restarted. We can use a liveness check on this container.
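As a rough illustration of the sidecar alternative, the sketch below combines one /healthz response and one /jsz response into a single verdict. It is not an agreed design. The helper names (`fetch_json`, `is_healthy`, `check_once`) are hypothetical, the monitoring port 8222 is taken from the experiments above, and what should happen on an unhealthy verdict (for example, restarting the Pod) is left open.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url: str, timeout: float = 2.0) -> dict:
    """GET a URL and decode the JSON body, including for non-2xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except urllib.error.HTTPError as err:
        body = err.read()  # an error response may still carry a JSON body
    return json.loads(body.decode("utf-8"))


def is_healthy(healthz: dict, jsz: dict):
    """Return (healthy, reason) from one /healthz and one /jsz payload."""
    if healthz.get("status") != "ok":
        return False, "healthz: " + healthz.get("error", "status not ok")
    # /healthz alone reported "ok" on Pods where JetStream was disabled,
    # so the /jsz payload is checked as well.
    if jsz.get("disabled", False):
        return False, "JetStream is disabled"
    return True, "ok"


def check_once(base_url: str = "http://localhost:8222", fetch=fetch_json):
    """Query both endpoints once; `fetch` is injectable for testing."""
    return is_healthy(fetch(base_url + "/healthz"), fetch(base_url + "/jsz"))


# Usage with canned responses modelled on Action 1 (no network access needed).
canned = {
    "http://localhost:8222/healthz": {"status": "ok"},
    "http://localhost:8222/jsz": {"disabled": True},
}
print(check_once(fetch=canned.__getitem__))  # (False, 'JetStream is disabled')
```

A real sidecar would run `check_once` in a loop and expose the result through its own probe endpoint or exit code.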

mfaizanse · 2022-08-09T08:50:31Z

Created Follow-ups:

k15r added the area/eventing label Jul 29, 2022

mfaizanse self-assigned this Jul 29, 2022

raypinto assigned marcobebway Aug 4, 2022

mfaizanse mentioned this issue Aug 9, 2022

Improve liveness health check for NATS JetStream #15046

Closed

5 tasks

mfaizanse mentioned this issue Aug 9, 2022

Create an alert if nats_server_jetstream_disabled changes from false to true. #15047

Closed

1 task

mfaizanse closed this as completed Aug 9, 2022

mfaizanse unassigned marcobebway Aug 9, 2022
