feat: add `livez` endpoint #2418

rexagod · 2024-06-12T21:19:31Z

Add a `livez` endpoint to detect network outages, so that the binary can be restarted when one is observed.

mrueg · 2024-06-13T11:46:54Z

pkg/app/server.go

+	mux.Handle(livezPath, http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
+
+		// Query the Kube API to make sure we are not affected by a network outage.
+		got := client.CoreV1().RESTClient().Get().AbsPath("/apis/").Do(context.Background())


Should we query the Kubernetes API server's readyz endpoint (https://kubernetes.io/docs/reference/using-api/health-checks/) instead, to detect an outage of the apiserver? I read readyz as "able to serve traffic".

+1, API (/apis) discoverability used here is a subset of /livez. I'll make the changes.

Should this be /livez or rather /readyz?

The Kubernetes API server could be healthy (/livez = true) but not be able to accept client requests (/readyz = false).

/livez signals when the container should be restarted, and thus whether there has been an outage. Besides, it makes sense to point our /livez to the cluster's /livez to ensure the same thing.
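
For illustration, a minimal sketch of such a handler using only the Go standard library. The PR performs the upstream query through the client-go REST client; here that call is abstracted behind a hypothetical `upstreamCheck` function so the example stays self-contained. The names `livezHandler`, `upstreamCheck`, `statusFor`, `alwaysUp`, and `alwaysDown`, as well as the 3-second timeout, are assumptions and are not taken from the PR.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"time"
)

// upstreamCheck is a hypothetical stand-in for the call that queries the
// cluster's /livez endpoint (the PR does this via the client-go REST client).
type upstreamCheck func(ctx context.Context) error

// livezHandler answers 200 when the upstream check succeeds and 503 otherwise,
// so that a livenessProbe can restart the container during an outage.
func livezHandler(check upstreamCheck) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Bound the upstream call so the probe does not hang on a dead network.
		// The 3-second value is an assumption chosen for this sketch.
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()

		if err := check(ctx); err != nil {
			http.Error(w, "kube API unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
}

// alwaysUp simulates a reachable API server.
func alwaysUp(context.Context) error { return nil }

// alwaysDown simulates a network outage.
func alwaysDown(context.Context) error { return errors.New("connection refused") }

// statusFor sends an in-process GET to the handler and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

With these helpers, `statusFor(livezHandler(alwaysUp), "/livez")` yields 200 and `statusFor(livezHandler(alwaysDown), "/livez")` yields 503, which is the signal a livenessProbe would act on.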

readinessProbe is currently set to the telemetry metrics' availability in our jsonnet config, which seems okay. A more robust approach would be coupling that with the cluster's /readyz and exposing that under a dedicated /readyz (we don't have a dedicated endpoint just yet). This would mean that (a) the cluster components are ready, and (b) the binary itself is ready. I can open another PR for that.
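
A rough sketch of what such a combined readiness endpoint could look like, assuming each condition is exposed as a simple boolean hook. This is not part of the PR (the dedicated /readyz is only proposed here as a follow-up); `namedCheck`, `readyzHandler`, and `readyzStatus` are hypothetical names.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// namedCheck is one readiness condition, for example "the cluster's /readyz
// succeeded" or "the telemetry endpoint is serving metrics".
type namedCheck struct {
	name string
	ok   func() bool
}

// readyzHandler returns 200 only when every condition holds. The first
// failing condition is reported in the 503 response body.
func readyzHandler(checks ...namedCheck) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		for _, c := range checks {
			if !c.ok() {
				http.Error(w, c.name+" not ready", http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
}

// readyzStatus exercises the handler in-process and returns the status code.
func readyzStatus(checks ...namedCheck) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/readyz", nil)
	readyzHandler(checks...).ServeHTTP(rec, req)
	return rec.Code
}
```

Requiring all conditions to hold mirrors points (a) and (b) above: a failure in either the cluster or the binary itself takes the pod out of rotation without restarting it.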

Sounds good to me! Let's open another PR for that

One question: We already have a /healthz path, what would this /livez path then be used for?

/healthz will send out a 200 if the binary is running, /readyz will send out a 200 if the exposition machinery is working as expected (we are ready to serve metrics), and /livez will send out a 200 if none of the collectors are affected by any outages (collectors depend on the Kube API to gather data).

A /healthz endpoint would be more suitable for a startupProbe, which is especially useful if we believe the binary takes a considerable time to start.
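
To make the distinction concrete, here is a hedged sketch of how the three endpoints could be wired onto one mux. It assumes the two fallible conditions are available as boolean hooks; `conditions`, `newProbeMux`, and `probe` are illustrative names, and the /readyz handler is an assumption since no dedicated endpoint exists yet.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// conditions holds hypothetical hooks for the two checks that can fail
// while the process itself is still running.
type conditions struct {
	kubeAPIReachable func() bool // backs /livez
	metricsServing   func() bool // backs /readyz
}

// respond writes 200 when ok is true and 503 otherwise.
func respond(w http.ResponseWriter, ok bool) {
	if !ok {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// newProbeMux registers one handler per endpoint described above.
func newProbeMux(c conditions) *http.ServeMux {
	mux := http.NewServeMux()
	// /healthz: the binary is running; nothing else is checked.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		respond(w, true)
	})
	// /livez: the collectors can reach the Kube API.
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		respond(w, c.kubeAPIReachable())
	})
	// /readyz: the exposition machinery is able to serve metrics.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		respond(w, c.metricsServing())
	})
	return mux
}

// probe sends an in-process GET to the mux and returns the status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

During a simulated API outage (`kubeAPIReachable` returning false), /healthz and /readyz would still answer 200 while /livez answers 503, so only the livenessProbe would trigger a restart.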

Thanks for clarifying! Could you add this to the README and update our jsonnet as well?

dgrisonnet · 2024-06-13T16:45:52Z

/assign @mrueg
/cc @richabanker
/triage accepted

mrueg · 2024-06-17T07:27:21Z

README.md

+
+The following probes are available, and follow the [Kubernetes best practices](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes):
+
+* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint.


Suggested change

-* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint.
+* `/livez`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint.

mrueg · 2024-06-17T07:27:40Z

README.md

+The following probes are available, and follow the [Kubernetes best practices](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes):
+
+* `livenessProbe`: Checks if the application is not affected by an outage, and is able to access the Kube API by querying the cluster's `/livez` endpoint.
+* `readinessProbe`: Checks if the application is ready to serve metrics by querying its own telemetry metrics.


Do we have one already?

mrueg · 2024-06-17T07:28:38Z

README.md

@@ -342,6 +342,13 @@ Note that your GCP identity is case sensitive but `gcloud info` as of Google Clo

 After running the above, if you see `Clusterrolebinding "cluster-admin-binding" created`, then you are able to continue with the setup of this service.

+#### Probes


Suggested change

-#### Probes
+#### Health endpoints

I'd prefer the former since per-port endpoints are not consistent across the binary and its telemetry expositions. Also, we don't have a dedicated /readyz but the readinessProbe makes use of the telemetry port's /metrics to determine if we are able to serve requests.

Friendly ping. :)

Added a section for endpoints as well, PTAL.

Should we remove this paragraph? I don't think it's specific to kube-state-metrics and having the endpoints documented is good enough. :)

Done, PTAL.

mrueg · 2024-06-25T08:51:53Z

README.md

@@ -342,6 +342,13 @@ Note that your GCP identity is case sensitive but `gcloud info` as of Google Clo

 After running the above, if you see `Clusterrolebinding "cluster-admin-binding" created`, then you are able to continue with the setup of this service.

+#### Probes


Should we remove this paragraph? I don't think it's specific to kube-state-metrics and having the endpoints documented is good enough. :)

mrueg · 2024-06-25T09:16:40Z

jsonnet/kube-state-metrics/kube-state-metrics.libsonnet

@@ -193,7 +193,7 @@
      },
      livenessProbe: { timeoutSeconds: 5, initialDelaySeconds: 5, httpGet: {
        port: 8080,
-        path: '/healthz',
+        path: '/livez',
      } },
      readinessProbe: { timeoutSeconds: 5, initialDelaySeconds: 5, httpGet: {
        port: 8081,


Can you change the path below to /metrics?

Good catch. Anyone using the same config as here was relying on the binary's own HTML landing page rather than the actual telemetry metrics.

mrueg · 2024-06-25T10:11:50Z

/lgtm

thanks!

k8s-ci-robot · 2024-06-25T10:11:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [mrueg,rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 12, 2024

k8s-ci-robot requested review from dgrisonnet and logicalhan June 12, 2024 21:19

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 12, 2024

rexagod changed the title ~~enhancement: add livez endpoint~~ feat: add livez endpoint Jun 13, 2024

rexagod force-pushed the livez branch from 3a07bd0 to 4b03c21 Compare June 13, 2024 07:23

rexagod mentioned this pull request Jun 13, 2024

OCPBUGS-33620: define *probes for KSM openshift/cluster-monitoring-operator#2352

Open

2 tasks

mrueg reviewed Jun 13, 2024

k8s-ci-robot assigned mrueg Jun 13, 2024

k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 13, 2024

k8s-ci-robot requested a review from richabanker June 13, 2024 16:45

k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 13, 2024

mrueg added this to the v2.13.0 milestone Jun 14, 2024

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2024

rexagod force-pushed the livez branch 2 times, most recently from a608b6b to c46e08e Compare June 14, 2024 19:25

mrueg reviewed Jun 17, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2024

rexagod mentioned this pull request Jun 23, 2024

WIP: Prep 2.13.0 #2419

Open

rexagod force-pushed the livez branch from f5fed5b to 6689506 Compare June 24, 2024 06:13

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 24, 2024

mrueg reviewed Jun 25, 2024

rexagod force-pushed the livez branch from 6689506 to a89ead2 Compare June 25, 2024 09:01

enhancement: add livez endpoint

eb80c09

Add a `livez` endpoint to identify network outages. This helps in restarting the binary if such a case is observed. Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

rexagod force-pushed the livez branch from a89ead2 to eb80c09 Compare June 25, 2024 09:05

mrueg reviewed Jun 25, 2024

fixup! enhancement: add livez endpoint

6f8f7d1

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2024

k8s-ci-robot merged commit d862cac into kubernetes:main Jun 25, 2024
12 checks passed
