OCPBUGS-33620: define *probes for KSM #2352

rexagod · 2024-05-16T03:01:38Z

Will set a target version post-4.16 branching.

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

openshift-ci-robot · 2024-05-16T03:01:43Z

@rexagod: This pull request references Jira Issue OCPBUGS-33620, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

I added CHANGELOG entry for this change.

No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

rexagod · 2024-05-16T03:01:49Z

jsonnet/components/kube-state-metrics.libsonnet

+                        },
+                      ],
+                      local livenessProbePath = 'healthz',
+                      local readinessProbePath = '',


https://github.com/kubernetes/kube-state-metrics/blob/6de105ebbe0eeb5d97d9e6adf1cc83314434e8cc/jsonnet/kube-state-metrics/kube-state-metrics.libsonnet#L200

openshift-ci · 2024-05-16T03:02:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [rexagod]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rexagod · 2024-05-16T03:12:44Z

jsonnet/components/kube-state-metrics.libsonnet

@@ -266,6 +266,38 @@ function(params)
                          readOnly: true,
                        },
                      ],
+                      local metricsPort = 8081,
+                      local selfPort = 8082,
+                      ports::: [


Using ::: (forced-visible fields) since kube-prometheus hides these fields.

juzhao · 2024-05-30T07:11:37Z

/jira refresh

openshift-ci-robot · 2024-05-30T07:11:42Z

@juzhao: This pull request references Jira Issue OCPBUGS-33620, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.17.0) matches configured target version for branch (4.17.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

simonpasquier

Exec probes have lots of edge cases and I'd rather stay away from them.
Also are we convinced that a liveness probe would have caught the network issue?

rexagod · 2024-06-04T08:10:08Z

There were a couple of asks in the original ticket.

1. Being able to "figure out" network outages.
2. Being able to restart the kube-state-metrics container if that happens.

I believe for the second case, exec probes may be the only way since kubelet pings are blocked off by KRP. In other cases, we've used HTTP-based probes for the outer KRP instance, but a non-200 status there will end up restarting the KRP instance, and not the proxied container.

For the first case, I believe we should keep the endpoints here as they are defined upstream, but make sure that /healthz (in upstream KSM) considers network outages before sending a 200 (querying a known resource type for metrics, for instance, or maybe a more suited endpoint). This way, liveness will restart the pod if there's an outage, while readiness will query :8082 to determine if the pod's good for traffic (so no changes there).

I agree and would prefer an HTTP-based approach over this given the edge cases, but this seems to check all the asks, unless I'm missing something here.

simonpasquier · 2024-06-04T09:52:12Z

> For the first case, I believe we should keep the endpoints here as they are defined upstream, but make sure that /healthz (in upstream KSM) considers network outages before sending a 200 (querying a known resource type for metrics, for instance, or maybe a more suited endpoint). This way, liveness will restart the pod if there's an outage, while readiness will query :8082 to determine if the pod's good for traffic (so no changes there).

https://github.com/kubernetes/kube-state-metrics/blob/a4ddfe6ed901c80b96e1dfab00a2df1820925b53/pkg/app/server.go#L405-L409

AFAICT /healthz always returns 200 OK. I'd be on the fence about adding checks against the Kubernetes API to the /healthz endpoint: strictly speaking, kube-state-metrics is still alive even if it can't reach the API. But it's an upstream discussion anyway.

rexagod · 2024-06-04T15:02:38Z

Right, and to make this a non-breaking change, would it make sense to add outage checks to a /livez endpoint instead? If so, I can pitch this upstream.

simonpasquier · 2024-06-04T15:08:29Z

Checking against what metrics-server does:

/livez reports whether the process is dead-locked or not.
/healthz reports whether all dependencies are operational.
/readyz reports that the process is ready to serve HTTP requests.

Even to avoid a breaking change, I'm not sure that you want to swap the semantics of /livez and /healthz?
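To make the distinction concrete, here is a minimal Python sketch of the three-endpoint split listed above. It is only a model: kube-state-metrics and metrics-server are written in Go, and `ProbeState`, `handle_probe`, and the field names are hypothetical, not taken from either project.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical snapshot of what each probe endpoint would inspect."""
    event_loop_responsive: bool = True   # False models a dead-locked process
    dependencies_reachable: bool = True  # e.g. the Kubernetes API
    http_server_started: bool = True     # listener is up and serving


def handle_probe(path: str, state: ProbeState) -> int:
    """Return the HTTP status a probe endpoint would answer with."""
    checks = {
        "/livez": state.event_loop_responsive,
        "/healthz": state.dependencies_reachable,
        "/readyz": state.http_server_started,
    }
    if path not in checks:
        return 404
    return 200 if checks[path] else 503


# An API outage fails /healthz but leaves /livez green, so a liveness probe
# pointed at /livez would not restart the container during the outage.
outage = ProbeState(dependencies_reachable=False)
print({p: handle_probe(p, outage) for p in ("/livez", "/healthz", "/readyz")})
```

Under these semantics, moving outage checks into /livez would give it the role that /healthz plays in this model, which is the swap being questioned.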

rexagod · 2024-06-06T05:54:59Z

Hmm, there seems to be a similar pattern (e.g., /healthz being influenced by dependencies) across components that are coupled with API-server (controller-manager) or aggregate-API-server (metrics-server).

But for out-of-tree non-apiserver-dependent projects like usage-metrics-collector, the /healthz seems to be in line with KSM, and seems to choose controller-runtime's style over apiserver's.

I believe this pattern would warrant including outage checks within /livez, making it responsible for looking out for any non-healing symptoms (i.e., outages) that would demand a restart.

WDYT?

rexagod · 2024-06-11T16:13:36Z

(PS. I might be wrong here, but I wanted to confirm my deduction to devise my approach to resolve this upstream)

simonpasquier

As I said above, I've been burnt too many times by exec probes. Can we configure kube-rbac-proxy to disable authn/authz for probes?

simonpasquier · 2024-06-13T10:20:16Z

assets/kube-state-metrics/deployment.yaml

+            command:
+            - sh
+            - -c
+            - if [ -x "$(command -v curl)" ]; then exec curl -s -I -f http://localhost:8081/livez; elif [ -x "$(command -v wget)" ]; then exec wget --quiet --tries=1 --spider http://localhost:8081/livez; else exit 1; fi


/livez returns 200 OK currently because the HTTP server returns the home URL. I'd rather have the /livez endpoint implemented before having the change here.

I've raised a PR for livez here (https://github.com/openshift/cluster-monitoring-operator/pull/2352/files#diff-81cd935cb1786fb907c3e0b33e0ebfde1a209b916aba70217b4a836992a1808aR281): kubernetes/kube-state-metrics#2418.
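The /livez observation above can be reproduced with a toy router. This sketch assumes a catch-all root handler that answers every unregistered path with the landing page, which is how the comment describes the current server. `route`, `landing_page`, and `livez` are hypothetical names, and the real server is Go, not Python.

```python
def route(path, handlers):
    """Dispatch to an exact match, else fall back to the catch-all root handler."""
    handler = handlers.get(path, handlers["/"])
    return handler()


def landing_page():
    return 200, "<html>kube-state-metrics</html>"


def livez():
    return 200, "OK"


# Before /livez exists, the probe still sees 200, but from the landing page.
before = {"/": landing_page}
# Once /livez is registered, the probe reaches a dedicated handler.
after = {"/": landing_page, "/livez": livez}

print(route("/livez", before))
print(route("/livez", after))
```

A 200 from the fallback says nothing about liveness, so the probe only becomes meaningful once the endpoint is actually implemented.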

rexagod · 2024-06-13T17:00:39Z

assets/kube-state-metrics/deployment.yaml

+        livenessProbe:
+          httpGet:
+            path: livez
+            port: 8443
+            scheme: HTTPS


(point livenessProbe for the KSM container to its respective proxied port; this helps us actually restart the container on a probe failure)

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

rexagod · 2024-06-13T20:43:24Z

/test e2e-aws-ovn-techpreview

rexagod · 2024-06-25T13:38:55Z

kubernetes/kube-state-metrics#2418 (comment) is in, and will be included in kubernetes/kube-state-metrics#2419 (v2.13.0).

rexagod · 2024-06-25T13:43:47Z

Pinging @simonpasquier for one last review here to make sure everything looks good before we release KSM v2.13.0; after that, all that's left is merging this.

openshift-ci · 2024-07-01T20:15:41Z

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

simonpasquier · 2024-07-03T12:39:44Z

assets/kube-state-metrics/deployment.yaml

@@ -108,6 +124,7 @@ spec:
        - --tls-private-key-file=/etc/tls/private/tls.key
        - --client-ca-file=/etc/tls/client/client-ca.crt
        - --config-file=/etc/kube-rbac-policy/config.yaml
+        - --ignore-paths=/metrics


We ignore /metrics here because it's used by the readiness probe? It doesn't feel correct since it would mean that any in-cluster actor can read the KSM metrics too?

Right. Even though we've relied on telemetry metrics to check for readiness upstream, exposing that information leaves it open to misuse. I'll patch upstream to restrict that information from being exposed and add a dedicated /readyz.

On second thought, it'd be better to not rely on telemetry endpoints at all and base all probes off of the actual KSM resource metrics server.

Opened kubernetes/kube-state-metrics#2442.
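The concern about --ignore-paths=/metrics can be sketched with a toy model of a proxy that skips authentication for listed paths. This is not kube-rbac-proxy's implementation: `is_request_allowed` is a hypothetical helper, and exact string matching is assumed for simplicity.

```python
def is_request_allowed(path, authenticated, ignore_paths):
    """Toy auth gate: ignored paths bypass authn/authz, everything else needs it."""
    if path in ignore_paths:
        return True
    return authenticated


# Ignoring /metrics lets an unauthenticated in-cluster client scrape metrics.
metrics_ignored = {"/metrics"}
print(is_request_allowed("/metrics", authenticated=False, ignore_paths=metrics_ignored))

# Ignoring only dedicated probe paths keeps /metrics behind authentication.
probe_only = {"/livez", "/readyz"}
print(is_request_allowed("/metrics", authenticated=False, ignore_paths=probe_only))
print(is_request_allowed("/livez", authenticated=False, ignore_paths=probe_only))
```

In this model, dedicated probe endpoints let the kubelet reach the container unauthenticated without widening access to the metrics themselves.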

simonpasquier

I don't understand how the e2e tests pass given that the /livez endpoint doesn't exist.

simonpasquier · 2024-07-03T12:45:12Z

assets/kube-state-metrics/deployment.yaml

@@ -57,7 +57,22 @@ spec:
          ^kube_pod_completion_time$,
          ^kube_pod_status_scheduled$
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0
+        livenessProbe:
+          httpGet:
+            path: livez


should be /livez?

The leading slash may be omitted, but I'll patch this up so it's consistent with the other paths.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-deployment
  namespace: default
  labels:
    app: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      ...
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: healthfoo
              port: 80
      ...

Containers:
  sample-container:
    Container ID:   cri-o://736a33ba57ec594200107098015e4d0fd38c03d24e46e65b359bd83e43172095
    Image:          nginx:latest
    Image ID:       docker.io/library/nginx@sha256:67682bda769fae1ccf5183192b8daf37b64cae99c6c3302650f6f8bf5f0f95df
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 03 Jul 2024 23:08:25 +0530
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:80/healthfoo delay=15s timeout=2s period=5s #success=1 #failure=3
    Readiness:      http-get http://:80/readyz delay=5s timeout=2s period=5s #success=1 #failure=3

openshift-ci-robot added the jira/severity-low (referenced Jira bug's severity is low for the branch this PR is targeting) and jira/valid-reference (indicates that this PR references a valid Jira ticket of any type) labels May 16, 2024

openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 16, 2024

openshift-ci bot requested review from danielmellado and slashpai May 16, 2024 03:02

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2024

openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 30, 2024

openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 30, 2024

openshift-ci bot requested a review from juzhao May 30, 2024 07:11

simonpasquier reviewed Jun 3, 2024

rexagod force-pushed the 33620 branch 2 times, most recently from 0d8a1d8 to 312fa45 Compare June 13, 2024 07:13

simonpasquier reviewed Jun 13, 2024

rexagod force-pushed the 33620 branch from 312fa45 to 862eb57 Compare June 13, 2024 16:54

rexagod force-pushed the 33620 branch from 862eb57 to 7c801ea Compare June 13, 2024 17:09

ksm: add liveness and readiness probes

2e35d00

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>

rexagod force-pushed the 33620 branch from 7c801ea to 2e35d00 Compare June 13, 2024 17:11

rexagod requested a review from simonpasquier July 2, 2024 15:24

simonpasquier reviewed Jul 3, 2024
