Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-33620: define *probes for KSM #2352

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rexagod
Copy link
Member

@rexagod rexagod commented May 16, 2024

Will set a target version post-4.16 branching.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 16, 2024
@openshift-ci-robot
Copy link
Contributor

@rexagod: This pull request references Jira Issue OCPBUGS-33620, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 16, 2024
},
],
local livenessProbePath = 'healthz',
local readinessProbePath = '',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

openshift-ci bot commented May 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 16, 2024
@@ -266,6 +266,38 @@ function(params)
readOnly: true,
},
],
local metricsPort = 8081,
local selfPort = 8082,
ports::: [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juzhao
Copy link

juzhao commented May 30, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label May 30, 2024
@openshift-ci-robot
Copy link
Contributor

@juzhao: This pull request references Jira Issue OCPBUGS-33620, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 30, 2024
@openshift-ci openshift-ci bot requested a review from juzhao May 30, 2024 07:11
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exec probes have lots of edge cases and I'd rather stay away from them.
Also are we convinced that a liveness probe would have caught the network issue?

@rexagod
Copy link
Member Author

rexagod commented Jun 4, 2024

There were a couple of asks in the original ticket.

  • Being able to "figure out" network outages.
  • Being able to restart the kube-state-metrics container if that happens.

I believe for the second case, exec probes may be the only way since kubelet pings are blocked off by KRP. In other cases, we've used HTTP-based probes for the outer KRP instance, but a non-200 status there will end up restarting the KRP instance, and not the proxied container.

For the first case, I believe we should keep the endpoints here as they are defined upstream, but make sure that /healthz (in upstream KSM) considers network outages before sending a 200 (querying a known resource type for metrics, for instance, or maybe a more suited endpoint). This way, liveness will restart the pod if there's an outage, while readiness will query :8082 to determine if the pod's good for traffic (so no changes there).

IMHO I agree and would prefer HTTP-based approach over this given the edge cases, but this seems to check all the asks, unless I'm missing something here.

@simonpasquier
Copy link
Contributor

For the first case, I believe we should keep the endpoints here as they are defined upstream, but make sure that /healthz (in upstream KSM) considers network outages before sending a 200 (querying a known resource type for metrics, for instance, or maybe a more suited endpoint). This way, liveness will restart the pod if there's an outage, while readiness will query :8082 to determine if the pod's good for traffic (so no changes there).

https://github.com/kubernetes/kube-state-metrics/blob/a4ddfe6ed901c80b96e1dfab00a2df1820925b53/pkg/app/server.go#L405-L409

AFAICT /healthz always return 200 OK. I'd be on the fence adding checks against Kubernetes API to the /healthz endpoint: strictly speaking kube-state-metrics is still alive even if it can't reach the API. But it's an upstream discussion anyway.

@rexagod
Copy link
Member Author

rexagod commented Jun 4, 2024

Right, and to make this a non-breaking change, would it make sense to add outage checks to a /livez endpoint instead? If so, I can pitch this upstream.

@simonpasquier
Copy link
Contributor

Checking against what metrics-server does:

  • /livez reports whether the process is dead-locked or not.
  • /healthz reports whether all dependencies are operational.
  • /readyz reports that the process is ready to serve HTTP requests.

Even to avoid a breaking change, I'm not sure that you want to swap the semantics of /livez and /healthz?

@rexagod
Copy link
Member Author

rexagod commented Jun 6, 2024

Hmm, there seems to be a similar pattern (eg., /healthz being influenced by dependencies) across components that are coupled with API-server (controller-manager) or aggregate-API-server (metrics-server).

But for out-of-tree non-apiserver-dependent projects like usage-metrics-collector, the /healthz seems to be in line with KSM, and seems to choose controller-runtime's style over apiserver's.

I believe this pattern would warrant for including outage checks within /livez, while making it responsible for looking out for any non-healing symptoms (i.e., outages) which would demand a restart.

WDYT?

@rexagod
Copy link
Member Author

rexagod commented Jun 11, 2024

(PS. I might be wrong here, but I wanted to confirm my deduction to devise my approach to resolve this upstream)

@rexagod rexagod force-pushed the 33620 branch 2 times, most recently from 0d8a1d8 to 312fa45 Compare June 13, 2024 07:13
Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said above, I've been burnt too many times by exec probes. Can we configure kube-rbac-proxy to disable authn/authz for probes?

command:
- sh
- -c
- if [ -x "$(command -v curl)" ]; then exec curl -s -I -f http://localhost:8081/livez; elif [ -x "$(command -v wget)" ]; then exec wget --quiet --tries=1 --spider http://localhost:8081/livez; else exit 1; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/livez returns 200 OK currently because the HTTP server returns the home URL. I'd rather have the /livez endpoint implemented before having the change here.

Copy link
Member Author

@rexagod rexagod Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 60 to 64
livenessProbe:
httpGet:
path: livez
port: 8443
scheme: HTTPS
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(point livenessProbe for the KSM container to it's respective proxied port, this helps us actually restart the container on a probe failure)

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
@rexagod
Copy link
Member Author

rexagod commented Jun 13, 2024

/test e2e-aws-ovn-techpreview

@rexagod
Copy link
Member Author

rexagod commented Jun 25, 2024

kubernetes/kube-state-metrics#2418 (comment) is in, and will be included in kubernetes/kube-state-metrics#2419 (v2.13.0).

@rexagod
Copy link
Member Author

rexagod commented Jun 25, 2024

Pinging @simonpasquier for one last review here to make sure everything looks good before we release KSMv2.13.0, so all that will be left after that would be merging this.

Copy link
Contributor

openshift-ci bot commented Jul 1, 2024

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -108,6 +124,7 @@ spec:
- --tls-private-key-file=/etc/tls/private/tls.key
- --client-ca-file=/etc/tls/client/client-ca.crt
- --config-file=/etc/kube-rbac-policy/config.yaml
- --ignore-paths=/metrics
Copy link
Contributor

@simonpasquier simonpasquier Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ignore /metrics here because it's used by the readiness probe? It doesn't feel correct since it would mean that any in-cluster actor can read the KSM metrics too?

Copy link
Member Author

@rexagod rexagod Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, even though we've relied on telemetry metrics to check for readiness upstream, nonetheless, exposing that information is susceptible to malicious intent. I'll patch upstream to restrict that information from being exposed and add a dedicated /readyz.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it'd be better to not rely on telemetry endpoints at all and base all probes off of the actual KSM resource metrics server.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how the e2e tests pass given that the /livez endpoint doesn't exist.

@@ -57,7 +57,22 @@ spec:
^kube_pod_completion_time$,
^kube_pod_status_scheduled$
image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0
livenessProbe:
httpGet:
path: livez
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be /livez?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be omitted, I'll patch this up though for it to be consistent with other paths.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-deployment
  namespace: default
  labels:
    app: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
		...
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: healthfoo
            port: 80
			...
Containers:
  sample-container:
    Container ID:   cri-o://736a33ba57ec594200107098015e4d0fd38c03d24e46e65b359bd83e43172095
    Image:          nginx:latest
    Image ID:       docker.io/library/nginx@sha256:67682bda769fae1ccf5183192b8daf37b64cae99c6c3302650f6f8bf5f0f95df
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 03 Jul 2024 23:08:25 +0530
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:80/healthfoo delay=15s timeout=2s period=5s #success=1 #failure=3
    Readiness:      http-get http://:80/readyz delay=5s timeout=2s period=5s #success=1 #failure=3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants