OCPBUGS-33620: define *probes for KSM #2352
base: master
Conversation
@rexagod: This pull request references Jira Issue OCPBUGS-33620, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
      },
    ],
    local livenessProbePath = 'healthz',
    local readinessProbePath = '',
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rexagod. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -266,6 +266,38 @@ function(params)
          readOnly: true,
        },
      ],
    local metricsPort = 8081,
    local selfPort = 8082,
    ports::: [
The ':::' marker (jsonnet's force-visible field syntax) is needed since kube-prometheus hides these fields.
/jira refresh
@juzhao: This pull request references Jira Issue OCPBUGS-33620, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Exec probes have lots of edge cases and I'd rather stay away from them.
Also, are we convinced that a liveness probe would have caught the network issue?
There were a couple of asks in the original ticket.
I believe for the second case, exec probes may be the only way, since kubelet pings are blocked off by KRP. In other cases, we've used HTTP-based probes for the outer KRP instance, but a non-

For the first case, I believe we should keep the endpoints here as they are defined upstream, but make sure that

IMHO I agree and would prefer an HTTP-based approach over this given the edge cases, but this seems to check all the asks, unless I'm missing something here.
AFAICT /healthz always returns 200 OK. I'd be on the fence about adding checks against the Kubernetes API to the /healthz endpoint: strictly speaking, kube-state-metrics is still alive even if it can't reach the API. But it's an upstream discussion anyway.
Right, and to make this a non-breaking change, would it make sense to add outage checks to a
Checking against what metrics-server does:
Even to avoid a breaking change, I'm not sure that you want to swap the semantics of /livez and /healthz?
Hmm, there seems to be a similar pattern (e.g.,

But for out-of-tree non-

I believe this pattern would warrant including outage checks within

WDYT?
(P.S. I might be wrong here, but I wanted to confirm my deduction to devise my approach to resolving this upstream.)
Force-pushed from 0d8a1d8 to 312fa45.
As I said above, I've been burnt too many times by exec probes. Can we configure kube-rbac-proxy to disable authn/authz for probes?
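One way to do that is kube-rbac-proxy's --ignore-paths flag, which exempts the listed paths from authn/authz. A sketch of what that could look like (the container name, listen address, and exact probe paths below are illustrative assumptions; only --ignore-paths itself appears later in this change):

```yaml
# Sketch: exempt probe endpoints from kube-rbac-proxy's authn/authz.
# --ignore-paths is an existing kube-rbac-proxy flag; the container name,
# addresses, and probe paths here are illustrative assumptions.
- name: kube-rbac-proxy-main
  args:
    - --secure-listen-address=:8443
    - --upstream=http://127.0.0.1:8081/
    - --ignore-paths=/livez,/readyz
```

Note that anything listed in --ignore-paths becomes reachable without authentication, which is the same trade-off that applies to exempting /metrics.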
command:
  - sh
  - -c
  - if [ -x "$(command -v curl)" ]; then exec curl -s -I -f http://localhost:8081/livez; elif [ -x "$(command -v wget)" ]; then exec wget --quiet --tries=1 --spider http://localhost:8081/livez; else exit 1; fi
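In context, that command would sit in an exec-style probe roughly as follows (a sketch: the command is taken from the diff above, while the surrounding structure and timings are illustrative):

```yaml
# Sketch of how the curl/wget fallback command plugs into an exec probe.
# Timings are illustrative; /livez must exist upstream for this to pass.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - >-
        if [ -x "$(command -v curl)" ]; then
        exec curl -s -I -f http://localhost:8081/livez;
        elif [ -x "$(command -v wget)" ]; then
        exec wget --quiet --tries=1 --spider http://localhost:8081/livez;
        else exit 1; fi
  initialDelaySeconds: 5
  periodSeconds: 30
```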
/livez returns 200 OK currently because the HTTP server returns the home URL. I'd rather have the /livez endpoint implemented before having the change here.
livenessProbe:
  httpGet:
    path: livez
    port: 8443
    scheme: HTTPS
(point the livenessProbe for the KSM container to its respective proxied port; this helps us actually restart the container on a probe failure)
Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
/test e2e-aws-ovn-techpreview
kubernetes/kube-state-metrics#2418 (comment) is in, and will be included in kubernetes/kube-state-metrics#2419 (v2.13.0).
Pinging @simonpasquier for one last review here to make sure everything looks good before we release KSM v2.13.0; all that will be left after that is merging this.
@rexagod: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@@ -108,6 +124,7 @@ spec:
        - --tls-private-key-file=/etc/tls/private/tls.key
        - --client-ca-file=/etc/tls/client/client-ca.crt
        - --config-file=/etc/kube-rbac-policy/config.yaml
        - --ignore-paths=/metrics
We ignore /metrics here because it's used by the readiness probe? It doesn't feel correct since it would mean that any in-cluster actor can read the KSM metrics too?
Right, even though we've relied on telemetry metrics to check for readiness upstream, exposing that information is susceptible to malicious intent. I'll patch upstream to restrict that information from being exposed and add a dedicated /readyz.
On second thought, it'd be better to not rely on telemetry endpoints at all and base all probes off of the actual KSM resource metrics server.
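Concretely, basing both probes on the main resource-metrics server could look roughly like this (a sketch: port 8081 is the metricsPort defined earlier in this change, the /livez and /readyz paths assume the upstream endpoints discussed in this thread, and the timings are illustrative):

```yaml
# Sketch only: probe the actual KSM resource-metrics server rather than
# the telemetry endpoints. Paths assume the upstream /livez and /readyz
# endpoints discussed here; timings are illustrative.
livenessProbe:
  httpGet:
    path: /livez
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /readyz
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 30
```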
I don't understand how the e2e tests pass given that the /livez endpoint doesn't exist.
@@ -57,7 +57,22 @@ spec:
            ^kube_pod_completion_time$,
            ^kube_pod_status_scheduled$
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.12.0
        livenessProbe:
          httpGet:
            path: livez
should be /livez?
The leading slash may be omitted, but I'll patch this up so it's consistent with the other paths.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-deployment
  namespace: default
  labels:
    app: sample-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      ...
          - containerPort: 80
        livenessProbe:
          httpGet:
            path: healthfoo
            port: 80
      ...
Containers:
  sample-container:
    Container ID:   cri-o://736a33ba57ec594200107098015e4d0fd38c03d24e46e65b359bd83e43172095
    Image:          nginx:latest
    Image ID:       docker.io/library/nginx@sha256:67682bda769fae1ccf5183192b8daf37b64cae99c6c3302650f6f8bf5f0f95df
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 03 Jul 2024 23:08:25 +0530
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:80/healthfoo delay=15s timeout=2s period=5s #success=1 #failure=3
    Readiness:      http-get http://:80/readyz delay=5s timeout=2s period=5s #success=1 #failure=3
Will set a target version post-4.16 branching.