Add reason tag to kubernetes_state.job.failed #25103
Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM:
inv create-vm --pipeline-id=33332919 --os-family=ubuntu
pkg/collector/corechecks/cluster/ksm/kubernetes_state_transformers.go
Docs 👍
@@ -421,10 +421,26 @@ func trimJobTag(tag string) (string, bool) {
	return trimmed, tag != trimmed
}

var jobFailureReasons = map[string]struct{}{
	"backofflimitexceeded": {},
	"deadlineexceeded":     {},
Unless we update kube-state-metrics/v2 to at least v2.9.0, which contains the bug fix kubernetes/kube-state-metrics#2046, we cannot add reason:deadlineexceeded to kubernetes_state.job.failed.
The kube-state-metrics/v2 update should be done in a separate PR, for the reasons below.
Why do we currently use k8s.io/kube-state-metrics/v2 v2.8.2?
Because we have to align with an interface change before bumping kube-state-metrics/v2.
The current implementations in https://github.com/DataDog/datadog-agent/tree/2feb83da045935df7986e56504bd297922a32ebb/pkg/collector/corechecks/cluster/ksm/customresources don't follow the RegistryFactory interface as updated by kubernetes/kube-state-metrics#1851.
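For readers following the thread, here is a minimal sketch of how a set like jobFailureReasons could gate which KSM reason values become tags. The helper name reasonTag and the lowercasing step are assumptions made for illustration, not the Agent's actual code, and the set contains only the two entries visible in the hunk above.

```go
package main

import (
	"fmt"
	"strings"
)

// jobFailureReasons copies the two entries visible in the diff hunk.
// The real map in the PR may differ.
var jobFailureReasons = map[string]struct{}{
	"backofflimitexceeded": {},
	"deadlineexceeded":     {},
}

// reasonTag is a hypothetical helper. It lowercases the raw KSM reason
// label (an assumption about how the tag value is normalized) and returns
// a "reason:<value>" tag only when the value is in the allowed set.
func reasonTag(raw string) (string, bool) {
	reason := strings.ToLower(raw)
	if _, ok := jobFailureReasons[reason]; !ok {
		return "", false
	}
	return "reason:" + reason, true
}

func main() {
	for _, raw := range []string{"BackoffLimitExceeded", "DeadlineExceeded", "Evicted"} {
		tag, ok := reasonTag(raw)
		fmt.Printf("%q -> %q (kept=%v)\n", raw, tag, ok)
	}
}
```

A map[string]struct{} is used as a set here because the empty struct takes no storage, which matches the shape shown in the diff.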
if reasonTagIndex != -1 {
	tags = append(tags[:reasonTagIndex], tags[reasonTagIndex+1:]...)
}
This logic is now in validateJob().
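As a side note on the idiom in the hunk above, the following sketch shows what append(tags[:i], tags[i+1:]...) does to the underlying array, next to a copying variant. Both helper names and the sample tag values are hypothetical. This is not a reconstruction of validateJob().

```go
package main

import "fmt"

// removeAt is a hypothetical helper using the same idiom as the hunk.
// It reuses the backing array of tags, so elements after index i are
// shifted left in place and the caller's original slice is modified.
func removeAt(tags []string, i int) []string {
	if i < 0 || i >= len(tags) {
		return tags
	}
	return append(tags[:i], tags[i+1:]...)
}

// removeAtCopy is a hypothetical alternative that builds a new slice
// and leaves the input untouched.
func removeAtCopy(tags []string, i int) []string {
	if i < 0 || i >= len(tags) {
		return tags
	}
	out := make([]string, 0, len(tags)-1)
	out = append(out, tags[:i]...)
	return append(out, tags[i+1:]...)
}

func main() {
	// Sample tag values are made up for the example.
	tags := []string{"kube_job:hello", "reason:backofflimitexceeded", "kube_namespace:default"}

	fmt.Println(removeAtCopy(tags, 1)) // new slice without the reason tag
	fmt.Println(tags)                  // input is unchanged

	fmt.Println(removeAt(tags, 1)) // same result, but shares storage with tags
	fmt.Println(tags)              // input has been shifted in place
}
```

The in-place form is fine when the caller reassigns the result and no other slice shares the same backing array, as in the hunk. The copying form is safer when the tags slice is shared.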
/merge
🚂 MergeQueue: Pull request added to the queue. This build is going to start soon! (estimated merge in less than 28m)
What does this PR do?
Add reason:backofflimitexceeded,deadlineexceeded to kubernetes_state.job.failed.

Motivation
Currently, we cannot monitor why a job failed.

Additional Notes
- kube_job_status_failed in KSM (code) is the only metric that contains reason in its labels.
- kube_job_failed doesn't have a reason label; see "kube_job_failed should have reason label" kubernetes/kube-state-metrics#2382.
- kube_pod_container_status_terminated_reason in KSM (code) does not have reason:backofflimitexceeded,deadlineexceeded, even though it can have reason:ContainerStatusUnknown, etc.

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes
Collect KSM metrics from this job.