Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reason tag to kubernetes_state.job.failed #25103

Merged
merged 9 commits into from
May 9, 2024

Conversation

keisku
Copy link
Contributor

@keisku keisku commented Apr 25, 2024

What does this PR do?

Add reason:backofflimitexceeded,deadlineexceeded to the kubernetes_state.job.failed.

Motivation

Currently, we cannot monitor why a job failed.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Collect KSM metrics from this job.

apiVersion: batch/v1
kind: Job
metadata:
  name: exit-fail
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["exit", "1"]
      restartPolicy: OnFailure

2024-05-01_21-30-52

@keisku keisku requested a review from a team as a code owner April 25, 2024 02:24
@pr-commenter
Copy link

pr-commenter bot commented Apr 25, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=33332919 --os-family=ubuntu

@keisku keisku requested a review from a team as a code owner April 27, 2024 00:14
@keisku keisku changed the title Add BackoffLimitExceeded, DeadlineExceeded as reason tag Add resaon tag to KSM and Kubelet metrics Apr 27, 2024
@keisku keisku added this to the 7.54.0 milestone Apr 27, 2024
@keisku keisku changed the title Add resaon tag to KSM and Kubelet metrics Add reason tag to KSM and Kubelet metrics Apr 27, 2024
@L3n41c L3n41c modified the milestones: 7.54.0, 7.55.0 Apr 29, 2024
Copy link
Contributor

@hestonhoffman hestonhoffman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs 👍

@keisku keisku changed the title Add reason tag to KSM and Kubelet metrics Add reason tag to kubernetes_state.job.failed May 1, 2024
@@ -421,10 +421,26 @@ func trimJobTag(tag string) (string, bool) {
return trimmed, tag != trimmed
}

var jobFailureReasons = map[string]struct{}{
"backofflimitexceeded": {},
"deadlineexceeded": {},
Copy link
Contributor Author

@keisku keisku May 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we update kube-state-metrics/v2 to at least v2.9.0 that contains the bug fix, kubernetes/kube-state-metrics#2046, we cannot add reason:deadlineexceeded to kubernetes_state.job.failed.

kube-state-metrics/v2 update should be in a different PR as the context below.

Why do currently we use k8s.io/kube-state-metrics/v2 v2.8.2?
This is because we have to align with interface change before bumping up the kube-state-metrics/v2.

Current implementations in
https://github.com/DataDog/datadog-agent/tree/2feb83da045935df7986e56504bd297922a32ebb/pkg/collector/corechecks/cluster/ksm/customresources don't follow type RegistryFactory interface updated by kubernetes/kube-state-metrics#1851.

Comment on lines -485 to -488
if reasonTagIndex != -1 {
tags = append(tags[:reasonTagIndex], tags[reasonTagIndex+1:]...)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is now in validateJob().

@keisku
Copy link
Contributor Author

keisku commented May 9, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented May 9, 2024

🚂 MergeQueue

Pull request added to the queue.

This build is going to start soon! (estimated merge in less than 28m)

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 4faaee8 into main May 9, 2024
203 checks passed
@dd-mergequeue dd-mergequeue bot deleted the keisku/support1663984 branch May 9, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants