Fix casing while scraping failure reason for kube_job_status_failed #2046

Conversation

juanjjaramillo
Contributor

@juanjjaramillo juanjjaramillo commented Apr 13, 2023

What this PR does / why we need it:
When scraping the failure reason for a job, KSM does a case-sensitive search for the keyword DeadLineExceeded. Since Kubernetes is generating the string DeadlineExceeded, the search fails to scrape the reason. This PR proposes to do a case-insensitive search instead, so changes in casing do not affect the scraping of the reason.
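
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not the kube-state-metrics code: the helper names `exactMatch` and `foldMatch` are hypothetical and only contrast a case-sensitive comparison with `strings.EqualFold` on the two spellings named in this description.

```go
package main

import (
	"fmt"
	"strings"
)

// exactMatch is a hypothetical helper: case-sensitive equality,
// the kind of comparison described as failing in this PR.
func exactMatch(got, want string) bool {
	return got == want
}

// foldMatch is a hypothetical helper: case-insensitive comparison,
// the approach this PR description proposes.
func foldMatch(got, want string) bool {
	return strings.EqualFold(got, want)
}

func main() {
	const fromKubernetes = "DeadlineExceeded" // spelling Kubernetes generates, per the description
	const searchedByKSM = "DeadLineExceeded"  // mis-cased keyword KSM searched for

	fmt.Println(exactMatch(fromKubernetes, searchedByKSM)) // false
	fmt.Println(foldMatch(fromKubernetes, searchedByKSM))  // true
}
```

With the exact comparison the reason is never matched, which is consistent with the symptom reported in the linked issue.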

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Does not change cardinality

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2045

@linux-foundation-easycla

linux-foundation-easycla bot commented Apr 13, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: juanjjaramillo / name: Juan Jose Jaramillo (2990fd5)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 13, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

Welcome @juanjjaramillo!

It looks like this is your first PR to kubernetes/kube-state-metrics 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kube-state-metrics has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 13, 2023
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Apr 13, 2023
@CatherineF-dev
Contributor

This metric is a stable metric. https://github.com/kubernetes/kube-state-metrics/blob/main/docs/job-metrics.md

Changing this might break alerts using this reason.

cc @dgrisonnet do we allow this change for stable metrics?

@mrueg
Member

mrueg commented Apr 14, 2023

Was it ever DeadLineExceeded in upstream Kubernetes? If not, this seems like a bug in kube-state-metrics.

@dgrisonnet
Member

Yes, this is a bug. I believe it was always DeadlineExceeded, but we might have made a typo when introducing the change.

@juanjjaramillo could you perhaps also update the value in https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/job.go#L42?

@dgrisonnet
Member

@CatherineF-dev yes, we do. It is very rare that label value changes are breaking changes, and even if they are, I think we should always go forward with them. If there is any doubt, though, we should always call the change out in the changelog/release note.

Moreover, in this particular case it fixes an actual bug that prevents the actual reason from being exposed, so we should definitely fix it.

@@ -429,5 +430,5 @@ func failureReason(jc *v1batch.JobCondition, reason string) bool {
 	if jc == nil {
 		return false
 	}
-	return jc.Reason == reason
+	return strings.EqualFold(jc.Reason, reason)
Member

We want to map the value exactly to make sure that we have a bounded list of potential values, so I don't think we should make that case insensitive. We should rather update the list of reasons and make sure we are more careful in the future to avoid this kind of typo.
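
One way to read the point about a bounded list is sketched below. This is an illustration under stated assumptions, not the code in internal/store/job.go: the `knownReasons` slice and the `reasonLabel` helper are hypothetical, and the listed reasons are examples only. The idea is that a reason is exposed only when it exactly equals an entry in a fixed list, so the set of label values cannot grow beyond that list.

```go
package main

import "fmt"

// knownReasons is an illustrative allow-list (an assumption for this sketch),
// not the actual list maintained by kube-state-metrics.
var knownReasons = []string{"BackoffLimitExceeded", "DeadlineExceeded"}

// reasonLabel is a hypothetical helper. It returns the matching entry from
// knownReasons when the reason matches exactly, and "" otherwise.
func reasonLabel(reason string) string {
	for _, known := range knownReasons {
		if reason == known {
			return known
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", reasonLabel("DeadlineExceeded")) // "DeadlineExceeded"
	fmt.Printf("%q\n", reasonLabel("deadlineexceeded")) // ""
	fmt.Printf("%q\n", reasonLabel("SomethingElse"))    // ""
}
```

Under this design a typo in the list shows up as a missing reason rather than as an unexpected label value, which is why the fix suggested here is to correct the list entry.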

Contributor Author

Makes sense @dgrisonnet

I have already undone this change. The casing in https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/job.go#L42 was already fixed in the previous commit, so I only needed to update the unit tests to make them pass. Please let me know if something else is needed.

Thank you for the review!

@juanjjaramillo juanjjaramillo changed the title from "Ignore casing while scraping failure reason for kube_job_status_failed" to "Fix casing while scraping failure reason for kube_job_status_failed" Apr 14, 2023
Member

@dgrisonnet dgrisonnet left a comment

Thank you @juanjjaramillo! :)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 14, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgrisonnet, juanjjaramillo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2023
@k8s-ci-robot k8s-ci-robot merged commit 168254d into kubernetes:main Apr 14, 2023
@juanjjaramillo juanjjaramillo deleted the juanjjaramillo/fix_job_failure_reason branch April 14, 2023 17:33
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Scraping the failure reason for a job does a case-sensitive search that fails
5 participants