
Need metric to return job failed reason #947

Closed
tatchiuleung opened this issue Oct 11, 2019 · 11 comments · Fixed by #1214
Labels: kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@tatchiuleung

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:
A metric like kube_job_status_failed_reason is missing.
What you expected to happen:
Add kube_job_status_failed_reason to expose the failure reason, such as Evicted.
How to reproduce it (as minimally and precisely as possible):
Create a CronJob with tight resource limits (e.g. on ephemeral-storage)
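A minimal reproduction sketch (the name, image, schedule, and limit values below are illustrative, not from the original report): a CronJob whose container has a tiny ephemeral-storage limit and then writes past it, so the pod is evicted and the Job fails with reason Evicted.

```yaml
# Illustrative CronJob: the container writes 100Mi against a 1Mi
# ephemeral-storage limit, triggering eviction of the pod.
apiVersion: batch/v1beta1  # matches the 1.13 server version below
kind: CronJob
metadata:
  name: backups
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: backups
            image: busybox
            command: ["sh", "-c", "dd if=/dev/zero of=/tmp/fill bs=1M count=100; sleep 60"]
            resources:
              limits:
                ephemeral-storage: "1Mi"
```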

Anything else we need to know?:
Kubernetes returns the following status for the evicted job:

status:
  message: 'The node was low on resource: ephemeral-storage. Container backups was
    using 3405160Ki, which exceeds its request of 0. '
  phase: Failed
  reason: Evicted
  startTime: "2019-10-11T00:00:02Z"

We would like to filter out the Evicted jobs, even though they count as failed.
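To illustrate the request, here is a rough sketch of what a bounded-reason metric could look like in Prometheus exposition format. The function name, label set, and reason list are hypothetical, chosen only for illustration; they are not the actual kube-state-metrics implementation (the final shape was settled in PR #1214). The key idea is the usual kube-state-metrics pattern for enum-like states: emit one series per known reason, with value 1 for the observed reason and 0 otherwise, so cardinality stays bounded.

```python
# Hypothetical sketch of the requested metric, not kube-state-metrics code.
# A fixed, known set of failure reasons keeps the metric's cardinality bounded.
KNOWN_REASONS = ("BackoffLimitExceeded", "DeadlineExceeded", "Evicted")

def render_job_failed_reason(job_name: str, namespace: str, reason: str) -> str:
    """Render exposition lines for a failed Job: one series per known reason,
    value 1 for the observed reason and 0 for all others."""
    lines = []
    for r in KNOWN_REASONS:
        value = 1 if r == reason else 0
        lines.append(
            f'kube_job_status_failed_reason{{job_name="{job_name}",'
            f'namespace="{namespace}",reason="{r}"}} {value}'
        )
    return "\n".join(lines)

print(render_job_failed_reason("backups-1570752000", "default", "Evicted"))
```

With a metric shaped like this, the evicted jobs could be filtered in PromQL with a selector such as `kube_job_status_failed_reason{reason="Evicted"} == 1`.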
Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T23:49:07Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-gke.24", GitCommit:"2ce02ef1754a457ba464ab87dba9090d90cf0468", GitTreeState:"clean", BuildDate:"2019-08-12T22:05:28Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kube-state-metrics image version
quay.io/coreos/kube-state-metrics:v1.7.2
@k8s-ci-robot added the kind/feature label Oct 11, 2019
@AdityaMisra
Contributor

AdityaMisra commented Oct 30, 2019

@brancz @lilic I want to work on this issue.

@brancz
Member

brancz commented Oct 30, 2019

If there is a bounded number of reasons, I think this would be ok to have.

@lilic
Member

lilic commented Oct 30, 2019

@AdityaMisra do you mind looking up the number of reasons for failed jobs first, so we can make sure the metric does not have high unbounded cardinality? Otherwise go ahead. :)

/assign @AdityaMisra

@juliantaylor

This is also needed for normal pods, not only jobs; maybe via a reason label on kube_pod_status_phase.

@lilic
Member

lilic commented Nov 15, 2019

@juliantaylor Feel free to open a separate issue for pods; as it's a different resource, it makes sense to discuss it separately.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 13, 2020
@AdityaMisra
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 13, 2020
@RajatVaryani

@AdityaMisra Are you still working on this?

@AdityaMisra
Contributor

@RajatVaryani I'm working on it and will open a PR soon.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 27, 2020
@lilic
Member

lilic commented Jul 27, 2020

/remove-lifecycle stale

There is a PR open for this.

@k8s-ci-robot removed the lifecycle/stale label Jul 27, 2020