
Need metric to return job failed reason #947

Closed
tatchiuleung opened this issue Oct 11, 2019 · 11 comments · Fixed by #1214
Labels: kind/feature (Categorizes issue or PR as related to a new feature.)

Comments

@tatchiuleung

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

What happened:
A metric like kube_job_status_failed_reason is missing.
What you expected to happen:
Add kube_job_status_failed_reason to expose the failure reason, such as Evicted.
How to reproduce it (as minimally and precisely as possible):
Create a CronJob with tight resource limits (e.g. on ephemeral-storage)
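A minimal reproduction sketch (the name, image, schedule, and limit values below are illustrative, not from the original report): a CronJob whose container has a tiny ephemeral-storage limit and then writes past it, so the pod is evicted and the Job fails with reason Evicted.

```yaml
# Illustrative CronJob: the container writes 100Mi against a 1Mi
# ephemeral-storage limit, triggering eviction of the pod.
apiVersion: batch/v1beta1  # matches the 1.13 server version below
kind: CronJob
metadata:
  name: backups
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: backups
            image: busybox
            command: ["sh", "-c", "dd if=/dev/zero of=/tmp/fill bs=1M count=100; sleep 60"]
            resources:
              limits:
                ephemeral-storage: "1Mi"
```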

Anything else we need to know?:
Kubernetes returns the following status for the evicted job:

status:
  message: 'The node was low on resource: ephemeral-storage. Container backups was
    using 3405160Ki, which exceeds its request of 0. '
  phase: Failed
  reason: Evicted
  startTime: "2019-10-11T00:00:02Z"

We would like to filter out the Evicted jobs, even though they count as failed.
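To illustrate the request, here is a rough sketch of what a bounded-reason metric could look like in Prometheus exposition format. The function name, label set, and reason list are hypothetical, chosen only for illustration; they are not the actual kube-state-metrics implementation (the final shape was settled in PR #1214). The key idea is the usual kube-state-metrics pattern for enum-like states: emit one series per known reason, with value 1 for the observed reason and 0 otherwise, so cardinality stays bounded.

```python
# Hypothetical sketch of the requested metric, not kube-state-metrics code.
# A fixed, known set of failure reasons keeps the metric's cardinality bounded.
KNOWN_REASONS = ("BackoffLimitExceeded", "DeadlineExceeded", "Evicted")

def render_job_failed_reason(job_name: str, namespace: str, reason: str) -> str:
    """Render exposition lines for a failed Job: one series per known reason,
    value 1 for the observed reason and 0 for all others."""
    lines = []
    for r in KNOWN_REASONS:
        value = 1 if r == reason else 0
        lines.append(
            f'kube_job_status_failed_reason{{job_name="{job_name}",'
            f'namespace="{namespace}",reason="{r}"}} {value}'
        )
    return "\n".join(lines)

print(render_job_failed_reason("backups-1570752000", "default", "Evicted"))
```

With a metric shaped like this, the evicted jobs could be filtered in PromQL with a selector such as `kube_job_status_failed_reason{reason="Evicted"} == 1`.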
Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T23:49:07Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.7-gke.24", GitCommit:"2ce02ef1754a457ba464ab87dba9090d90cf0468", GitTreeState:"clean", BuildDate:"2019-08-12T22:05:28Z", GoVersion:"go1.11.5b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kube-state-metrics image version
quay.io/coreos/kube-state-metrics:v1.7.2
@k8s-ci-robot added the kind/feature label Oct 11, 2019
@AdityaMisra
Contributor

AdityaMisra commented Oct 30, 2019

@brancz @lilic I want to work on this issue.

@brancz
Member

brancz commented Oct 30, 2019

If there is a bounded number of reasons, I think this would be ok to have.

@lilic
Member

lilic commented Oct 30, 2019

@AdityaMisra do you mind looking up the number of reasons for failed jobs first, so we can make sure the metric does not have high unbounded cardinality? Otherwise go ahead. :)

/assign @AdityaMisra

@juliantaylor

This is also needed for normal pods, not only jobs; maybe via a reason label on kube_pod_status_phase.

@lilic
Member

lilic commented Nov 15, 2019

@juliantaylor Feel free to open a separate issue for pods; as it's a different resource, it makes sense to discuss it separately.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 13, 2020
@AdityaMisra
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 13, 2020
@RajatVaryani

@AdityaMisra Are you still working on this?

@AdityaMisra
Contributor

@RajatVaryani I'm working on it and will open a PR soon.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 27, 2020
@lilic
Member

lilic commented Jul 27, 2020

/remove-lifecycle stale

There is a PR open for this.

@k8s-ci-robot removed the lifecycle/stale label Jul 27, 2020