Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Terminating" status in kube_pod_status_phase metrics #1013

Conversation

clamoriniere
Copy link
Contributor

@clamoriniere clamoriniere commented Dec 24, 2019

What this PR does:

Aim of this PullRequest is to add several new pod metrics:

  • kube_pod_deleted: Unix deletion timestamp
  • kube_pod_status_reason: The pod status reasons (NodeLost, Evicted)
    These new metrics can be used to determine the pod.status.phase
    displayed by kubectl: Running, Terminated, Unknown…
    It will allow removing the kubectl display logic added previously
    to compute the “Unknown” phase.

{phase == v1.PodUnknown || (p.DeletionTimestamp != nil && p.Status.Reason == nodeUnreachablePodReason), string(v1.PodUnknown)},

why we need it:

In some cases, a Pod can be stuck in the "Terminating" phase due to a "Kubelet" issue; for example: the Kubelet is not able to communicate with the container runtime, or the container runtime is not able to delete the associated container.

And so, it can be interesting to have a way to create an alert on this kind of "bad" Pod state, which is currently not possible to do since the pod is flag by kube_pod_status_phase metrics as "Running".

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 24, 2019
@jinnovation
Copy link

This PR seems like it addresses #348 as well, which was closed w/o resolution.

@clamoriniere
Copy link
Contributor Author

This PR seems like it addresses #348 as well, which was closed w/o resolution.

@jinnovation it is exactly the use case that we want to monitor with this fix: Pod blocked in Terminated state that require specific operation on the node.

@@ -274,6 +278,9 @@ var (
}
}

// compute the "Terminating" phase in order to reflect what a user can see with kubectl: https://github.com/kubernetes/kubernetes/blob/6a4216ba59ce1d09c8ac1c6229649bb71b7f6c85/pkg/printers/internalversion/printers.go#L737-L741
isPodTerminating := p.DeletionTimestamp != nil && p.Status.Reason != nodeUnreachablePodReason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really how kube state metrics is implemented otherwise. We usually just expose the raw data and leave computations like this to query time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree, but some logic was already added to compute the Unknow status.
https://github.com/kubernetes/kube-state-metrics/pull/1013/files#diff-791ee1d484ea70de1ecd1d8875f8f5c0L286

Here we want to reflect what is displayed by "kubectl" to avoid confusion, and avoid having a Pod in a Running state when it is not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that was a mistake and since we're in the process of cleaning up kube-state-metrics and releasing preparing a breaking release, the time is right to change this. @lilic @tariq1890 what do you think? We should definitely make sure that for these cases we have the required raw data available so we can compute these things at runtime, but we should strive to remove any of these types of things from kube-state-metrics, just like we state in our docs :) .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes agreed, we should expose just raw metrics, users can compute whatever they want afterwards, and/or create recording rules for this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way with the current information to know how many pods have the deletionTimestamp set?
With this info, we could indeed easily compute the "Terminating" status

Copy link
Contributor Author

@clamoriniere clamoriniere Jan 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brancz and @lilic
I can rework this PR to add missing metrics of "raw" information to allow computing the Terminating and Unknown pod state from metrics.
Does that sound good?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that sounds perfect! :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR updated with 2 new metrics:

  • kube_pod_deleted
  • kube_pod_status_reason

I can create another PR to remove the current logic for the Unknown pod phase.

@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from 5fd6a40 to e69b6fc Compare January 15, 2020 17:38
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2020
@lilic
Copy link
Member

lilic commented Jan 16, 2020

Tests are failing as you need to update the docs for the metrics.

@clamoriniere
Copy link
Contributor Author

Hi @lilic
sorry about the failing tests.

I updated the PR with proper documentation and also flagged new metrics as "EXPERIMENTAL" in the documentation.

@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from fd9409a to 680456e Compare January 16, 2020 17:40
@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from 680456e to 2aff65e Compare January 16, 2020 18:22
@clamoriniere
Copy link
Contributor Author

Hi @tariq1890
I'm wondering if you may have some time to review my last code update?
thanks

@tariq1890
Copy link
Contributor

@clamoriniere Sorry for the delay and Thank you so much for your patience with this PR! It's greatly appreciated :).

The code looks good, my last request is for you to provide a few examples (doesn;'t have to be complete) of using these raw basic metrics together for asserting on particular pod states (such as Terminating) in the README. This will be of great help to people who are transitioning over to the next version.

@clamoriniere
Copy link
Contributor Author

Ok no problem.

@tariq1890
Copy link
Contributor

@clamoriniere Just following up on this :)

@clamoriniere
Copy link
Contributor Author

@clamoriniere Just following up on this :)

Hi @tariq1890,
Sorry I was in vacation last week.

  • I added the promQL queries in the docs/pod-metrics.md file. I thought it makes more sens in this file than in the README.md.
  • I also needed to update a bit the make doccheck to support extra documentation in the docs/pod-metrics.md file.

Let me know if you prefer that I move the doc in the README.md

@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from 54e369a to f0f27ae Compare February 5, 2020 12:04
Aim of this PullRequest is to add several new pod metrics:
- kube_pod_deleted: Unix deletion timestamp
- kube_pod_status_reason: The pod status reasons (NodeLost, Evicted)
These new metrics can be used to determine the `pod.status.phase`
displayed by `kubectl`: Running, Terminated, Unknown…
It will allow removing the `kubectl` display logic added previously
to compute the “Unknown” phase.

Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from f0f27ae to 02844c3 Compare February 5, 2020 18:11
Copy link
Contributor

@tariq1890 tariq1890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clamoriniere Thank you! Last set of review comments and this should be good to go :)

docs/pod-metrics.md Outdated Show resolved Hide resolved
docs/pod-metrics.md Outdated Show resolved Hide resolved
docs/pod-metrics.md Outdated Show resolved Hide resolved
docs/pod-metrics.md Outdated Show resolved Hide resolved
@clamoriniere
Copy link
Contributor Author

clamoriniere commented Feb 5, 2020

@tariq1890
thanks for the review and suggestions. PR updated

let me know if you want me to do another PR that removes the current logic to generate the kube_pod_status_phase{phase="Unknown"} metric. (link:

// This logic is directly copied from: https://github.com/kubernetes/kubernetes/blob/d39bfa0d138368bbe72b0eaf434501dcb4ec9908/pkg/printers/internalversion/printers.go#L597-L601
)

@tariq1890
Copy link
Contributor

Sure! That sounds good to me :)

Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
@clamoriniere clamoriniere force-pushed the feature/addTerminatingPhaseInPod_Status_Phase_metric branch from 564ed16 to 311c682 Compare February 5, 2020 20:40
Copy link
Contributor

@tariq1890 tariq1890 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 5, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: clamoriniere, tariq1890

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2020
@k8s-ci-robot k8s-ci-robot merged commit 2148cb9 into kubernetes:master Feb 5, 2020
@clamoriniere clamoriniere deleted the feature/addTerminatingPhaseInPod_Status_Phase_metric branch February 6, 2020 08:19
clamoriniere added a commit to clamoriniere/kube-state-metrics that referenced this pull request Feb 6, 2020
As explained in the doc: "kube-state-metrics exposes raw data
unmodified from the Kubernetes API".

The goal of this change was to remove the kubctl logic introduced to
get the `Unknown` phase in `kube_pod_status_phase` metric.

With the introduction of `kube_pod_deleted` metric, it is now possible
to get the same result with a promQL query. (more info in this kubernetes#1013).

Signed-off-by: cedric lamoriniere <cedric.lamoriniere@datadoghq.com>
@buttonbuilder
Copy link

Which release will this be included in ?

@clamoriniere
Copy link
Contributor Author

Which release will this be included in ?

I think in the next major release: v2.0.0

@vc43
Copy link

vc43 commented Jan 13, 2023

I have a pod in status Terminating but whith kube-state-metrics:v2.7.0 can not see kube_pod_status_phase{phase="Terminating"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants