
Have a metric that introspects why pods failed in the cluster #725

Open
3 tasks
fridex opened this issue Jun 24, 2021 · 29 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/observability Categorizes an issue or PR as relevant to SIG Observability triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fridex
Contributor

fridex commented Jun 24, 2021

Is your feature request related to a problem? Please describe.

As a Thoth operator, I would like to know why solvers failed in the cluster (e.g. whether they failed due to OOM).

As a Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about the exit code returned by the corresponding container in a workflow.

We can sync on how these components return exit codes and on the semantics behind those exit codes.

Acceptance criteria

  • Explore the existing metrics and workflow metrics and document them for the pod-failure case described above
  • Design the metrics to be included in metrics-exporter to expose metrics for the failure or error case
  • Design and implement a dashboard panel based on these metrics.
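To make the proposed metric more concrete, below is a minimal sketch of how such an exporter could look, assuming a Prometheus-style metrics-exporter written in Python. The metric name, label set, namespace, and Argo label selector are illustrative assumptions, not the project's actual conventions.

    # Hypothetical sketch (not the actual metrics-exporter code): expose the last
    # terminated exit code of workflow containers as a Prometheus gauge.
    from kubernetes import client, config
    from prometheus_client import Gauge, start_http_server

    # Metric and label names are made up for illustration.
    EXIT_CODE = Gauge(
        "thoth_workflow_container_last_exit_code",
        "Last exit code of a terminated container in a workflow pod.",
        ["namespace", "pod", "container"],
    )

    def collect_exit_codes(namespace: str = "thoth-backend-prod") -> None:
        """Scan workflow pods and record the exit code of terminated containers."""
        v1 = client.CoreV1Api()
        # Selecting pods by the Argo workflow label is an assumption about how
        # solver/adviser pods are labeled in the cluster.
        pods = v1.list_namespaced_pod(namespace, label_selector="workflows.argoproj.io/workflow")
        for pod in pods.items:
            for status in pod.status.container_statuses or []:
                terminated = status.state.terminated or status.last_state.terminated
                if terminated is not None:
                    EXIT_CODE.labels(namespace, pod.metadata.name, status.name).set(terminated.exit_code)

    if __name__ == "__main__":
        config.load_incluster_config()  # or config.load_kube_config() outside the cluster
        start_http_server(8080)         # serve /metrics for Prometheus to scrape
        collect_exit_codes()            # in practice this would run periodically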
@fridex fridex added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 24, 2021
@pacospace
Contributor

pacospace commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth operator, I would like to know why solvers failed in the cluster (e.g. whether they failed due to OOM).

As a Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about the exit code returned by the corresponding container in a workflow.

We can sync on how these components return exit codes and on the semantics behind those exit codes.

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

@pacospace
Contributor

Screenshot from 2021-06-29 18-57-16

@fridex
Contributor Author

fridex commented Jun 30, 2021

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

Are these computed by the reporter based on documents stored on Ceph?

@pacospace
Contributor

pacospace commented Jun 30, 2021

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

Are these computed by the reporter based on documents stored on Ceph?

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

@fridex
Contributor Author

fridex commented Jun 30, 2021

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect a platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):

    The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist
    LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1
                                        ^

    [SQL: SELECT EXISTS (SELECT *
           FROM depends_on
           WHERE depends_on.platform = %(platform_1)s) AS anon_1]

    (Background on this error at: http://sqlalche.me/e/13/f405)

With metrics reported by the reporter, we will know about this issue one day later, not in real time - that will not give us insight into how the system works right now and what actions should be taken to recover from the error state.

If the situation with an inconsistent system accidentally occurs again someday in the future, we should be alerted: "the recommender system is giving too many errors in adviser pods with these exit codes, the system operator should have a look at it". That way, we will keep the system up and make sure that if there is any misbehavior, the system operator looks at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing; having info about failed workflows (e.g. the platform fails to bring a pod up) is another thing to consider in this case.

@pacospace
Contributor

pacospace commented Jun 30, 2021

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect a platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):

    The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist
    LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1
                                        ^

    [SQL: SELECT EXISTS (SELECT *
           FROM depends_on
           WHERE depends_on.platform = %(platform_1)s) AS anon_1]

    (Background on this error at: http://sqlalche.me/e/13/f405)

With metrics reported by the reporter, we will know about this issue one day later, not in real time - that will not give us insight into how the system works right now and what actions should be taken to recover from the error state.

If the situation with an inconsistent system accidentally occurs again someday in the future, we should be alerted: "the recommender system is giving too many errors in adviser pods with these exit codes, the system operator should have a look at it". That way, we will keep the system up and make sure that if there is any misbehavior, the system operator looks at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing; having info about failed workflows (e.g. the platform fails to bring a pod up) is another thing to consider in this case.

I see your point. In that case, what justification is reported by adviser? So we need to find a way to read the exit codes of the pods so they are reported immediately (at the moment we only have the percentage of adviser failures, and then we asynchronously analyze the reason from the documents on Ceph).

Screenshot from 2021-06-30 10-17-58

Here errors are decreasing but workflow failures are increasing (ocp4-stage), while the succeeded ones are not changing much.

We have another metric on the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis); if they do not match for a long time, something wrong is happening in the system (e.g. Kafka is off (another metric is available for that), the database is off).

@fridex
Contributor Author

fridex commented Jun 30, 2021

in that case what justification is reported by adviser?

There is no justification created as the pod errored; adviser reports the following error information:

    "report": {
      "ERROR": "An error occurred, see logs for more info"
    }

We have another metric on the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis); if they do not match for a long time, something wrong is happening in the system (e.g. Kafka is off (another metric is available for that), the database is off).

Yes, the metric discussed before on calls is not applicable to this case - in this case, the system produces documents but does not satisfy user requests. The metric you brought up introspects whether the system produces any documents (and should alert as well if it does not).

@goern
Member

goern commented Jun 30, 2021

/priority important-soon
/remove-triage needs-information
/triage accept

@sesheta sesheta added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 30, 2021
@sesheta
Member

sesheta commented Jun 30, 2021

@goern: The label(s) triage/accept cannot be applied, because the repository doesn't have them.

In response to this:

/priority important-soon
/remove-triage needs-information
/triage accept

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@goern
Member

goern commented Jun 30, 2021

/triage accepted

@sesheta sesheta added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 30, 2021
@sesheta
Member

sesheta commented Jul 30, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 30, 2021
@fridex
Contributor Author

fridex commented Jul 30, 2021

/remove-lifecycle rotten

@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 30, 2021
@sesheta
Member

sesheta commented Aug 29, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta
Member

sesheta commented Aug 29, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta closed this as completed Aug 29, 2021
@pacospace pacospace reopened this Aug 30, 2021
@harshad16 harshad16 added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 27, 2021
@goern
Member

goern commented Jan 14, 2022

/priority important-longterm

@sesheta sesheta added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 14, 2022
@pacospace pacospace removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@harshad16 harshad16 added the sig/observability Categorizes an issue or PR as relevant to SIG Observability label Aug 11, 2022
@codificat
Member

/sig observability

@VannTen
Member

VannTen commented Aug 18, 2022

Potentially relevant metrics

From kube-state-metrics:

  • kube_pod_container_status_last_terminated_reason
    sample: kube_pod_container_status_last_terminated_reason{cluster="emea/balrog", container="acm-agent", endpoint="https-main", job="kube-state-metrics", namespace="open-cluster-management-agent-addon", pod="klusterlet-addon-workmgr-65c7c49798-z7jc2", prometheus="openshift-monitoring/k8s", reason="Error", service="kube-state-metrics"}

From argo workflow controller:

  • argo_workflows_error_count
    sample: argo_workflows_error_count{cause="CronWorkflowSubmissionError", cluster="emea/balrog", endpoint="metrics", field="workflow-controller-metrics-thoth-backend-prod.apps.balrog.aws.operate-first.cloud", instance="10.128.2.40:8080", job="workflow-controller-metrics", namespace="thoth-backend-prod", pod="workflow-controller-58dccdddb6-49cv8", prometheus="openshift-user-workload-monitoring/user-workload", service="workflow-controller-metrics"}

+ all the metrics documented at https://argoproj.github.io/argo-workflows/metrics/#default-controller-metrics, probably

Beyond that, custom workflow metrics (metrics defined in the Workflow spec, from what I gather) look relevant.
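As a rough illustration of how the kube-state-metrics series above could already be consumed, here is a small sketch (Python against the Prometheus HTTP API) that counts last-terminated reasons per container; the Prometheus URL and namespace are placeholders and the exact query shape is an assumption.

    # Sketch only: count how many containers last terminated with each reason
    # (OOMKilled, Error, ...) using the kube-state-metrics series mentioned above.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.local:9090"  # placeholder
    NAMESPACE = "thoth-backend-prod"                          # placeholder

    # The series has value 1 for the reason a container last terminated with.
    QUERY = (
        'sum by (reason) ('
        f'kube_pod_container_status_last_terminated_reason{{namespace="{NAMESPACE}"}}'
        ')'
    )

    def last_terminated_reasons() -> dict:
        """Return a mapping of termination reason -> number of containers."""
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return {sample["metric"]["reason"]: float(sample["value"][1]) for sample in result}

    if __name__ == "__main__":
        for reason, count in last_terminated_reasons().items():
            print(f"{reason}: {count:g}")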

@VannTen
Member

VannTen commented Aug 25, 2022

Relevant: kubernetes/kube-state-metrics#1481 (the issue is only closed because it is old, not because it was refused).

@VannTen
Member

VannTen commented Aug 25, 2022

Some opinions.
If we only need exit codes, I don't think the application is the right level to implement this:

  • The application can't access its own exit code, by definition.
  • Exit codes are universal for any kind of container (any process, in fact).
    IMO, this points to implementing the linked feature request in kube-state-metrics. -> And in fact I had not seen it, but there is a linked PR: add exit code kubernetes/kube-state-metrics#1752

Since (most of) the exit codes are ours to define, we can map them to any reason we like.
However, if the number of possible reasons is unbounded (or just > 126), we'll probably want to use another mechanism.
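To illustrate how a bounded set of exit codes could map to reasons, a hypothetical convention might look like the sketch below; the specific codes and reason names are invented for the example and are not anything adviser or solver actually use today.

    # Hypothetical exit-code convention for a Thoth component (codes are invented).
    from enum import IntEnum

    class AdviserExitCode(IntEnum):
        SUCCESS = 0
        INVALID_USER_INPUT = 10
        DATABASE_SCHEMA_MISMATCH = 11
        RESOLUTION_FAILED = 12
        # 137 is what Kubernetes reports for SIGKILL, e.g. an OOM-killed container.
        OOM_KILLED = 137

    # Exit code -> human-readable reason, usable as a metric label value.
    EXIT_CODE_REASON = {code.value: code.name.lower() for code in AdviserExitCode}

    def reason_for(exit_code: int) -> str:
        """Map an exit code to a reason label; unknown codes fall back to 'unknown'."""
        return EXIT_CODE_REASON.get(exit_code, "unknown")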

@VannTen
Member

VannTen commented Aug 26, 2022

(I'll unassign myself; I don't think we have a clear enough view yet of what we want to do with this.)
(And there was a bit of a misunderstanding about it on the SIG call.)

@VannTen VannTen removed their assignment Aug 26, 2022
@VannTen
Member

VannTen commented Sep 2, 2022

I think we should use the kube-state-metrics feature once the previously linked
PR is merged.

Unless someone has a different opinion, I propose we keep this frozen until the
PR is merged and a subsequent release of kube-state-metrics is out.

@VannTen
Member

VannTen commented Sep 5, 2022

The kube-state-metrics PR got merged.

I'll keep an eye on this when they release a new version.
/assign

@VannTen
Member

VannTen commented Sep 9, 2022

kube-state-metrics does releases something like every 2-4 months, judging from their
history. The last one was 16 days ago, so it might take some time before a new one.

Do we have an idea of what the timeline is for:

kube-state-metrics new release -> gets into OpenShift -> gets onto the clusters we
use? I don't have much visibility on this.

Also, if we decide to go that route (= using kube-state-metrics) (do we?), we
should update the issue acceptance criteria.

Suggestion:
Acceptance criteria:

  • Upgrade to the (yet-to-be-released) version of kube-state-metrics in our
    environments.
  • Implement a dashboard for historical data analysis using these metrics.
  • Implement alerts for abnormal-behavior detection.

Description:

Use kube_pod_container_status_last_terminated_exitcode from
kube-state-metrics in conjunction with labels from argo-workflows as the main
metric source for the dashboard and alerts.
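For illustration, the kind of query this would enable could look like the sketch below (Python constants holding PromQL). This assumes a kube-state-metrics release that ships kube_pod_container_status_last_terminated_exitcode, that the Argo pod label workflows.argoproj.io/workflow is allowlisted for kube_pod_labels, and that the alert namespace/threshold are placeholders.

    # Sketch only: PromQL expressions (held as Python constants) that a dashboard
    # panel and an alert could be built on, assuming kube-state-metrics exposes
    # kube_pod_container_status_last_terminated_exitcode and pod labels are
    # allowlisted (e.g. --metric-labels-allowlist=pods=[workflows.argoproj.io/workflow]).

    # Panel: last exit code per container, joined with the owning Argo workflow name.
    # kube_pod_labels sanitizes label keys, so workflows.argoproj.io/workflow becomes
    # label_workflows_argoproj_io_workflow.
    PANEL_QUERY = (
        "kube_pod_container_status_last_terminated_exitcode "
        "* on (namespace, pod) group_left(label_workflows_argoproj_io_workflow) "
        "kube_pod_labels"
    )

    # Alert: too many containers whose last termination had a non-zero exit code.
    # The namespace and the threshold of 5 are arbitrary placeholders.
    ALERT_EXPR = (
        'count(kube_pod_container_status_last_terminated_exitcode'
        '{namespace="thoth-backend-prod"} != 0) > 5'
    )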

@goern
Member

goern commented Sep 12, 2022

Sounds good to me. Which of the parts is on op1st and which on us?

@VannTen
Member

VannTen commented Sep 13, 2022

They have the producer items (upgrading kube-state-metrics); we have the consumer ones (creating the dashboard + alerts), assuming those are handled as application components in thoth-station.

@goern
Member

goern commented Sep 13, 2022

@VannTen did you open an issue to update kube-state-metrics?

@VannTen
Member

VannTen commented Sep 13, 2022 via email

@goern
Member

goern commented Sep 22, 2022

ACK
/remove-lifecycle frozen

@sesheta sesheta removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 22, 2022
@harshad16 harshad16 moved this from 🔖 Next to Blocked in Planning Board Sep 29, 2022
@VannTen
Member

VannTen commented Jan 10, 2023

It looks like we should monitor https://github.com/openshift/cluster-monitoring-operator and/or https://github.com/openshift/kube-state-metrics.

I'll check the git history later to see if the exit_code PR is there, and in which release branch.

@VannTen VannTen removed their assignment Nov 27, 2023