
Have a metric that introspects why pods failed in the cluster #725

Open
3 tasks
fridex opened this issue Jun 24, 2021 · 29 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. sig/observability Categorizes an issue or PR as relevant to SIG Observability triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@fridex
Contributor

fridex commented Jun 24, 2021

Is your feature request related to a problem? Please describe.

As a Thoth operator, I would like to know why solvers failed in the cluster (e.g. whether they failed due to OOM).

As a Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about the exit code returned by the corresponding container in a workflow.

We can sync on how these components return exit codes and on the semantics behind those exit codes.

Acceptance criteria

  • Explore the existing metrics and workflow metrics and document them for the pod-failure case described above
  • Design the metrics to be included in metrics-exporter to expose metrics for the failure or error case
  • Design and implement a dashboard panel based on these metrics.
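To make the proposed metric more concrete, below is a minimal sketch of how such an exporter could look, assuming a Prometheus-style metrics-exporter written in Python. The metric name, label set, namespace, and Argo label selector are illustrative assumptions, not the project's actual conventions.

    # Hypothetical sketch (not the actual metrics-exporter code): expose the last
    # terminated exit code of workflow containers as a Prometheus gauge.
    from kubernetes import client, config
    from prometheus_client import Gauge, start_http_server

    # Metric and label names are made up for illustration.
    EXIT_CODE = Gauge(
        "thoth_workflow_container_last_exit_code",
        "Last exit code of a terminated container in a workflow pod.",
        ["namespace", "pod", "container"],
    )

    def collect_exit_codes(namespace: str = "thoth-backend-prod") -> None:
        """Scan workflow pods and record the exit code of terminated containers."""
        v1 = client.CoreV1Api()
        # Selecting pods by the Argo workflow label is an assumption about how
        # solver/adviser pods are labeled in the cluster.
        pods = v1.list_namespaced_pod(namespace, label_selector="workflows.argoproj.io/workflow")
        for pod in pods.items:
            for status in pod.status.container_statuses or []:
                terminated = status.state.terminated or status.last_state.terminated
                if terminated is not None:
                    EXIT_CODE.labels(namespace, pod.metadata.name, status.name).set(terminated.exit_code)

    if __name__ == "__main__":
        config.load_incluster_config()  # or config.load_kube_config() outside the cluster
        start_http_server(8080)         # serve /metrics for Prometheus to scrape
        collect_exit_codes()            # in practice this would run periodically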
@fridex fridex added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 24, 2021
@pacospace
Contributor

pacospace commented Jun 29, 2021

Is your feature request related to a problem? Please describe.

As a Thoth operator, I would like to know why solvers failed in the cluster (e.g. whether they failed due to OOM).

As a Thoth operator, I would like to know why advisers failed in the cluster (e.g. wrong user inputs, ...).

Describe the solution you'd like

Have a metric that exposes information about the exit code returned by the corresponding container in a workflow.

We can sync on how these components return exit codes and on the semantics behind those exit codes.

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

@pacospace
Contributor

Screenshot from 2021-06-29 18-57-16

@fridex
Contributor Author

fridex commented Jun 30, 2021

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

Are these computed by the reporter based on documents stored on Ceph?

@pacospace
Contributor

pacospace commented Jun 30, 2021

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

Are these computed by the reporter based on documents stored on Ceph?

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

@fridex
Contributor Author

fridex commented Jun 30, 2021

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect a platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):

    The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist
    LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1
                                        ^

    [SQL: SELECT EXISTS (SELECT *
           FROM depends_on
           WHERE depends_on.platform = %(platform_1)s) AS anon_1]

    (Background on this error at: http://sqlalche.me/e/13/f405)

With metrics reported by the reporter, we will know about this issue one day later, not in real time - that will not give us insight into how the system works right now and what actions should be taken to recover from the error state.

If the situation with an inconsistent system accidentally occurs again someday in the future, we should be alerted: "the recommender system is giving too many errors in adviser pods with these exit codes, the system operator should have a look at it". That way, we will keep the system up and make sure that if there is any misbehavior, the system operator looks at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing; having info about failed workflows (e.g. the platform fails to bring a pod up) is another thing to consider in this case.

@pacospace
Contributor

pacospace commented Jun 30, 2021

Yes, the thoth reporter analyzes the previous day every morning; we can make this analysis more frequent during the day to collect more data points. wdyt?

Daily sounds reasonable. 👍🏻

We have this metric already; it gives back the percentage of justifications with ERROR for the failed advisers, including failures due to OOM or exceeded CPU. Is it enough, or do we need to look at the individual exit codes? wdyt @fridex ?

So back to this one. An example to motivate the $SUBJ metric: as of now, our prod environment fails to give any recommendations as it is in an inconsistent state (thoth-station/thoth-application#1766) - database queries expect a platform column but that column does not exist in the database, hence adviser fails with the following error (and corresponding exit code):

    The resolution failed as an error was encountered: Failed to run pipeline boot 'PlatformBoot': (psycopg2.errors.UndefinedColumn) column depends_on.platform does not exist
    LINE 3: WHERE depends_on.platform = 'linux-x86_64') AS anon_1
                                        ^

    [SQL: SELECT EXISTS (SELECT *
           FROM depends_on
           WHERE depends_on.platform = %(platform_1)s) AS anon_1]

    (Background on this error at: http://sqlalche.me/e/13/f405)

With metrics reported by the reporter, we will know about this issue one day later, not in real time - that will not give us insight into how the system works right now and what actions should be taken to recover from the error state.

If the situation with an inconsistent system accidentally occurs again someday in the future, we should be alerted: "the recommender system is giving too many errors in adviser pods with these exit codes, the system operator should have a look at it". That way, we will keep the system up and make sure that if there is any misbehavior, the system operator looks at it immediately based on the alert (before users start to complain).

Inspecting exit codes is one thing; having info about failed workflows (e.g. the platform fails to bring a pod up) is another thing to consider in this case.

I see your point. In that case, what justification is reported by adviser? So we need to find a way to read the exit codes of the pods so they are reported immediately (at the moment we only have the percentage of adviser failures, and then we asynchronously analyze the reason from the documents on Ceph).

Screenshot from 2021-06-30 10-17-58

Here errors are decreasing but workflow failures are increasing (ocp4-stage), while the succeeded ones are not changing much.

We have another metric on the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis); if they do not match for a long time, something wrong is happening in the system (e.g. Kafka is off (another metric is available for that), the database is off).

@fridex
Contributor Author

fridex commented Jun 30, 2021

in that case what justification is reported by adviser?

There is no justification created as the pod errored; adviser reports the following error information:

    "report": {
      "ERROR": "An error occurred, see logs for more info"
    }

We have another metric on the number of requests vs. the number of reports created on Ceph (also evaluated asynchronously once per day from the Ceph analysis); if they do not match for a long time, something wrong is happening in the system (e.g. Kafka is off (another metric is available for that), the database is off).

Yes, the metric discussed before on calls is not applicable to this case - in this case, the system produces documents but does not satisfy user requests. The metric you brought up introspects whether the system produces any documents (and should alert as well if it does not).

@goern
Member

goern commented Jun 30, 2021

/priority important-soon
/remove-triage needs-information
/triage accept

@sesheta sesheta added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Jun 30, 2021
@sesheta
Member

sesheta commented Jun 30, 2021

@goern: The label(s) triage/accept cannot be applied, because the repository doesn't have them.

In response to this:

/priority important-soon
/remove-triage needs-information
/triage accept

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@goern
Member

goern commented Jun 30, 2021

/triage accepted

@sesheta sesheta added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 30, 2021
@sesheta
Member

sesheta commented Jul 30, 2021

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 30, 2021
@fridex
Contributor Author

fridex commented Jul 30, 2021

/remove-lifecycle rotten

@sesheta sesheta removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 30, 2021
@sesheta
Member

sesheta commented Aug 29, 2021

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta
Member

sesheta commented Aug 29, 2021

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sesheta sesheta closed this as completed Aug 29, 2021
@pacospace pacospace reopened this Aug 30, 2021
@harshad16 harshad16 added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 27, 2021
@goern
Member

goern commented Jan 14, 2022

/priority important-longterm

@sesheta sesheta added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 14, 2022
@pacospace pacospace removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 7, 2022
@harshad16 harshad16 added the sig/observability Categorizes an issue or PR as relevant to SIG Observability label Aug 11, 2022
@codificat
Member

/sig observability

@VannTen
Member

VannTen commented Aug 18, 2022

Potentially relevant metrics

From kube-state-metrics:

  • kube_pod_container_status_last_terminated_reason
    sample: kube_pod_container_status_last_terminated_reason{cluster="emea/balrog", container="acm-agent", endpoint="https-main", job="kube-state-metrics", namespace="open-cluster-management-agent-addon", pod="klusterlet-addon-workmgr-65c7c49798-z7jc2", prometheus="openshift-monitoring/k8s", reason="Error", service="kube-state-metrics"}

From argo workflow controller:

  • argo_workflows_error_count
    sample: argo_workflows_error_count{cause="CronWorkflowSubmissionError", cluster="emea/balrog", endpoint="metrics", field="workflow-controller-metrics-thoth-backend-prod.apps.balrog.aws.operate-first.cloud", instance="10.128.2.40:8080", job="workflow-controller-metrics", namespace="thoth-backend-prod", pod="workflow-controller-58dccdddb6-49cv8", prometheus="openshift-user-workload-monitoring/user-workload", service="workflow-controller-metrics"}

+ all the metrics documented at https://argoproj.github.io/argo-workflows/metrics/#default-controller-metrics, probably

Beyond that, custom workflow metrics (metrics defined in the Workflow spec, from what I gather) look relevant.
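As a rough illustration of how the kube-state-metrics series above could already be consumed, here is a small sketch (Python against the Prometheus HTTP API) that counts last-terminated reasons per container; the Prometheus URL and namespace are placeholders and the exact query shape is an assumption.

    # Sketch only: count how many containers last terminated with each reason
    # (OOMKilled, Error, ...) using the kube-state-metrics series mentioned above.
    import requests

    PROMETHEUS_URL = "http://prometheus.example.local:9090"  # placeholder
    NAMESPACE = "thoth-backend-prod"                          # placeholder

    # The series has value 1 for the reason a container last terminated with.
    QUERY = (
        'sum by (reason) ('
        f'kube_pod_container_status_last_terminated_reason{{namespace="{NAMESPACE}"}}'
        ')'
    )

    def last_terminated_reasons() -> dict:
        """Return a mapping of termination reason -> number of containers."""
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        return {sample["metric"]["reason"]: float(sample["value"][1]) for sample in result}

    if __name__ == "__main__":
        for reason, count in last_terminated_reasons().items():
            print(f"{reason}: {count:g}")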

@VannTen
Member

VannTen commented Aug 25, 2022

Relevant: kubernetes/kube-state-metrics#1481 (the issue is only closed because it is old, not because it was refused).

@VannTen
Member

VannTen commented Aug 25, 2022

Some opinions.
If we only need exit codes, I don't think the application is the right level to implement this:

  • The application can't access its own exit code, by definition.
  • Exit codes are universal for any kind of container (any process, in fact).
    IMO, this points to implementing the linked feature request in kube-state-metrics. -> And in fact I had not seen it, but there is a linked PR: add exit code kubernetes/kube-state-metrics#1752

Since (most of) the exit codes are ours to define, we can map them to any reason we like.
However, if the number of possible reasons is unbounded (or just > 126), we'll probably want to use another mechanism.
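To illustrate how a bounded set of exit codes could map to reasons, a hypothetical convention might look like the sketch below; the specific codes and reason names are invented for the example and are not anything adviser or solver actually use today.

    # Hypothetical exit-code convention for a Thoth component (codes are invented).
    from enum import IntEnum

    class AdviserExitCode(IntEnum):
        SUCCESS = 0
        INVALID_USER_INPUT = 10
        DATABASE_SCHEMA_MISMATCH = 11
        RESOLUTION_FAILED = 12
        # 137 is what Kubernetes reports for SIGKILL, e.g. an OOM-killed container.
        OOM_KILLED = 137

    # Exit code -> human-readable reason, usable as a metric label value.
    EXIT_CODE_REASON = {code.value: code.name.lower() for code in AdviserExitCode}

    def reason_for(exit_code: int) -> str:
        """Map an exit code to a reason label; unknown codes fall back to 'unknown'."""
        return EXIT_CODE_REASON.get(exit_code, "unknown")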

@VannTen
Member

VannTen commented Aug 26, 2022

(I'll unassign myself; I don't think we have a clear enough view yet of what we want to do with this.)
(And there was a bit of a misunderstanding about it on the SIG call.)

@VannTen VannTen removed their assignment Aug 26, 2022
@VannTen
Member

VannTen commented Sep 2, 2022

I think we should use the kube-state-metrics feature once the previously linked
PR is merged.

Unless someone has a different opinion, I propose we keep this frozen until the
PR is merged and a subsequent release of kube-state-metrics is out.

@VannTen
Member

VannTen commented Sep 5, 2022

The kube-state-metrics PR got merged.

I'll keep an eye on this when they release a new version.
/assign

@VannTen
Member

VannTen commented Sep 9, 2022

kube-state-metrics does releases something like every 2-4 months, judging from their
history. The last one was 16 days ago, so it might take some time before a new one.

Do we have an idea of what the timeline is for:

kube-state-metrics new release -> gets into OpenShift -> gets onto the clusters we
use? I don't have much visibility on this.

Also, if we decide to go that route (= using kube-state-metrics) (do we?), we
should update the issue acceptance criteria.

Suggestion:
Acceptance criteria:

  • Upgrade to the (yet-to-be-released) version of kube-state-metrics in our
    environments.
  • Implement a dashboard for historical data analysis using these metrics.
  • Implement alerts for abnormal-behavior detection.

Description:

Use kube_pod_container_status_last_terminated_exitcode from
kube-state-metrics in conjunction with labels from argo-workflows as the main
metric source for the dashboard and alerts.
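For illustration, the kind of query this would enable could look like the sketch below (Python constants holding PromQL). This assumes a kube-state-metrics release that ships kube_pod_container_status_last_terminated_exitcode, that the Argo pod label workflows.argoproj.io/workflow is allowlisted for kube_pod_labels, and that the alert namespace/threshold are placeholders.

    # Sketch only: PromQL expressions (held as Python constants) that a dashboard
    # panel and an alert could be built on, assuming kube-state-metrics exposes
    # kube_pod_container_status_last_terminated_exitcode and pod labels are
    # allowlisted (e.g. --metric-labels-allowlist=pods=[workflows.argoproj.io/workflow]).

    # Panel: last exit code per container, joined with the owning Argo workflow name.
    # kube_pod_labels sanitizes label keys, so workflows.argoproj.io/workflow becomes
    # label_workflows_argoproj_io_workflow.
    PANEL_QUERY = (
        "kube_pod_container_status_last_terminated_exitcode "
        "* on (namespace, pod) group_left(label_workflows_argoproj_io_workflow) "
        "kube_pod_labels"
    )

    # Alert: too many containers whose last termination had a non-zero exit code.
    # The namespace and the threshold of 5 are arbitrary placeholders.
    ALERT_EXPR = (
        'count(kube_pod_container_status_last_terminated_exitcode'
        '{namespace="thoth-backend-prod"} != 0) > 5'
    )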

@goern
Member

goern commented Sep 12, 2022

Sounds good to me. Which of the parts is on op1st and which on us?

@VannTen
Member

VannTen commented Sep 13, 2022

They have the producer items (upgrading kube-state-metrics); we have the consumer ones (creating the dashboard + alerts), assuming those are handled as application components in thoth-station.

@goern
Member

goern commented Sep 13, 2022

@VannTen did you open an issue to update kube-state-metrics?

@VannTen
Member

VannTen commented Sep 13, 2022 via email

@goern
Member

goern commented Sep 22, 2022

ACK
/remove-lifecycle frozen

@sesheta sesheta removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Sep 22, 2022
@harshad16 harshad16 moved this from 🔖 Next to Blocked in Planning Board Sep 29, 2022
@VannTen
Member

VannTen commented Jan 10, 2023

It looks like we should monitor https://github.com/openshift/cluster-monitoring-operator and/or https://github.com/openshift/kube-state-metrics.

I'll check the git history later to see if the exit_code PR is there, and in which release branch.

@VannTen VannTen removed their assignment Nov 27, 2023