add alerts #18

TheKangaroo · 2023-12-14T12:45:37Z

I had a hard time finding reference Prometheus alerts (PrometheusRules) to set up actual alerting in addition to my Flux monitoring and dashboards.
So I decided to build some alerts for our setup myself.
If this is something you are interested in adding to this repo, I'll be happy to send you a PR with some basic PrometheusRules.

darkowlzz · 2023-12-18T13:49:24Z

Hi, at present this repository only provides the basic Flux monitoring setup using kube-prometheus-stack for the Flux monitoring docs https://fluxcd.io/flux/monitoring. The alerts page in the docs refers to alerting using the Flux notification-controller. Since you have mentioned Prometheus alerts, I'm assuming you would like to set up alerts on Prometheus metrics. Alertmanager is disabled in the example configuration. I believe alerting is subjective and depends on the user and their environment: some may like to use Prometheus Alertmanager, others may prefer Grafana for the same. I think we have an assumption here that the users of these monitoring systems know how to configure them themselves, and we only provide a minimal example to get started. This repository only serves as an example and shouldn't be consumed directly, as we don't offer a compatibility guarantee. I think we would prefer to avoid silently breaking alerts for users with an update to this repository. It is recommended to use this repository only as a reference and build your own monitoring configuration for your environment.
I hope this helps explain why we don't have examples for alerts on metrics in this repository. But maybe, depending on user feedback, we can provide more examples that are easier to maintain in the long run.

TheKangaroo · 2023-12-18T14:18:27Z

Okay, sure.
Let me explain why I thought it would be a good idea to add some example alerts.
I have been working with flux for half a year now and the last piece missing to go into production was some sort of notification about failed flux resources. I skimmed through the monitoring documentation and as we use the kube-prometheus-stack a lot I thought I should use the alertmanager provider. I couldn't get it to work and still have no idea how it's supposed to work. Teams provider with the same config works like a charm, but we need to push events to alertmanager for further distribution to receivers (mostly opsgenie).
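For readers following along, here is a minimal sketch of what an alertmanager provider setup for notification-controller could look like. It is not a diagnosis of the problem described above. The API version (`v1beta3`), the resource names, the namespace, and the in-cluster Alertmanager address are all assumptions for illustration and need to be adapted to your Flux version and cluster.

```yaml
# Sketch only: names, namespace and address below are hypothetical.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager
  namespace: flux-system
spec:
  type: alertmanager
  # Assumed in-cluster service name; point this at your Alertmanager's v2 alerts endpoint.
  address: http://kube-prometheus-stack-alertmanager.monitoring:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: flux-errors
  namespace: flux-system
spec:
  providerRef:
    name: alertmanager
  # Forward only error events.
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```

Alertmanager then still needs its own route and receiver (for example opsgenie) to pass these events on.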
I was using the monitoring and dashboards from this repo though, and remembered that the grafana dashboard already had the information I needed from KSM, so I decided to search online for Prometheus alerts. I couldn't find any, so I went ahead and built my own set of alerts.

On the point about alerting being opinionated: I think using alertmanager (with different backends behind it) is the default monitoring pipeline for kube-prometheus-stack users, and since the KSM config is already present in this repo, I think it is just a matter of enabling alertmanager and adding PrometheusRules.
But I'm just a new user of flux and maybe I lack the overview of monitoring and alerting in flux.
I'm perfectly fine with not adding the alerts to this repo, I just wanted to spare someone the pain of writing the same Prometheus alerts I did in the past :)
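As a rough sketch of the "enabling alertmanager" step mentioned above, the kube-prometheus-stack Helm values would need something like the fragment below. Where exactly these values live (for example under `spec.values` of a HelmRelease) depends on your setup and is an assumption here.

```yaml
# Values fragment for the kube-prometheus-stack chart (sketch, not a full values file).
alertmanager:
  # The example configuration in this repo has Alertmanager disabled.
  enabled: true
# Alert rules themselves can then be added as separate PrometheusRule
# resources, or through the chart's own values.
```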

Feel free to close this issue if it doesn't fit the scope of this minimal example repo.

antonblr · 2024-01-04T22:50:56Z

@TheKangaroo - I'd love to see your PR with PrometheusRule covering Flux2 operational cases worth alerting. If not here, your PR will be happily accepted at https://github.com/samber/awesome-prometheus-alerts/blob/master/CONTRIBUTING.md, I think.

They have sample rules for ArgoCD: https://samber.github.io/awesome-prometheus-alerts/rules.html#argocd, so Flux2 CD rules will fit there just fine.

TheKangaroo · 2024-01-05T07:42:27Z

@antonblr I don't know if it's possible to add these alerts to awesome-prometheus-alerts as they rely on the custom kube-state-metrics config in this repo.
If it is possible to add this as a usage description in awesome-prometheus-alerts, I'll be happy to provide a PR there.

antonblr · 2024-01-05T16:46:49Z

@TheKangaroo - I see. Yeah, looks like all samples there are built around already exposed metrics. But let's wait for what they say.

kingdonb · 2024-04-23T13:57:37Z

I have just seen your post, sorry for slow response!

There actually used to be an alertmanager example in the Flux docs, but it was lost in a refactor some time ago.

It was a bit problematic because the example did not come with detailed instructions on how to configure the AlertManager - it was just an alert that assumed you had already done that. We discussed this one week at Bug Scrub and understood that if I am a new Kubernetes and Flux user following our Prometheus guide, I most certainly have not already configured the AlertManager for myself 😆 so the Alert addition to the guide is incomplete without that addendum.

I have one of my clusters still configured to use AlertManager, with some custom alerts and other configuration based on the earlier Flux monitoring example here:

https://github.com/kingdonb/flux2/tree/monitoring

It is very far behind and cannot easily be rebased now because of the refactor into a separate repo. But I will try to cobble something together out of this experience and make a minimum viable guide for setting up AlertManager with Flux on a new cluster.

In the meantime, the examples I can already contribute are mixed in here, with a deprecation notice:

https://github.com/kingdonb/flux2/tree/monitoring/manifests/monitoring

https://github.com/kingdonb/flux2/blob/ddf3c495133a2e49e20c97588887f01bb2f6b104/manifests/monitoring/kube-prometheus-stack/release.yaml#L460-L468
^ here is the specific rule:

              - name: GitOpsToolkit
                rules:
                  - alert: ReconciliationFailure
                    expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind) + on(exported_namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
                    for: 15m
                    labels:
                      severity: page
                    annotations:
                      summary: '{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than 15 minutes.'

which you can find historically in the flux2 docs, if you dig past the genesis of the flux2-monitoring-example repo in the website history, where that doc once lived.

Edit: I will have to update that one, as it still uses the deprecated resource metric (`gotk_reconcile_condition`).
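A possible shape for an updated rule, as a hedged sketch only: it assumes the custom kube-state-metrics config from this repo is in place and exposes a `gotk_resource_info` metric with `ready`, `name`, `exported_namespace` and `customresource_kind` labels. The resource name, namespace and the `release` label are hypothetical; the label only matters if your Prometheus selects rules by it.

```yaml
# Sketch only: metric and label names assume this repo's kube-state-metrics config.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-alerts            # hypothetical name
  namespace: monitoring        # hypothetical namespace
  labels:
    release: kube-prometheus-stack   # assumption: matches your Prometheus ruleSelector
spec:
  groups:
    - name: GitOpsToolkit
      rules:
        - alert: ReconciliationFailure
          # Fires for any Flux resource whose Ready condition is reported as False.
          expr: max by (exported_namespace, name, customresource_kind) (gotk_resource_info{ready="False"}) == 1
          for: 15m
          labels:
            severity: page
          annotations:
            summary: '{{ $labels.customresource_kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} has not been ready for more than 15 minutes.'
```

Suspended resources are not filtered out in this sketch, so you may want to add a matcher for that if your metric config exposes one.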

TheKangaroo mentioned this issue Jan 5, 2024

flux alerts samber/awesome-prometheus-alerts#396

Open
