add alerts #18

Open
TheKangaroo opened this issue Dec 14, 2023 · 6 comments

Comments

@TheKangaroo

I had a hard time finding reference Prometheus alerts (PrometheusRules) to set up actual alerting in addition to my Flux monitoring and dashboards.
So I decided to build some alerts for our setup myself.
If this is something you are interested in adding to this repo, I'll be happy to send you a PR with some basic PrometheusRules.

@darkowlzz
Contributor

Hi, at present, this repository only provides the basic Flux monitoring setup using kube-prometheus-stack for the Flux monitoring docs https://fluxcd.io/flux/monitoring. The alerts page in the docs refers to alerting using the Flux notification-controller. Since you mentioned Prometheus alerts, I'm assuming you would like to set up alerts on Prometheus metrics. Alertmanager is disabled in the example configuration. I believe alerting is subjective and depends on the user and their environment: some may like to use Prometheus Alertmanager, others may prefer Grafana for the same purpose. We assume that users of these monitoring systems know how to configure them themselves, so we only provide a minimal example to get started. This repository only serves as an example and shouldn't be consumed directly, as we don't offer a compatibility guarantee. We would prefer to avoid silently breaking users' alerts with an update to this repository. It is recommended to use this repository only as a reference and build your own monitoring configuration for your environment.
I hope this helps explain why we don't have examples for alerts on metrics in this repository. But depending on user feedback, we may provide more examples that are easier to maintain in the long run.

@TheKangaroo
Author

Okay, sure.
Let me explain why I thought it would be a good idea to add some example alerts.
I have been working with flux for half a year now and the last piece missing to go into production was some sort of notification about failed flux resources. I skimmed through the monitoring documentation and as we use the kube-prometheus-stack a lot I thought I should use the alertmanager provider. I couldn't get it to work and still have no idea how it's supposed to work. Teams provider with the same config works like a charm, but we need to push events to alertmanager for further distribution to receivers (mostly opsgenie).
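For reference, a notification-controller setup with the alertmanager provider typically consists of a Provider and an Alert. The following is only a minimal sketch, not the configuration used in this thread: the resource names, the `flux-system` namespace, and the Alertmanager Service address are all assumptions and need to be adjusted to the actual cluster.

```yaml
# Sketch only: names, namespaces, and the address are assumptions.
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager          # hypothetical name
  namespace: flux-system
spec:
  type: alertmanager
  # Assumed in-cluster Service of a kube-prometheus-stack release
  # installed in the "monitoring" namespace; adjust to your setup.
  address: http://kube-prometheus-stack-alertmanager.monitoring:9093/api/v2/alerts/
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: flux-failures         # hypothetical name
  namespace: flux-system
spec:
  providerRef:
    name: alertmanager        # must match the Provider above
  eventSeverity: error        # forward only error events
  eventSources:
    - kind: Kustomization
      name: '*'
    - kind: HelmRelease
      name: '*'
```

Alertmanager still needs its own route and receiver (for example Opsgenie) before anything forwarded this way reaches a person.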
I was using the monitoring and dashboards from this repo though, and remembered that the grafana dashboard already had the information I needed from KSM, so I decided to search online for Prometheus alerts. I couldn't find any, so I went ahead and built my own set of alerts.

On the point about alerting being opinionated: I think using Alertmanager (with different backends behind it) is the default monitoring pipeline for kube-prometheus-stack users, and since the KSM config is already present in this repo, I think it is just a matter of enabling Alertmanager and adding PrometheusRules.
But I'm just a new user of flux and maybe I lack the overview of monitoring and alerting in flux.
I'm perfectly fine with not adding the alerts to this repo, I just wanted to spare someone the pain of writing the same Prometheus alerts I did in the past :)

Feel free to close this issue if it doesn't fit the scope of this minimal example repo.

@antonblr

antonblr commented Jan 4, 2024

@TheKangaroo - I'd love to see your PR with a PrometheusRule covering the Flux2 operational cases worth alerting on. If not here, I think your PR would be happily accepted at https://github.com/samber/awesome-prometheus-alerts/blob/master/CONTRIBUTING.md.

They have sample rules for ArgoCD: https://samber.github.io/awesome-prometheus-alerts/rules.html#argocd, so Flux2 CD rules will fit there just fine.

@TheKangaroo
Author

@antonblr I don't know if it's possible to add these alerts to awesome-prometheus-alerts as they rely on the custom kube-state-metrics config in this repo.
If it is possible to add this as a usage description in awesome-prometheus-alerts, I'll be happy to provide a PR there.

@antonblr

antonblr commented Jan 5, 2024

@TheKangaroo - I see. Yeah, looks like all samples there are built around already exposed metrics. But let's wait for what they say.

@kingdonb
Member

kingdonb commented Apr 23, 2024

I have just seen your post, sorry for slow response!

There actually used to be an alertmanager example in the Flux docs, but it was lost in a refactor some time ago.

It was a bit problematic because the example did not come with detailed instructions on how to configure Alertmanager; it was just an alert that assumed you had already done that. We discussed this one week at Bug Scrub and realized that a new Kubernetes and Flux user following our Prometheus guide has most certainly not already configured Alertmanager for themselves 😆 The alert addition to the guide is incomplete without that addendum.

I have one of my clusters still configured to use AlertManager, with some custom alerts and other configuration based on the earlier Flux monitoring example here:

https://github.com/kingdonb/flux2/tree/monitoring

It is very far behind and cannot easily be rebased now because of the refactor into a separate repo. But I will try to cobble something together out of this experience and make a minimum viable guide for setting up Alertmanager with Flux on a new cluster.

In the meantime, the examples I can already contribute are mixed in here with a deprecation notice:

https://github.com/kingdonb/flux2/tree/monitoring/manifests/monitoring

https://github.com/kingdonb/flux2/blob/ddf3c495133a2e49e20c97588887f01bb2f6b104/manifests/monitoring/kube-prometheus-stack/release.yaml#L460-L468
^ here is the specific rule:

              - name: GitOpsToolkit
                rules:
                  - alert: ReconciliationFailure
                    expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (exported_namespace, name, kind) + on(exported_namespace, name, kind) (max(gotk_reconcile_condition{status="Deleted"}) by (exported_namespace, name, kind)) * 2 == 1
                    for: 15m
                    labels:
                      severity: page
                    annotations:
                      summary: '{{ $labels.kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} reconciliation has been failing for more than 15 minutes.'

which you can find historically in the flux2 docs, if you dig past the genesis of the flux2-monitoring-example repo in the website history, where that doc once lived.

Edit: I will have to update that one, as it still uses the Deprecated Resource Metric
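As a rough idea of what an updated rule might look like, here is a sketch written against the kube-state-metrics based `gotk_resource_info` metric instead of the deprecated `gotk_reconcile_condition`. It is not taken from any repository. It assumes the custom kube-state-metrics config from flux2-monitoring-example is in place and exposes the `ready`, `suspended`, `customresource_kind`, `exported_namespace`, and `name` labels; the resource name, namespace, and `release` label are placeholders that have to match your Prometheus rule selector.

```yaml
# Sketch only: metadata and label names are assumptions, see the text above.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-alerts                   # hypothetical name
  namespace: monitoring               # assumed namespace
  labels:
    release: kube-prometheus-stack    # assumed; must match the Prometheus ruleSelector
spec:
  groups:
    - name: GitOpsToolkit
      rules:
        - alert: ReconciliationFailure
          # Aggregate to stable labels so that changes in other labels
          # do not reset the "for" timer; skip suspended resources.
          expr: |
            max by (exported_namespace, name, customresource_kind) (
              gotk_resource_info{ready="False", suspended!="true"}
            )
          for: 15m
          labels:
            severity: page
          annotations:
            summary: '{{ $labels.customresource_kind }} {{ $labels.exported_namespace }}/{{ $labels.name }} has not been ready for more than 15 minutes.'
```

The `Deleted` term from the older expression is left out here on the assumption that kube-state-metrics stops exporting a series once the object is gone.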
