Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

R-Studio · 2022-10-24T06:05:21Z

Is your feature request related to a problem? Please describe.
No existing problem.

Describe the solution you'd like
Grafana Alerting is a great new feature, but the current grafana-operator does not support a CRD definition for the decoupled alerts, contact points, alert groups, notification policies and silences. The desired solution would be CRDs for each of these so that the new alerting tools are equally accessible via yaml.

kind: GrafanaAlertRule
kind: GrafanaAlertGroup
kind: GrafanaContactPoint
kind: GrafanaNotificationPolicy
kind: GrafanaSilence

Describe alternatives you've considered
A workaround being considered is to manually mount the relevant config files into the running container via VolumeMount from a ConfigMap containing raw file contents. The only other alternative I am aware of is to manually enter these into the Grafana web console, which isn't the point.

This feature request is not new but has been reopened since the previous one was closed: #564

pb82 · 2022-11-08T13:12:40Z

This could be interesting for v5 (https://github.com/grafana-operator/grafana-operator-experimental).

pb82 · 2022-11-08T13:13:31Z

@R-Studio i've transferred your request to the new repository, I hope you don't mind!

R-Studio · 2023-05-09T12:11:46Z

Any news on this?

NissesSenap · 2023-05-11T21:09:23Z

So I have looked through the API of Alerts.

In your request, you mention:

kind: GrafanaAlertRule
kind: GrafanaAlertGroup
kind: GrafanaContactPoint
kind: GrafanaNotificationPolicy
kind: GrafanaSilence

Thoughts

So my understanding of GrafanaAlertGroup is that it's just a visualization of groups that you define in your AlertRule, so there is no reason for us to provide it as a CRD.

kind: GrafanaSilence I don't think silence is something that a normal user ever would do through a CRD. You just go in and silence the current alert, and you provide a reason and for how long in the UI.

If we added a CRD for GrafanaContactPoint I think we would also need to create a separate CRD for GrafanaContactPointTemplate (message template) or something like that.

When it comes to GrafanaNotificationPolicy we once again have to add another functionality which is Mute timing which is another API. Since they can be reused by multiple Notification policies it's probably best to have a seperate CRD for that one as well, something like GrafanaNotificationMuteTiming.

But all of the above things are possible to solve and could still be "user friendly" as a CRD.

So we have two big issues left.

Big issues

Let's start with the core feature of GrafanaAlertRule, so there is no way to export the rule through the UI (at least not what I can find).
So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.

The other option is that you write the alert through the UI and then export the alerts by doing API requests to Grafana.
Sure, we could write a small tool to do this, but I don't think it's reasonable for a normal developer to do this as a part of writing an alert.

The second big issue is that Silence contains state, since it's not reasonable for a developer to create a CRD to mute an alert this state needs to be stored else your silence will disappear when your grafana instance is restarted.
So this would more or less force everyone to start using postgres to run grafana.

In short, I think this is a really hard issue to solve, and I don't have any good solutions for it, especially the way of working around alert management.

Possible workaround

So if you are a prometheus-operator user I would recommend using the AlertManager, grafana supports importing external alert managers as a datasource and it got support to write back silences to the alertmanager. That way the alerts silance is stored in the alert manager which already got state. This is how I solved it at my last company.

Alerts management creation isn't amazing in prometheus-operator CRD ether, but if you are used to writing promql it's not a big problem.

The problem with this is of course, what about those of us that uses Grafana for none prometheus datasources like GCP, well to be honest we are kind of out of luck when it comes to the perfect GitOps alerts as code workflow that we all are dreaming of. Sure it might be possible to use GCP as a datasource for your alerts as well (I haven't tried) but you probably want to keep your developers in the grafana UI and not go over to GCP just to mute a alert.

Conclusion

If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts.
If you have mutliple datasources like grafana you probably won't be able to have the perfect GitOps workflow with alerts as code.
You will probably have to solve it by doing klick ops together with a postgres database together with grafana, else you are probably forced to manage mutliple UIs.

I don't think there is any reasonable way to solve this through the grafana-operator and creating new CRD

Please don't agree with my, I would love to hear an idea how these issues could be solved because I want this feature as well.

What are your thoughts on this @R-Studio and the other people that want this feature.
Do you agree with my current assessment? If not, please share your thoughts on how we should solve these issues.

apryor6 · 2023-05-11T23:14:05Z

So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.

Could you say more about what this means? My naive assumption (keeping in mind that although I'm a user of this project and grafana itself I'm not an authority on the dev SDKs) is that there exists an API to perform CRUD ops on these rules, as that API must be what the grafana web console is interacting with. Thus the CRD would be a yaml-definable mapping to that same API interface.

As to the statefulness of one of more attributes.. I could see value in having a field be specifyable as "only_on_create" or whatever a good name is to indicate that the value of said field in yaml is to be set on original creation and then no longer managed by the CRD. This would allow as-code creation of alerts and then subsequent management of, for example, silencing to be done via the web console.

FWIW in my use cases the primary value of the CRD is that I have as-code definitions of my grafana objects for better gitops. I want to roll out newly developed alerts and dashboards through each environment with low friction.

NissesSenap · 2023-05-12T06:58:09Z

So how it would look isn't an easy question to answer and It would take many hours to come up with a suggestion that would even start to being good enough to discuss.
I don't have that time, but if anyone else does, please go for it. I'm happy to discuss it.

But I created a small app to try to get a feeling for the API.

Looking at the SDK it's very focused on folders and groups which would make this even harder.
Apparently you add your alert to a folder.

My app will list all alerts that you have in a specific group called group and a folder called alerts.

package main

import (
	"fmt"
	"net/url"

	gapi "github.com/grafana/grafana-api-golang-client"
)

func main() {
	// Create gapi config
	config := gapi.Config{
		BasicAuth: url.UserPassword("root", "secret"),
	}

	// Create a new client
	client, err := gapi.New("http://localhost:3000", config)
	if err != nil {
		panic(err)
	}

	folderList, err := client.Folders()
	if err != nil {
		panic(err)
	}
	// Look through the list of folders and get the uid
	for _, folder := range folderList {
		fmt.Println(folder.UID)
		listRules, err := client.AlertRuleGroup(folder.UID, "group")
		if err != nil {
			panic(err)
		}
		fmt.Println(listRules)

	}
}

Basic grafana instance

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: "console"
    auth:
      disable_login_form: "false"
    security:
      admin_user: root
      admin_password: secret

I port-forward to my instance and I created an alert manually in my grafana instance and then ran my script.
You will of course have to adapt your folder and group settings to match your alert definition.

It provides the following output.

{
    group rDzwzO8Vz 60 [
        {map[description:if this happens the world has ended
            ] A [
                0xc000180230
            ] Alerting rDzwzO8Vz 1 map[hello:world
            ] NoData 1 group test xu9Qkd8Vz 2023-05-12 06: 14: 35 +0000 UTC 5m 0s  false
        }
    ]
}

So if you want to give the CRD definition I would start reading the specs and playing around a bit with the SDK because I personally don't think the API documentation is great... there are no examples in the output.

https://grafana.com/docs/grafana/latest/developers/http_api/alerting_provisioning/
https://editor.swagger.io/?url=https://raw.githubusercontent.com/grafana/grafana/main/pkg/services/ngalert/api/tooling/post.json

pbrissaud · 2023-05-12T13:46:29Z

If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts.

Totally agree. I'm also using with prom-operator and it works fine.

Grafana alerting is a good feature in the UI but I think that APIs are not mature enough to work with third party software (like the operator).

siegenthalerroger · 2023-06-29T09:48:07Z

Somehow I missed this issue when I searched for relevant ones but here's my two cents.

Given the scope of this, I believe it may be necessary to build this iteratively. Regardless of any specific design, supporting this will create more CRDs than the project has at all already. It's no small task but it's not necessary to have everything at once IMO.
I believe the Grafana APIs that are necessary are stable enough to begin work on this as only the "provisioning" API is required for what the grafana-operator will do. The "unstable" part are the proxied AlertManager/Prometheus APIs that aren't about defining rules/policies/etc.
It is necessary that one pays attention to the complimentary nature of Grafana's unified alerting that extends (and somewhat replaces parts of) the Prometheus/AlertManager stack. For the grafana-operator this means ensuring that the functionality is designed with the existing prometheus-operator in mind. Specifically, I think in a first step one can completely ignore Grafana's AlertRules and defering that to PrometheusRules.

I'd be willing to draw up a proposal/design for this, however I can't assist with the implementation in the near future sadly.

NissesSenap · 2023-06-29T10:11:41Z

@siegenthalerroger , creating a design proposal is a great start.
We have been trying to do it through PRs since it's a simple way of providing feedback.
You can view a design document that we have done earlier here: https://github.com/grafana-operator/grafana-operator/tree/master/docs/docs/proposals

How the document should look isn't set in stone, but if you take a look at, you will see what we are after.

Looking forward to seeing the design :)

rammelmueller · 2023-07-05T14:11:16Z

@siegenthalerroger I'd be willing to help out here - I agree, that an iterative approach is good (enough) here. As for implementation - I hope to be able to help out here, but time is also an issue sadly.

NissesSenap · 2023-07-05T16:14:38Z

@rammelmueller please create a design document as a first step. Looking forward to a PR around it

siegenthalerroger · 2023-07-06T12:50:36Z

@rammelmueller I've created an initial proposal that's missing any/all of the specification for the new CRDs. Feel free to reach out if you want to assist.

This picks up #911 and #1144. The proposal contains three different options for realizing alerting support in the operator and should serve as a base for discussion regarding this topic.

AlexEndris · 2024-02-02T13:17:15Z

Hi. I just wanted to ask if there is already more than just the proposal, like a decision it's going to happen or not?

hubeadmin · 2024-02-02T13:36:08Z

@AlexEndris Hey! it is happening, @theSuess is planning on picking it up in the coming days I believe

AlexEndris · 2024-02-02T13:39:39Z

That's great. I'd be willing to support, if I can, although I have 0 go experience. But I surely could play around with test versions.

NissesSenap · 2024-02-15T13:01:47Z

We got an initial PR that can be viewed here: #1420

NissesSenap · 2024-03-05T14:49:33Z

We would love some feedback around #1420, if you can try it out and see that it's working as you expect that would be great.
We are currently going through the final reviews and hopefully we should merge the feature soon.

R-Studio · 2024-03-08T16:39:51Z

@NissesSenap I hope to find time to test it in the next few weeks.

theSuess · 2024-03-13T08:44:04Z

Closing this issue as initial alerting support has been implemented in #1420.

Follow-up issues for Contact Points & Notification Policies are #1455 and #1454

AlexEndris · 2024-04-08T07:42:48Z

There is no issue for implementing GrafanaAlertRule I could track, yet. Right? Or was this already implemented by #1420?

hubeadmin · 2024-04-08T08:01:12Z

I believe alert rules are sub-resources of the GrafanaAlertGroup CR, see

grafana-operator/api/v1beta1/grafanaalertrulegroup_types.go

Line 58 in ba4734b

type AlertRule struct {

AlexEndris · 2024-04-08T08:13:20Z

Oh, then I must have missed that completely. Thank you! And Thx @theSuess for the implementation. Came just in time when I needed it! :)

R-Studio added the enhancement New feature or request label Oct 24, 2022

pb82 transferred this issue from grafana/grafana-operator Nov 8, 2022

NissesSenap mentioned this issue Nov 15, 2022

Design/Managing External Grafana Instance grafana-operator/grafana-operator-experimental#39

Merged

NissesSenap transferred this issue from grafana-operator/grafana-operator-experimental Mar 8, 2023

NissesSenap added v5 triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 8, 2023

NissesSenap added the help wanted Extra attention is needed label May 11, 2023

NissesSenap mentioned this issue Jun 27, 2023

Support for Unified Alerting #1125

Closed

siegenthalerroger mentioned this issue Jul 6, 2023

Draft: Unified Alerting support proposal #1144

Closed

theSuess mentioned this issue Dec 13, 2023

docs: add proposal for grafana alerting #1349

Merged

hubeadmin assigned theSuess Feb 2, 2024

theSuess closed this as completed Mar 13, 2024

nlamirault mentioned this issue Nov 14, 2024

Add support for Mute Timing #1761

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

R-Studio commented Oct 24, 2022

pb82 commented Nov 8, 2022

pb82 commented Nov 8, 2022

R-Studio commented May 9, 2023

NissesSenap commented May 11, 2023

apryor6 commented May 11, 2023

NissesSenap commented May 12, 2023

pbrissaud commented May 12, 2023

siegenthalerroger commented Jun 29, 2023

NissesSenap commented Jun 29, 2023

rammelmueller commented Jul 5, 2023

NissesSenap commented Jul 5, 2023

siegenthalerroger commented Jul 6, 2023

AlexEndris commented Feb 2, 2024

hubeadmin commented Feb 2, 2024

AlexEndris commented Feb 2, 2024

NissesSenap commented Feb 15, 2024

NissesSenap commented Mar 5, 2024

R-Studio commented Mar 8, 2024

theSuess commented Mar 13, 2024

AlexEndris commented Apr 8, 2024

hubeadmin commented Apr 8, 2024

AlexEndris commented Apr 8, 2024

Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

Comments

R-Studio commented Oct 24, 2022

pb82 commented Nov 8, 2022

pb82 commented Nov 8, 2022

R-Studio commented May 9, 2023

NissesSenap commented May 11, 2023

Thoughts

Big issues

Possible workaround

Conclusion

apryor6 commented May 11, 2023

NissesSenap commented May 12, 2023

pbrissaud commented May 12, 2023

siegenthalerroger commented Jun 29, 2023

NissesSenap commented Jun 29, 2023

rammelmueller commented Jul 5, 2023

NissesSenap commented Jul 5, 2023

siegenthalerroger commented Jul 6, 2023

AlexEndris commented Feb 2, 2024

hubeadmin commented Feb 2, 2024

AlexEndris commented Feb 2, 2024

NissesSenap commented Feb 15, 2024

NissesSenap commented Mar 5, 2024

R-Studio commented Mar 8, 2024

theSuess commented Mar 13, 2024

AlexEndris commented Apr 8, 2024

hubeadmin commented Apr 8, 2024

AlexEndris commented Apr 8, 2024