Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CRD's for Grafana 9 Alert Rules, Alert Groups, Contact Points, Notification Policies and Silences #911

Closed
R-Studio opened this issue Oct 24, 2022 · 22 comments
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@R-Studio
Copy link

Is your feature request related to a problem? Please describe.
No existing problem.

Describe the solution you'd like
Grafana Alerting is a great new feature, but the current grafana-operator does not support a CRD definition for the decoupled alerts, contact points, alert groups, notification policies and silences. The desired solution would be CRDs for each of these so that the new alerting tools are equally accessible via yaml.

kind: GrafanaAlertRule
kind: GrafanaAlertGroup
kind: GrafanaContactPoint
kind: GrafanaNotificationPolicy
kind: GrafanaSilence

Describe alternatives you've considered
A workaround being considered is to manually mount the relevant config files into the running container via VolumeMount from a ConfigMap containing raw file contents. The only other alternative I am aware of is to manually enter these into the Grafana web console, which isn't the point.

This feature request is not new but has been reopened since the previous one was closed: #564

@R-Studio R-Studio added the enhancement New feature or request label Oct 24, 2022
@pb82
Copy link
Collaborator

pb82 commented Nov 8, 2022

This could be interesting for v5 (https://github.com/grafana-operator/grafana-operator-experimental).

@pb82 pb82 transferred this issue from grafana/grafana-operator Nov 8, 2022
@pb82
Copy link
Collaborator

pb82 commented Nov 8, 2022

@R-Studio i've transferred your request to the new repository, I hope you don't mind!

@NissesSenap NissesSenap transferred this issue from grafana-operator/grafana-operator-experimental Mar 8, 2023
@NissesSenap NissesSenap added v5 triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Mar 8, 2023
@R-Studio
Copy link
Author

R-Studio commented May 9, 2023

Any news on this?

@NissesSenap NissesSenap added the help wanted Extra attention is needed label May 11, 2023
@NissesSenap
Copy link
Collaborator

So I have looked through the API of Alerts.

In your request, you mention:

  • kind: GrafanaAlertRule
  • kind: GrafanaAlertGroup
  • kind: GrafanaContactPoint
  • kind: GrafanaNotificationPolicy
  • kind: GrafanaSilence

Thoughts

So my understanding of GrafanaAlertGroup is that it's just a visualization of groups that you define in your AlertRule, so there is no reason for us to provide it as a CRD.

kind: GrafanaSilence I don't think silence is something that a normal user ever would do through a CRD. You just go in and silence the current alert, and you provide a reason and for how long in the UI.

If we added a CRD for GrafanaContactPoint I think we would also need to create a separate CRD for GrafanaContactPointTemplate (message template) or something like that.

When it comes to GrafanaNotificationPolicy we once again have to add another functionality which is Mute timing which is another API. Since they can be reused by multiple Notification policies it's probably best to have a seperate CRD for that one as well, something like GrafanaNotificationMuteTiming.

But all of the above things are possible to solve and could still be "user friendly" as a CRD.

So we have two big issues left.

Big issues

Let's start with the core feature of GrafanaAlertRule, so there is no way to export the rule through the UI (at least not what I can find).
So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.

The other option is that you write the alert through the UI and then export the alerts by doing API requests to Grafana.
Sure, we could write a small tool to do this, but I don't think it's reasonable for a normal developer to do this as a part of writing an alert.

The second big issue is that Silence contains state, since it's not reasonable for a developer to create a CRD to mute an alert this state needs to be stored else your silence will disappear when your grafana instance is restarted.
So this would more or less force everyone to start using postgres to run grafana.

In short, I think this is a really hard issue to solve, and I don't have any good solutions for it, especially the way of working around alert management.

Possible workaround

So if you are a prometheus-operator user I would recommend using the AlertManager, grafana supports importing external alert managers as a datasource and it got support to write back silences to the alertmanager. That way the alerts silance is stored in the alert manager which already got state. This is how I solved it at my last company.

Alerts management creation isn't amazing in prometheus-operator CRD ether, but if you are used to writing promql it's not a big problem.

The problem with this is of course, what about those of us that uses Grafana for none prometheus datasources like GCP, well to be honest we are kind of out of luck when it comes to the perfect GitOps alerts as code workflow that we all are dreaming of. Sure it might be possible to use GCP as a datasource for your alerts as well (I haven't tried) but you probably want to keep your developers in the grafana UI and not go over to GCP just to mute a alert.

Conclusion

If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts.
If you have mutliple datasources like grafana you probably won't be able to have the perfect GitOps workflow with alerts as code.
You will probably have to solve it by doing klick ops together with a postgres database together with grafana, else you are probably forced to manage mutliple UIs.

I don't think there is any reasonable way to solve this through the grafana-operator and creating new CRD

Please don't agree with my, I would love to hear an idea how these issues could be solved because I want this feature as well.

What are your thoughts on this @R-Studio and the other people that want this feature.
Do you agree with my current assessment? If not, please share your thoughts on how we should solve these issues.

@apryor6
Copy link
Contributor

apryor6 commented May 11, 2023

So in reality you would have to write an exact alert rule by hand and make it match with our CRD that probably would look really close to how the Grafana go API sdk looks like, which would be really hard.

Could you say more about what this means? My naive assumption (keeping in mind that although I'm a user of this project and grafana itself I'm not an authority on the dev SDKs) is that there exists an API to perform CRUD ops on these rules, as that API must be what the grafana web console is interacting with. Thus the CRD would be a yaml-definable mapping to that same API interface.

As to the statefulness of one of more attributes.. I could see value in having a field be specifyable as "only_on_create" or whatever a good name is to indicate that the value of said field in yaml is to be set on original creation and then no longer managed by the CRD. This would allow as-code creation of alerts and then subsequent management of, for example, silencing to be done via the web console.

FWIW in my use cases the primary value of the CRD is that I have as-code definitions of my grafana objects for better gitops. I want to roll out newly developed alerts and dashboards through each environment with low friction.

@NissesSenap
Copy link
Collaborator

So how it would look isn't an easy question to answer and It would take many hours to come up with a suggestion that would even start to being good enough to discuss.
I don't have that time, but if anyone else does, please go for it. I'm happy to discuss it.

But I created a small app to try to get a feeling for the API.

Looking at the SDK it's very focused on folders and groups which would make this even harder.
Apparently you add your alert to a folder.

My app will list all alerts that you have in a specific group called group and a folder called alerts.

package main

import (
	"fmt"
	"net/url"

	gapi "github.com/grafana/grafana-api-golang-client"
)

func main() {
	// Create gapi config
	config := gapi.Config{
		BasicAuth: url.UserPassword("root", "secret"),
	}

	// Create a new client
	client, err := gapi.New("http://localhost:3000", config)
	if err != nil {
		panic(err)
	}

	folderList, err := client.Folders()
	if err != nil {
		panic(err)
	}
	// Look through the list of folders and get the uid
	for _, folder := range folderList {
		fmt.Println(folder.UID)
		listRules, err := client.AlertRuleGroup(folder.UID, "group")
		if err != nil {
			panic(err)
		}
		fmt.Println(listRules)

	}
}

Basic grafana instance

apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  labels:
    dashboards: "grafana"
spec:
  config:
    log:
      mode: "console"
    auth:
      disable_login_form: "false"
    security:
      admin_user: root
      admin_password: secret

I port-forward to my instance and I created an alert manually in my grafana instance and then ran my script.
You will of course have to adapt your folder and group settings to match your alert definition.

It provides the following output.

{
    group rDzwzO8Vz 60 [
        {map[description:if this happens the world has ended
            ] A [
                0xc000180230
            ] Alerting rDzwzO8Vz 1 map[hello:world
            ] NoData 1 group test xu9Qkd8Vz 2023-05-12 06: 14: 35 +0000 UTC 5m 0s  false
        }
    ]
}

So if you want to give the CRD definition I would start reading the specs and playing around a bit with the SDK because I personally don't think the API documentation is great... there are no examples in the output.

https://grafana.com/docs/grafana/latest/developers/http_api/alerting_provisioning/
https://editor.swagger.io/?url=https://raw.githubusercontent.com/grafana/grafana/main/pkg/services/ngalert/api/tooling/post.json

@pbrissaud
Copy link

If you use prometheus-operator I suggest using the import alertmanager as an datasource feature in grafana to manage your alerts.

Totally agree. I'm also using with prom-operator and it works fine.

Grafana alerting is a good feature in the UI but I think that APIs are not mature enough to work with third party software (like the operator).

@siegenthalerroger
Copy link
Contributor

Somehow I missed this issue when I searched for relevant ones but here's my two cents.

  1. Given the scope of this, I believe it may be necessary to build this iteratively. Regardless of any specific design, supporting this will create more CRDs than the project has at all already. It's no small task but it's not necessary to have everything at once IMO.
  2. I believe the Grafana APIs that are necessary are stable enough to begin work on this as only the "provisioning" API is required for what the grafana-operator will do. The "unstable" part are the proxied AlertManager/Prometheus APIs that aren't about defining rules/policies/etc.
  3. It is necessary that one pays attention to the complimentary nature of Grafana's unified alerting that extends (and somewhat replaces parts of) the Prometheus/AlertManager stack. For the grafana-operator this means ensuring that the functionality is designed with the existing prometheus-operator in mind. Specifically, I think in a first step one can completely ignore Grafana's AlertRules and defering that to PrometheusRules.

I'd be willing to draw up a proposal/design for this, however I can't assist with the implementation in the near future sadly.

@NissesSenap
Copy link
Collaborator

@siegenthalerroger , creating a design proposal is a great start.
We have been trying to do it through PRs since it's a simple way of providing feedback.
You can view a design document that we have done earlier here: https://github.com/grafana-operator/grafana-operator/tree/master/docs/docs/proposals

How the document should look isn't set in stone, but if you take a look at, you will see what we are after.

Looking forward to seeing the design :)

@rammelmueller
Copy link

@siegenthalerroger I'd be willing to help out here - I agree, that an iterative approach is good (enough) here. As for implementation - I hope to be able to help out here, but time is also an issue sadly.

@NissesSenap
Copy link
Collaborator

@rammelmueller please create a design document as a first step. Looking forward to a PR around it

@siegenthalerroger
Copy link
Contributor

@rammelmueller I've created an initial proposal that's missing any/all of the specification for the new CRDs. Feel free to reach out if you want to assist.

theSuess added a commit that referenced this issue Dec 13, 2023
This picks up #911 and #1144. The proposal contains three different options for
realizing alerting support in the operator and should serve as a base for
discussion regarding this topic.
theSuess added a commit that referenced this issue Dec 13, 2023
This picks up #911 and #1144. The proposal contains three different options for
realizing alerting support in the operator and should serve as a base for
discussion regarding this topic.
theSuess added a commit that referenced this issue Dec 20, 2023
This picks up #911 and #1144. The proposal contains three different options for
realizing alerting support in the operator and should serve as a base for
discussion regarding this topic.
theSuess added a commit that referenced this issue Jan 9, 2024
This picks up #911 and #1144. The proposal contains three different options for
realizing alerting support in the operator and should serve as a base for
discussion regarding this topic.
@AlexEndris
Copy link

Hi. I just wanted to ask if there is already more than just the proposal, like a decision it's going to happen or not?

@hubeadmin
Copy link
Collaborator

@AlexEndris Hey! it is happening, @theSuess is planning on picking it up in the coming days I believe

@AlexEndris
Copy link

That's great. I'd be willing to support, if I can, although I have 0 go experience. But I surely could play around with test versions.

@NissesSenap
Copy link
Collaborator

We got an initial PR that can be viewed here: #1420

@NissesSenap
Copy link
Collaborator

We would love some feedback around #1420, if you can try it out and see that it's working as you expect that would be great.
We are currently going through the final reviews and hopefully we should merge the feature soon.

@R-Studio
Copy link
Author

R-Studio commented Mar 8, 2024

@NissesSenap I hope to find time to test it in the next few weeks.

@theSuess
Copy link
Member

Closing this issue as initial alerting support has been implemented in #1420.

Follow-up issues for Contact Points & Notification Policies are #1455 and #1454

@AlexEndris
Copy link

There is no issue for implementing GrafanaAlertRule I could track, yet. Right? Or was this already implemented by #1420?

@hubeadmin
Copy link
Collaborator

I believe alert rules are sub-resources of the GrafanaAlertGroup CR, see

@AlexEndris
Copy link

Oh, then I must have missed that completely. Thank you! And Thx @theSuess for the implementation. Came just in time when I needed it! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

10 participants