
Provide the ability to configure ruler/alertmanager through prometheus-operator k8s CRDs #2133

Closed
dhbrojas opened this issue Jun 17, 2022 · 27 comments · Fixed by grafana/agent#2604

Comments

@dhbrojas
Contributor

dhbrojas commented Jun 17, 2022

Is your feature request related to a problem? Please describe.

It's not uncommon for Prometheus configuration to be described as Custom Resource Definitions (CRDs) within Kubernetes. The Prometheus Operator defines multiple CRDs such as PrometheusRule and AlertmanagerConfig to configure the recording/alerting rules and the alertmanager configuration within Prometheus.
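For context, a minimal PrometheusRule resource looks roughly like this (names, expressions, and thresholds are purely illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighErrorRate
          expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: warning
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))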

Many people rely on those CRDs to configure their Prometheus instances in Kubernetes environments.

At the moment, I'm unable to configure Mimir using Prometheus Operator CRDs. I have to resort to configuring components manually using the HTTP API or mimirtool.

Describe the solution you'd like

I would like Mimir to automatically discover Prometheus Operator CRDs (PrometheusRule and AlertmanagerConfig) within my cluster and to apply the configuration they hold to the different components. Potentially through a Mimir Operator for Kubernetes.

I would like Grafana Mimir to distribute its own PrometheusRule resources, included in the helm chart as an option (much like the mimir-distributed chart has a serviceMonitor.enable option for ServiceMonitor CRDs), so that upon installing the helm release you automatically have Mimir rules and alerts set up without any additional configuration. This would also mean that the rules could be upgraded automatically without users having to manually copy the rules.yml and alerts.yml config files Mimir provides.

Describe alternatives you've considered

I've considered configuring recording/alerting rules manually using mimirtool but this approach fails to satisfy our requirements fully.

I've considered configuring recording/alerting rules using the local storage approach as described here, but this means I would have to copy the configuration stored inside each PrometheusRule resource and put it in a k8s ConfigMap, potentially missing out on future upgrades or changes to these configs (given they are sometimes created by third-party helm charts).
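For illustration, the local storage approach means maintaining something like the following by hand and mounting it into the ruler; this is a rough sketch, and the directory layout (one rules directory per tenant) and tenant name are assumptions based on the local storage docs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mimir-ruler-rules
  namespace: mimir
data:
  # rule groups copied by hand from each PrometheusRule resource,
  # mounted under the ruler's local rules directory for one tenant
  rules.yaml: |
    groups:
      - name: example.rules
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 1
            for: 10m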

Additional context

This issue is related to this slack thread.

@dhbrojas dhbrojas changed the title Provide the ability to configure Mimir through PrometheusRules k8s CRD Provide the ability to configure ruler through PrometheusRules k8s CRD Jun 17, 2022
@dhbrojas
Contributor Author

dhbrojas commented Jun 17, 2022

Dumping an idea here, this problem could potentially be solved by introducing a Mimir Operator (a Mimir equivalent of the Prometheus Operator for Prometheus) which would discover PrometheusRules/AlertmanagerConfig resources within the cluster and update the ruler/alertmanager component's configuration using the HTTP API. This has the advantage of keeping Kubernetes specific logic out of the Mimir core. Maybe this behaviour can be added to the Grafana Agent Operator directly?

Besides that, one problem is that PrometheusRule resources aren't bound to any tenant, whereas Mimir expects alerting/recording rules to be bound to a specific tenant. So if Mimir were to automatically discover PrometheusRule resources, it would have to know to which tenant a given set of rules belongs.
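For illustration, one way that binding could be expressed is a label on the PrometheusRule itself; the label key below is made up, not something any tool understands today:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: team-a-rules
  namespace: team-a
  labels:
    mimir.example.com/tenant: team-a   # hypothetical label binding these rules to a tenant
spec:
  groups: []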

@replay
Contributor

replay commented Jun 20, 2022

Dumping an idea here, this problem could be solved by introducing a mimir-operator (much like Prometheus Operator) which would discover PrometheusRules resources within the cluster and update the ruler component's configuration using the ruler HTTP API.

This has the advantage of keeping Kubernetes specific logic out of the Mimir core and it would also allow more resources to be discovered by Mimir in the future to auto-configure itself (e.g. AlertmanagerConfig CRD).

We generally want to keep Mimir agnostic to whether it's running on kubernetes or bare metal, so I don't think we should add kubernetes specific features into Mimir itself. Your suggestion of a mimir-operator to read the CRDs and apply them via Mimir's API sounds like a good idea to me.

@yevhen-harmonizehr

Any ETA on this one? The whole Mimir stack isn't really usable for us unless we can apply GitOps principles to it.

@dimitarvdimitrov
Contributor

There are two open PRs related to this: #2134 and #2609. There's no feedback from the author on the first one, but I think we can resurrect the second one.

@boniek83
Contributor

boniek83 commented Oct 5, 2022

It's not mentioned by the OP, but the AlertmanagerConfig CRD should be supported as well. I want to make use of community-made rules and alerts provided by various helm charts and not have to maintain them myself.
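For reference, a minimal AlertmanagerConfig resource (as defined by the Prometheus Operator) looks roughly like this; the receiver name and webhook URL are placeholders:

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-config
  namespace: monitoring
spec:
  route:
    receiver: team-pager
    groupBy: ['alertname']
  receivers:
    - name: team-pager
      webhookConfigs:
        - url: https://example.com/hook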

@dhbrojas dhbrojas changed the title Provide the ability to configure ruler through PrometheusRules k8s CRD Provide the ability to configure ruler/alertmanager through prometheus-operator k8s CRDs Oct 6, 2022
@dhbrojas
Contributor Author

dhbrojas commented Oct 6, 2022

@boniek83 you are right. I updated the issue title and contents to reflect that.

@dimitarvdimitrov
Contributor

I think the two (AM config and rules) are separate feature requests. @johannaratliff is working on adding rules support as part of #2609

@dhbrojas
Contributor Author

dhbrojas commented Oct 6, 2022

@dimitarvdimitrov I'm unsure what you mean by "adding rules support". Do you mean

  • Mimir will deliver its own PrometheusRule CRD

OR

  • Mimir will have a mechanism for discovering and configuring itself using the PrometheusRule resources found in a cluster.

I believe the scope of this issue is more focused on the latter. It's a request for a Prometheus Operator equivalent for Mimir rather than wanting Mimir to package its own rules config within a PrometheusRule CRD (even though it is stated in the issue and would be amazing). Hence I think AM config and rules are related.

If you think it's better to differentiate the two, I will happily update this issue and open a separate one for the AlertmanagerConfig CRD. Cheers!

@boniek83
Contributor

boniek83 commented Oct 6, 2022

For me it's definitely the latter. Since PrometheusRule is an existing concept that is commonly used throughout many helm charts (it's basically the standard way to package and deliver Prometheus rules), Mimir should not replace it with its own. Same with AlertmanagerConfig. You can have additional CRDs that add more functionality, but you have to support these IMO.

@dhbrojas
Contributor Author

dhbrojas commented Oct 6, 2022

Sorry @boniek83, my latest message was a bit unclear. When I said

Mimir will deliver its own PrometheusRule CRD

I meant its own instance of a PrometheusRule (check #2134), not its own definition of PrometheusRule. I don't think creating a new CRD is on the table. This issue focuses on making the Prometheus Operator CRDs work with Mimir.

@dimitarvdimitrov
Contributor

yeah, sorry I was conflating the two. I meant

Mimir will be packaged with PrometheusRules resources in the helm chart (similar to what is being done in #2134)

@pdf
Contributor

pdf commented Nov 19, 2022

Is there any work on making this a reality that can be tracked? AFAICT #2134 is just about deploying the mixins into the cluster via the Helm chart, but that still requires a Prometheus instance to do all the work, and seems largely unrelated to this issue of providing a mechanism to configure the Mimir Ruler/Alertmanager via CRDs.

The lack of ability to configure these Mimir components sensibly in the cluster makes it prohibitive to actually use them, resulting in an awkward architecture where pretty much everything has to be pushed through Prometheus instances to get rules applied, and making the Mimir components redundant at best.

@dhbrojas
Contributor Author

@pdf Is there any work on making this a reality that can be tracked?

As far as I'm aware, there is no work in progress on this yet. I recently scouted the set of open PRs in search of one. I raised the issue during a Mimir community call in June and the team indicated that it was something they had talked about, but that we may not see it implemented for a "couple releases" due to the complicated nature of this feature request (which is totally understandable).

It is also my understanding that, without strong Kubernetes integration, some components are unusable and dependence on Prometheus is inevitable.

@Logiraptor
Contributor

Logiraptor commented Nov 28, 2022

@pdf @rojas-diego @boniek83 @yevhen-harmonizehr

I think there are a couple ways this could work, and I'm curious what you think. This is not meant to be a commitment that we will build this feature, but I do think we can take a step closer by pinning down a design. Also, I refer often to rules below, but I mean for this to apply equally to rules and alertmanager configuration.

Option 1: An Operator

This is the most common request I've seen: to build a mimir operator that can automatically configure one or more Mimir instances based on CRDs. This is a large undertaking and feels a little strange considering Mimir is meant to be a centralized, multi-tenant time series database. Feel free to disagree on this point, just my personal opinion.

If instead the operator is only responsible for creating a ConfigMap, then it feels like overkill to build a new operator for this IMO. The prometheus operator already includes all the code for this, but instead of leaving a ConfigMap behind, it also deploys prometheus servers with the ConfigMap already mounted. I would prefer to leave that code in the prometheus operator if we can - taking a quick look, it already splits rules to avoid hitting the 1MB size limit, cleans up old resources, etc.

Option 2: A K8s backend for Ruler / Alertmanager storage

Basically, add a kubernetes backend to the ruler and alertmanager storage that will pull configuration directly from k8s CRDs instead of object storage. This would be a read-only implementation like the local backend, so you would need to place all rules in CRDs, and mimirtool would no longer be useful to upload rules. Without prototyping it, this seems simpler to me, and doesn't involve all the complexity of deploying Mimir itself.
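To make the idea concrete, Mimir's ruler configuration with such a backend might look something like this; note that this kubernetes backend is purely hypothetical and the option names are made up, only object storage and local backends exist today:

ruler_storage:
  backend: kubernetes            # hypothetical; not a real backend today
  kubernetes:
    # made-up option: label selector used to pick up PrometheusRule resources
    rule_label_selector: "mimir.example.com/tenant"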

Other considerations

With either approach it's clear from this issue that 100% API compatibility with the prometheus operator is a design goal, and there are two missing things as far as I can tell:

  1. How does the system know to which tenant a rule belongs?
    a. Maybe a configurable label defined on the PrometheusRule/etc resources?
    b. Maybe we define a new CRD like MimirTenant that includes label selectors similar to the existing Prometheus CRD? (example below)
  2. How do we want to handle authentication? Updating / modifying rules is often protected via a reverse proxy (or the enterprise gateway in the case of Grafana Enterprise Metrics). Allowing rules to be modified via the kubernetes API effectively broadens the scope of authorization. How do we handle this?
    a. Maybe we just leave it to the user and say it's up to them to secure their k8s RBAC?
    b. Maybe we require some authentication to be specified in the PrometheusRule CRD itself? (How can we do this without breaking compatibility?)
    c. Maybe we define a new CRD like MimirTenant that includes everything including authentication

⚠️ totally made up design, not something that works today ⚠️

apiVersion: grafana.com/v1alpha1
kind: MimirTenant
metadata:
  name: <name>
  namespace: <namespace>
spec:
  tenant: <tenant id>
  rules:
    matchLabels:
      <label>: <value>
  alertmanager:
    matchLabels:
      <label>: <value>

The idea is that this would tell Mimir which PrometheusRule resources to consider and against which tenant the rules should evaluate via kubernetes label selectors. In the future we could extend this MimirTenant to allow overriding tenant limits as well.

The idea would be that you configure RBAC so that Mimir operators can create MimirTenant resources, but Mimir users can only create PrometheusRule in certain namespaces / labels, etc. This way they can't mess with each other's configuration and multitenancy is respected.
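As a rough illustration of that split (the API group and resource name are assumptions based on the made-up MimirTenant CRD above):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mimir-tenant-admin        # granted only to Mimir operators
rules:
  - apiGroups: ["grafana.com"]
    resources: ["mimirtenants"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rule-editor               # granted to Mimir users, per namespace
  namespace: team-a
rules:
  - apiGroups: ["monitoring.coreos.com"]
    resources: ["prometheusrules"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]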

I'm curious what your thoughts are!

@pdf
Contributor

pdf commented Nov 29, 2022

@Logiraptor

Option 1: An Operator

When I was thinking about how I might solve this in the absence of an official solution, my thoughts ran along the lines that this might just end up being a small translation layer that splats out translated CRD content in mimirtool format and calls mimirtool to sync the rules/alerts into the cluster.

Option 2: A K8s backend for Ruler / Alertmanager storage

I'm (pleasantly) surprised to hear that you think this may be a smaller effort than an operator. Native integration as a backend sounds like a rather nice approach, and I think a single data source is a fair requirement - certainly all of our usage would preferably be deployed as k8s resources.

I think most people request/suggest an operator simply because that's the most common method for solving these sorts of tasks, not necessarily because it's optimal.

Other considerations

1.a. The Prometheus operator uses a similar mechanism to restrict the search space for a particular Prometheus instance. Probably want the option to restrict lookups by namespace in addition to labels here though.
1.b. Sounds sensible to me 👍

2. I'm not familiar with how backends are implemented, but if Option 2 above is selected, authentication considerations largely disappear since the backend would just communicate with the k8s API and update configuration in-process, right? For the operator option, I think secretRefs in the MimirTenant CRD would likely be the way to go.

For our use-case, we're not super-interested in multi-tenant. That said, restricting the creation of MimirTenant resources via RBAC seems fine, but I don't think there's any way to stop the creation of a PrometheusRule with labels that would attach it to an arbitrary tenant without an admission controller, which is why I'd suggest the option of restricting lookups to particular namespaces for a tenant. I suspect namespace segregation is likely to be adequate for the majority of multi-tenant deployments, but that's hard for me to judge. If users want more fine-grained policies, they'll need to deploy some sort of policy agent.
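Extending the made-up MimirTenant sketch above, namespace restriction could look something like this (again, purely illustrative and not something that works today):

apiVersion: grafana.com/v1alpha1
kind: MimirTenant
metadata:
  name: team-a
spec:
  tenant: team-a
  rules:
    namespaceSelector:
      matchNames: ["team-a"]     # only consider PrometheusRules from this namespace
    matchLabels:
      team: a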

@Logiraptor
Contributor

Thanks @pdf!

Another idea that came up internally is this: allow the Grafana Agent to read those CRDs and configure Mimir via the existing Mimir Ruler API.

@Logiraptor
Contributor

It's hackathon week here at Grafana Labs, so I'm looking into this a bit more. Specifically, I'm exploring the possibility of making the Grafana Agent capable of synchronizing the CRDs with Mimir's Ruler. One benefit of this is it could work for users who don't control the k8s cluster where Mimir is running, for example Grafana Cloud customers who want to configure their cloud ruler via CRDs on their "home" cluster.

Here’s an interesting design issue:

Mimir really has 3 levels of organization for rules: tenant, rule namespace, rule group. The prometheus operator has 3 as well: k8s namespace, PrometheusRule crd, rule group. I don't think it's appropriate to just map these 1:1 (other than rule group). So I'm trying to find a bit more flexibility to support more use cases.

The plan so far has been:

  • Mimir tenant is static in the configuration. Multiple tenants can be supported by duplicating the component in the agent config. This one feels likely to stay static since most community members and Grafana Cloud customers use it this way.
  • Mimir rule namespace is static in the configuration. Multiple rule namespaces can be supported by duplicating the component in the agent config. This one I'm not sure about, more on that below.
  • The config takes label selectors for both namespace and rules, exactly like the Prometheus crd from prometheus-operator

The issue with a static rule namespace:

  • It's possible to define the same rule group in multiple PrometheusRule crds. How should the agent go about merging this?

The issue with mapping rule namespace to PrometheusRule resources:

  • This would mean that all rules have to be defined in CRDs.
  • One of the use cases I'm trying to preserve is sourcing rules from both Grafana Cloud integrations (which apply rules on the backend via Mimir's API), and the CRDs.
  • That use case would be broken if the agent is going to delete anything it doesn't recognize.

Looking into the prometheus operator, it does a few things:

  • It creates a separate rule file for each PrometheusRule object found
  • It can optionally add a label pointing back to the namespace where the PrometheusRule was found
  • The rule files are named like <namespace>-<name>-<uid>

One option is to use the same mapping as the prometheus operator, and create Mimir rule namespaces named <namespace>-<name>-<uid> or even just <namespace>-<name> since that should be unique. In this case I would want to find some way to distinguish operator-created namespaces from other namespaces so we don't accidentally delete something we shouldn't.

@pdf
Contributor

pdf commented Dec 6, 2022

If supporting Grafana Cloud is a goal (and it probably should be, but I didn't want to muddy this issue with that possibility in earlier discussions) then the Agent is probably the right place for this.

If the uid is not included in the mapping, are we certain to do the right thing if the PrometheusRule is deleted and recreated? Might also aid any debugging efforts to have the uid included so that there's a hard match between resource and rule.

As far as distinguishing between operator-created namespaces and not, could some sort of hash or static value be appended/prepended to the ns perhaps?

@Logiraptor
Contributor

Logiraptor commented Dec 9, 2022

Hey everyone, here's an update on my hackathon project: I managed to get a lot of this working for PrometheusRule CRDs. No Alertmanager support yet to keep scope down for the 1 week duration, but I still think it's possible to do.

I've submitted a PR for the agent here: grafana/agent#2604

I've spoken with @rfratto on the agent squad and we'll both be out for the holidays for a few weeks, but we do plan to continue work on that issue once we're available again in January. Follow that PR for updates 😄

@Logiraptor
Contributor

Hey everyone, ~final update here: we've added support to the Grafana Agent for configuring Mimir's ruler via PrometheusRule CRDs.

  • No support yet for the AlertmanagerConfig CRDs, but only because this was a hackathon project and I need to get back to other work on Mimir. I'm sure the agent team would be open to reviewing a PR to add that if anyone feels inclined to help 😉
  • The PrometheusRule support will be released in the next version of the agent, and you can preview the unreleased docs here: https://grafana.com/docs/agent/next/flow/reference/components/mimir.rules.kubernetes/.
  • I would like to get this added to the Mimir helm chart so it all just works out of the box. There are still some questions about the best way to deploy the agent when using Flow, so this is left out for now.

@rafilkmp3
Contributor

rafilkmp3 commented May 18, 2023

@Logiraptor can you share a full example of a Grafana Agent CRD, to be used by the grafana-agent-operator, with this component enabled?

@zakariais

@Logiraptor Is this added to the Mimir helm chart? I checked the new helm chart and cannot see any option for mimir.rules in the Grafana Agent.

@Rohlik

Rohlik commented Aug 29, 2023

I'm also unable to set/enable this new feature via the Mimir helm chart 😞.

@pdf
Contributor

pdf commented Sep 6, 2023

@zakariais @Rohlik this is implemented as a component for Grafana Agent (Flow mode): configure the mimir.rules.kubernetes component as part of deploying the agent via the grafana-agent helm chart, not as part of deploying Mimir.
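For anyone landing here, a rough sketch of what that could look like in the grafana-agent helm chart values; the values paths are from memory and may differ between chart versions, and the Mimir address and tenant are placeholders:

agent:
  mode: flow
  configMap:
    create: true
    content: |
      // discover PrometheusRule resources and sync them to Mimir's ruler
      mimir.rules.kubernetes "default" {
        address   = "http://mimir-nginx.mimir.svc:80"
        tenant_id = "anonymous"
      }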

@MXfive

MXfive commented Jun 27, 2024

@Logiraptor Given that Grafana Agent has already been deprecated and given an EOL date for next year, plus the fact that its replacement is simply an OTel Collector distribution which does not (and is unlikely to) support this feature, I think it's worth reopening this?

@pdf
Contributor

pdf commented Jun 28, 2024

@MXfive that doesn't appear to be accurate - see the Alloy (the replacement you reference) components, where the following component appears to include the same functionality as the Grafana Agent component that handled this previously:

https://grafana.com/docs/alloy/latest/reference/components/mimir.rules.kubernetes/

@MXfive

MXfive commented Jun 28, 2024

@MXfive that doesn't appear to be accurate - see the Alloy (the replacement you reference) components, where the following component appears to include the same functionality as the Grafana Agent component that handled this previously:

https://grafana.com/docs/alloy/latest/reference/components/mimir.rules.kubernetes/

Oh nice, I wasn't able to find that yesterday. Thanks!
