
loki-mixin: Complex configuration needed compared to other mixins #4838

Closed
DaveOHenry opened this issue Nov 29, 2021 · 20 comments · Fixed by #6189
Comments

@DaveOHenry
Contributor

DaveOHenry commented Nov 29, 2021

We are using Helm and there are currently quite a few things that have to be tweaked to make the dashboards work:

  • set showMultiCluster:: false for loki-logs.json and loki-operational.json
  • override clusterLabel:: and matchers:: for every dashboard that uses these fields
  • override the dashboard variables (especially cluster and namespace)
  • override jobMatcher(job):: and namespaceMatcher():: functions (currently used by loki-deletion.json, loki-reads-resources.json, loki-retention.json and loki-writes-resources.json)

To summarize my findings:

  1. The main issues are the hardcoded "cluster" label and job name "($namespace)/$job"
  2. I haven't found a way to make loki-reads-resources.json and loki-writes-resources.json work other than overriding the complete dashboard definition, but this would kind of defeat the purpose of a mixin, wouldn't it?
  3. There are some issues with the CPU and Memory panels, but I think this has to do with the deployment mode we are using (loki-distributed Helm chart): it looks like all of our containers are simply named "loki". See also helm-charts#933 ([loki-distributed] Rename containers to fix some dashboard panels from loki-mixin).
  4. I don't know what this cortex-gw is. I had never heard of it before and have ignored it for now.

I'm not using the loki-deletes.json and loki-mixin-recording-rules.json dashboards currently, but the others are mostly working and are a great starting point to explore various aspects of Loki.
My jsonnet experience is pretty basic, but I still hope that these config.libsonnet overrides help some people get started more quickly and maybe save some headaches:

local lokimixin = import './vendor/github.com/grafana/loki/production/loki-mixin/mixin.libsonnet';
local utils = import 'mixin-utils/utils.libsonnet';

lokimixin
{
  // override functions from dashboard-utils.libsonnet
  jobMatcher(job)::
    'monitoring_stack=~"$cluster", job=~"%s"' % job,
  namespaceMatcher()::
    'monitoring_stack=~"$cluster", namespace=~"$namespace"',
  _config+:: {
    // Tags for dashboards.
    tags: ['loki'],

    // The label used to differentiate between different application instances (i.e. 'pod' in a kubernetes install).
    per_instance_label: 'pod',

    // The label used to differentiate between different nodes (i.e. servers).
    per_node_label: 'instance',
  },
  grafanaDashboards+: {
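    // Each dashboard below gets the same treatment: hidden fields such as
    // clusterLabel/matchers are overridden, and the template variable queries
    // are rewritten to use the custom cluster label instead of "cluster".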
    'loki-chunks.json'+:{
      clusterLabel:: 'monitoring_stack',
      matchers:: {
        ingester: [utils.selector.re('job', 'loki-ingester')],
      },
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-deletion.json'+:{
      clusterLabel:: 'monitoring_stack',
      matchers:: {
        ingester: [utils.selector.re('job', 'loki-ingester')],
      },
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-logs.json'+:{
      showMultiCluster:: false,
      clusterLabel:: 'monitoring_stack',
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
              if item.name == 'deployment' then
                item+ {
                  query: 'label_values(kube_deployment_created{monitoring_stack="$cluster", namespace="$namespace"}, deployment)'
                }
              else
              if item.name == 'pod' then
                item+ {
                  query: 'label_values(kube_pod_container_info{monitoring_stack="$cluster", namespace="$namespace", pod=~"$deployment.*"}, pod)'
                }
              else
              if item.name == 'container' then
                item+ {
                  query: 'label_values(kube_pod_container_info{monitoring_stack="$cluster", namespace="$namespace", pod=~"$pod", pod=~"$deployment.*"}, container)'
                }
              else
              // preselect the correct data source
              if item.name == 'logs' then
                item+ {
                  regex: '.*loki-prd'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-operational.json'+:{
      clusterLabel:: 'monitoring_stack',
      showMultiCluster:: false,
      matchers:: {
        cortexgateway: [utils.selector.re('job', 'cortex-gw')],
        distributor: [utils.selector.re('job', 'loki-distributor')],
        ingester: [utils.selector.re('job', 'loki-ingester')],
        querier: [utils.selector.re('job', 'loki-querier')],
      },
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-reads-resources.json'+:{
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-reads.json'+:{
      clusterLabel:: 'monitoring_stack',
      matchers:: {
        cortexgateway: [utils.selector.re('job', 'cortex-gw')],
        queryFrontend: [utils.selector.re('job', 'loki-query-frontend')],
        querier: [utils.selector.re('job', 'loki-querier')],
        ingester: [utils.selector.re('job', 'loki-ingester')],
        querierOrIndexGateway: [utils.selector.re('job', 'loki-(querier|index-gateway)')],
      },
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-retention.json'+:{
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-writes-resources.json'+:{
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
    'loki-writes.json'+:{
      clusterLabel:: 'monitoring_stack',
      matchers:: {
        cortexgateway: [utils.selector.re('job', 'cortex-gw')],
        distributor: [utils.selector.re('job', 'loki-distributor')],
        ingester: [utils.selector.re('job', 'loki-ingester')],
      },
      templating+: {
        list:
          std.map(
            function(item)
              if item.name == 'cluster' then
                item+ {
                  query: 'label_values(loki_build_info, monitoring_stack)'
                }
              else
              if item.name == 'namespace' then
                item+ {
                  query: 'label_values(loki_build_info{monitoring_stack=~"$cluster"}, namespace)'
                }
              else
                item,
            super.list
          ),
      },
    },
  }
}
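
For reference, one way to render the overridden dashboards to individual JSON files from the snippet above (an untested sketch; it assumes the snippet is saved as config.libsonnet and writes one file per dashboard key):

mkdir -p dashboards_out
jsonnet -J vendor -m dashboards_out -e '(import "config.libsonnet").grafanaDashboards'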

Originally posted by @DaveOHenry in #1978 (comment)

@Ottovsky

Hey, thanks a lot for this. I have hit the same problem with the loki-mixin not being suitable for loki-distributed out of the box.

@stale

stale bot commented Mar 2, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@DaveOHenry
Contributor Author

DaveOHenry commented Mar 2, 2022 via email

@mordp1

mordp1 commented Mar 17, 2022

I had to override almost the complete dashboard definitions to make them work here. I needed to review and adapt some dashboards; from my perspective this loki-mixin is not very clear.
Let me share a few things I found along the way:

  1. "($namespace)/$job" = i used e.g. job=~".+-ingester.+"
  2. Me too
  3. Yeap, all containers="loki", i changed dashboards to looking for pod (e.g. pod=~".+-ingester.+")
  4. cortex-gw I suppose it's ingress. I use here nginx to do this, so i just delete the lines where i see job=~"($namespace)/cortex-gw", some dashboards looking for "loki_request_duration_seconds_count".
    the dashboard "Memory (go heap in use)" for gateway, I changed to collect nginx_ingress_controller_requests so the name now it is "Ingress Request"

Extra:

  • In some dashboards I saw [$__rate_interval], which I had to change to [1m] or [5m] to get them working
  • Almost all dashboards use job_route:loki_request_duration_seconds_bucket:sum_rate. I do not know why, but removing the prefix (job_route:) and the suffix (:sum_rate) works here (e.g. just use loki_request_duration_seconds_bucket); see the sketch after this list
  • I used Prometheus to understand the metrics and what other labels we have, and adjusted the queries accordingly
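
For example, roughly (a sketch with a placeholder job selector, not the exact dashboard query):

# recorded metric used by the dashboards (only exists if the recording rules are loaded)
histogram_quantile(0.99, sum by (le) (job_route:loki_request_duration_seconds_bucket:sum_rate{job=~".+-ingester.+"}))

# approximate raw-metric equivalent without the recording rules
histogram_quantile(0.99, sum by (le) (rate(loki_request_duration_seconds_bucket{job=~".+-ingester.+"}[5m])))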

@pgassmann
Contributor

The job_route:loki_request_duration_seconds_bucket:sum_rate metrics are generated by the recording rules. These rules can also be generated from this mixin:

jsonnet -J vendor -S -e 'std.manifestYamlDoc((import "loki-mixin/recording_rules.libsonnet").prometheusRules)' > recording_rules.yml
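
The generated file should contain rules roughly of this shape (an abbreviated sketch, not the exact output; group name and rate window may differ):

groups:
  - name: loki_rules
    rules:
      - record: job_route:loki_request_duration_seconds_bucket:sum_rate
        expr: sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, job, route)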

@stale

stale bot commented Apr 27, 2022

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

@DaveOHenry
Contributor Author

Dear stale bot. Please keep this issue open.

@irizzant
Contributor

@DaveOHenry I'm hitting the same issue with the mixins.

Why not file a PR with the suggested jsonnet updates?

@DaveOHenry
Contributor Author

DaveOHenry commented May 17, 2022

I guess it's not as easy as it sounds, because there are many different functions and techniques used in the various dashboards.
My overrides are just another layer of configuration that works (kind of) in my environment, so it is not possible to just pack them into a PR.
I mainly posted the jsonnet code to
a) hopefully save people some headaches,
b) raise awareness of the current state of the loki-mixin, and
c) show the main problems that many people will certainly encounter.

I literally spent hours trying to figure out how the mixin is supposed to work, but I still don't know jsonnet well enough to implement the needed changes.
Additionally, there is no consensus about how the loki-mixin should handle the configuration (or whether it should be configurable at all ...).

@irizzant
Contributor

It took me a lot of effort to make it work in jsonnet too, but a PR is a good place to discuss the changes and get these overrides to a point where they could be useful for everyone.

@DaveOHenry
Contributor Author

DaveOHenry commented May 18, 2022 via email

@irizzant
Contributor

Please have a look at #6189; it's the same setup I'm currently using, and the dashboards are working fine.
Feel free to add anything that could be missing there.

@DaveOHenry
Contributor Author

Looks like the linked PR #6189 does not fix anything related to this issue. There is still a lot of tweaking needed to make the mixin work, and there is no hint for the user on how to configure these kinds of things.
#6266 basically describes the same problem and can be considered a duplicate of this issue.

However, some improvements were made in other PRs, namely:
#6383
#5535
#5536

... but the job name in many queries is still hardcoded to something like "($namespace)/$job" or "$namespace/compactor" and I also saw a hardcoded "cluster" label somewhere. I'll try to take a closer look if time permits.

@irizzant
Contributor

irizzant commented Jun 29, 2022

The linked PR #6189 implies that:

  • you don't need any of the jsonnet overrides you reported for the selectors, just add the create_service_monitor parameter to the configuration and the generated ServiceMonitor will do the job
  • Prometheus Rules / Alerts and Grafana dashboards can be automatically generated by the snippet reported in the PR itself.

so it actually does address the problems reported in this issue

@DaveOHenry
Contributor Author

How can a ServiceMonitor object possibly configure the selectors in the loki-mixin for me? I also don't use jsonnet to deploy Loki. So it actually does not fix anything in my case 😢

@irizzant
Contributor

The ServiceMonitor object can very well make the existing dashboards' selectors work out of the box without adding any of the jsonnet you reported above, and indeed it's already doing this job in our current setup.

Of course it involves using Jsonnet (which is, by the way, the recommended installation method), but the very same principle can be applied to a ServiceMonitor generated by a Helm chart.
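
The gist is a relabeling in the scrape configuration that rebuilds the job label as <namespace>/<component>, which is the shape the mixin queries expect. Roughly like this (a sketch with placeholder names and port, not the exact manifest from the PR):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki   # placeholder selector
  endpoints:
    - port: http-metrics             # placeholder port name
      relabelings:
        # rewrite job to "<namespace>/<container>" so dashboard selectors
        # like job=~"($namespace)/ingester" match
        - sourceLabels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_container_name]
          separator: /
          targetLabel: job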

@DaveOHenry
Contributor Author

DaveOHenry commented Jun 29, 2022 via email

@irizzant
Contributor

irizzant commented Jun 30, 2022

> Just take a look at the thanos or kubernetes mixins. They are easily configurable. It takes a few minutes to set the correct selectors and everything is just working. These mixins do not care if you use jsonnet, Helm or any other method for deploying your infrastructure.

Actually all mixins (thanos included) make use of jsonnet only and they do not provide any valid PrometheusRule or ConfigMap directly applicable to a k8s cluster.
This is why they require tools to transform their content into valid k8s resources, like the compiled Loki manifests reported here.
Depending on what deployment method you use (Helm, Jsonnet) you have to take the appropriate actions to get the k8s resources you need.

Given the compiled manifests, it shouldn't be a problem to include them in Helm charts and also add a ServiceMonitor which computes the right labeling to make Grafana dashboards work as expected.
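
For example (just a sketch; names, file paths, and labels are placeholders), the rendered dashboard JSON files could be shipped in a chart as a ConfigMap that a Grafana dashboard sidecar picks up:

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-mixin-dashboards
  labels:
    grafana_dashboard: "1"   # label watched by the Grafana sidecar, if you use one
data:
  loki-chunks.json: |-
{{ .Files.Get "dashboards/loki-chunks.json" | indent 4 }}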

I'll see if I have some time to work on this too and file a PR.

@DaveOHenry
Contributor Author

DaveOHenry commented Jun 30, 2022

Most people do not care how these mixins are written. They just need a config file, install a tool and execute 1-2 commands to generate the manifests. See also https://monitoring.mixins.dev

I agree that the ServiceMonitor route works for some cases, but most people have a defined set of labels for all of their applications and they certainly do not want to change that just for this single mixin.
Also keep in mind that there are also users that use a vanilla Prometheus without any operator.

@irizzant
Contributor

> Most people do not care how these mixins are written. They just need a config file, install a tool and execute 1-2 commands to generate the manifests.

I agree, but this depends on the way you use the mixins: if you use jsonnet you can import and use it directly; if you use Helm you first have to generate the manifests.

> I agree that the ServiceMonitor route works for some cases, but most people have a defined set of labels for all of their applications and they certainly do not want to change that just for this single mixin.

I can't see how the ServiceMonitor labeling done in the PR could impact the existing set of labels for their applications at all.
In any case, the labeling is customizable.

> Also keep in mind that there are also users that use a vanilla Prometheus without any operator.

Then they will need custom tools to inject rules and alerts into the Prometheus YAML configuration anyway.
Either way, as you can see, the way you use mixins actually depends on the way you deployed Prometheus and Loki itself.

I totally agree that mixins should be easy to use and, compared to the way it was before, they now are for jsonnet users.
As I wrote, whenever I find time, I'll try to file a PR to help Helm users.
