
feat: add missing cluster label to mixins #12870

Merged: 11 commits into grafana:main on Sep 19, 2024

Conversation

@QuentinBisson (Contributor)

What this PR does / why we need it:

This PR enhances the Loki alert mixins by adding the cluster label to every alert summary.
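For illustration, the change amounts to making the cluster visible in the alert annotations, along the lines of this sketch (representative of the intent, not the exact diff):

```yaml
# Before: the summary cannot tell you which cluster is affected.
annotations:
  summary: '{{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% errors.'

# After: the cluster label is rendered into the summary.
annotations:
  summary: '{{ $labels.cluster }} {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% errors.'
```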

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@QuentinBisson requested a review from a team as a code owner on May 3, 2024 08:48
@QuentinBisson changed the title from "enhance: add missing cluster label to mixins" to "feat: add missing cluster label to mixins" on May 3, 2024
Signed-off-by: QuentinBisson <quentin@giantswarm.io>
@QuentinBisson force-pushed the loki-mixins-add-cluster-label branch from a866c00 to 07ac356 on May 3, 2024 12:24
@cstyan (Contributor) left a comment:

At the moment, some of these won't work as you'd like them to, because the queries in each alert's expr are not keeping the cluster label.
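Concretely: if an expr aggregates with a `by (...)` clause that omits `cluster`, the label is dropped from the result vector, so `{{ $labels.cluster }}` renders empty. A minimal sketch (the metric name is taken from the pint output below; the exact exprs in this PR may differ):

```yaml
# Drops cluster: the annotation template has nothing to render.
expr: sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (namespace, job, route)

# Keeps cluster: the label survives aggregation and reaches the template.
expr: sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])) by (cluster, namespace, job, route)
```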

@QuentinBisson (Contributor, Author)

Oh you're right :(
Btw, I've just discovered a tool called pint that could detect such issues using CI validation. Do you think it would make sense to add it to the Mimir-generated rules?

@cstyan (Contributor) commented May 3, 2024:

> Oh you're right :( Btw, I've just discovered a tool called pint that could detect such issues using CI validation. Do you think it would make sense to add it to the Mimir-generated rules?

It's worth looking into, happy to review a PR if you come up with something that works 👍

@QuentinBisson (Contributor, Author)

Here is the output of pint without colors :(

I think it would be best to add it to the loki-build-image, but I'm not sure whether you'd like contributions on that repo.

./pint-linux-amd64 --offline lint --min-severity=information /home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/*.yaml
level=INFO msg="Finding all rules to check" paths=["/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml","/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/rules.yaml"]
level=INFO msg="Offline mode, skipping Prometheus discovery"
level=ERROR msg="Execution completed with error(s)" err="invalid --min-severity value: unknown severity: information"
> ./pint-linux-amd64 --offline lint --min-severity=info /home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/*.yaml
level=INFO msg="Finding all rules to check" paths=["/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml","/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/rules.yaml"]
level=INFO msg="Offline mode, skipping Prometheus discovery"
/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Bug: Template is using `cluster` label but the query removes it. (alerts/template)
 6 |             description: |
 7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Information: Using the value of `rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])` inside this annotation might be hard to read, consider using one of humanize template functions to make it more human friendly. (alerts/template)
 6 |             description: |
 7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

/home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:19-20 Bug: Template is using `cluster` label but the query removes it. (alerts/template)
 19 |             description: |
 20 |                 {{ $labels.cluster }} {{ $labels.job }} is experiencing {{ printf "%.2f" $value }}% increase of panics.

level=INFO msg="Problems found" Bug=2 Information=1
level=ERROR msg="Execution completed with error(s)" err="found 1 problem(s) with severity Bug or higher"

@@ -52,7 +52,7 @@ groups:
        message: |
          {{`{{`}} $labels.cluster {{`}}`}} {{`{{`}} $labels.namespace {{`}}`}} has had {{`{{`}} printf "%.0f" $value {{`}}`}} compactors running for more than 5m. Only one compactor should run at a time.
      expr: |
        sum(loki_boltdb_shipper_compactor_running) by (namespace, cluster) > 1
@QuentinBisson (Contributor, Author) commented on the diff:

Making sure they are all in the same order

Signed-off-by: QuentinBisson <quentin@giantswarm.io>
@QuentinBisson (Contributor, Author)

@cstyan it should be fine now; pint is happy apart from the humanize thing.

@cstyan (Contributor) commented May 6, 2024:

> I think it would be best to add it to the loki-build-image, but I'm not sure whether you'd like contributions on that repo.

Agreed, it would be best to have it nicely runnable in a reproducible way. For now, can you show me how the invocation of pint works? I can set up a CI job via GitHub Actions, at least after your PR is merged.
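A minimal sketch of such a job, reusing the pint invocation shown above (the workflow name and the install step are assumptions, not the job that was eventually added):

```yaml
name: lint-mixins
on: [pull_request]
jobs:
  pint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumption: install pint from source; a pinned release binary or the
      # loki-build-image would be the more reproducible choice.
      - name: Install pint
        run: |
          go install github.com/cloudflare/pint/cmd/pint@latest
          echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH"
      - name: Lint compiled mixins
        run: |
          pint --offline lint --min-severity=info \
            production/loki-mixin-compiled-ssd/*.yaml
```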

@cstyan (Contributor) left a comment:

LGTM; same as usual, I assume you've tried deploying the jsonnet version already to your own environment?

> pint is happy apart from the humanize thing

not sure what you mean here?

@QuentinBisson (Contributor, Author)

I meant this section:

> /home/quentin/Documents/code/QuentinBisson/loki-upstream/production/loki-mixin-compiled-ssd/alerts.yaml:6-7 Information: Using the value of `rate(loki_request_duration_seconds_count{status_code=~"5.."}[2m])` inside this annotation might be hard to read, consider using one of humanize template functions to make it more human friendly. (alerts/template)
>  6 |             description: |
>  7 |                 {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}% errors.

But I did not really play with the humanize functions :)
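(For reference, Prometheus alert templates ship humanize helpers; since `$value` here is already a percentage, a sketch of what pint is nudging toward might look like the following, not what was merged:)

```yaml
description: |
  {{ $labels.cluster }} {{ $labels.job }} {{ $labels.route }} is experiencing {{ $value | humanize }}% errors.
```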

I ran pint from the CLI for this one, but I think you could use pint ci directly:
https://cloudflare.github.io/pint/#pull-requests

There are a lot of things that could be added (like enforcing that some annotations or labels are always set) but an initial version would be good.

I will deploy it later today or tomorrow; I did not have time to fully test my changes :( I'll ping you when I'm done, and thank you for the review.

@cstyan (Contributor) commented Sep 17, 2024:

@vlad-diachenko any idea why the helm diff action is failing here? I believe that's a new one you wrote?

@QuentinBisson (Contributor, Author)

From what I see, it's using the wrong remote: the log line mentions origin, but since my PR comes from a fork, that is probably incorrect:

/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +refs/heads/loki-mixins-add-cluster-label*:refs/remotes/origin/loki-mixins-add-cluster-label* +refs/tags/loki-mixins-add-cluster-label*:refs/tags/loki-mixins-add-cluster-label*

@cstyan (Contributor) commented Sep 18, 2024:

> From what I see, it's using the wrong remote: the log line mentions origin, but since my PR comes from a fork, that is probably incorrect:
>
> /usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +refs/heads/loki-mixins-add-cluster-label*:refs/remotes/origin/loki-mixins-add-cluster-label* +refs/tags/loki-mixins-add-cluster-label*:refs/tags/loki-mixins-add-cluster-label*

Yep, this seems like the issue; Vlad is looking into it on our end.
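For context, the usual fix for this class of failure is to point the checkout step at the contributor's fork explicitly (a sketch assuming the job uses actions/checkout; not necessarily the fix that was applied here):

```yaml
- uses: actions/checkout@v4
  with:
    # For pull_request events, fetch the PR head from the fork rather than
    # looking for a branch of the same name on origin.
    repository: ${{ github.event.pull_request.head.repo.full_name }}
    ref: ${{ github.event.pull_request.head.ref }}
```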

@QuentinBisson (Contributor, Author)

Awesome, thanks :)

@vlad-diachenko (Contributor)

Hey folks, I hope this PR fixes issue #14173.

@vlad-diachenko (Contributor)

Hey @QuentinBisson, @cstyan. I have fixed the checkout step that was failing at the beginning, but now another step has started to fail:

[screenshot of the failing workflow step]

I will check it tomorrow morning, fix it, and merge the PR ;)

@QuentinBisson (Contributor, Author)

Thanks for the work @vlad-diachenko :)

@cstyan (Contributor) commented Sep 19, 2024:

Vlad mentioned to me that he couldn't get everything working correctly for PRs from forks, so he changed the workflow to not run on PRs from forks for now.
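In GitHub Actions terms, that typically means a job-level guard like the following (illustrative; the job name is hypothetical):

```yaml
jobs:
  helm-diff:
    # Skip PRs opened from forks; secrets and push access differ there.
    if: ${{ !github.event.pull_request.head.repo.fork }}
    runs-on: ubuntu-latest
```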

@cstyan merged commit 547ca70 into grafana:main on Sep 19, 2024
65 checks passed