feat: mixin / add loki compaction not successful alert #14239
Conversation
expr: |||
  # The "last successful run" metric is updated even if the compactor owns no tenants,
  # so this alert correctly doesn't fire if compactor has nothing to do.
  (time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 60 * 60 * 24)
(time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 60 * 60 * 24)
(time() - loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds > 60 * 60 * 24)
It might be better to use the compaction metric instead of the last successful retention run. The metric name is misleading here: it refers to boltdb, but it is updated for TSDB indexes as well.
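For illustration, an alert built on the compaction metric suggested here might be sketched as a Prometheus rule like the one below. Everything except the metric name is an assumption: the alert name, `for:` duration, severity, and message are placeholders, not the PR's final values.

```yaml
# Sketch only: fires when table compaction has not succeeded in 24h.
- alert: LokiCompactionNotSuccessful        # hypothetical alert name
  expr: |
    time() - loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds
      > 60 * 60 * 24
  for: 15m                                  # assumed settle time
  labels:
    severity: warning                       # assumed severity
  annotations:
    message: Loki compaction has not run successfully in the last 24h.
```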
Thanks, I might give metric renaming a go if I have time.
Force-pushed from 9b0cb43 to a13601c (compare)
time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds{} > 0)
)
by (%s, namespace)
> 60 * 60 * 24
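Rendered with concrete values, the fragment above would read roughly as follows. This is a sketch: it assumes the aggregation opened above the quoted lines is `max`, and that the mixin's `%s` placeholder is templated to the `cluster` label; the actual diff may differ.

```promql
# Sketch: `max` and `cluster` are assumptions, not the PR's final values.
max(
  time() - (loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds{} > 0)
)
by (cluster, namespace)
> 60 * 60 * 24
```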
nit: it might be worth using a shorter interval for both alerts, like 3h? There could be query performance degradation if there are a lot of uncompacted indexes.
That makes sense yes :)
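If it is the threshold itself that gets shortened, the suggestion amounts to swapping the 24h window for a 3h one, as sketched below; the final diff may instead (or also) shorten the rule's `for:` period.

```promql
# Sketch: 3h instead of 24h since the last successful compaction run.
time() - loki_boltdb_shipper_compact_tables_operation_last_successful_run_timestamp_seconds{} > 60 * 60 * 3
```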
LGTM, thank you! It would be ideal if we could reduce the alert period as well :)
Done :)
I think the failing checks are unrelated to this PR :D
What this PR does / why we need it:
We recently had issues with Loki's compactor: we found out it had actually never once run successfully, so we added an alert, and I thought it was a good idea to add it here as well.
This alert was inspired by https://github.com/grafana/mimir/blob/41a6c6d0521f7994a0c7e031c7f35bafd45aa75c/operations/mimir-mixin-compiled/alerts.yaml#L876
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Checklist
- Reviewed the CONTRIBUTING.md guide (required)
- feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
- For Helm chart changes, bump the version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
- If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR