-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Clickhouse #412
Added Clickhouse #412
Conversation
severity: critical | ||
- name: ClickHouse No Available Replicas | ||
description: No available replicas in ClickHouse. | ||
query: "ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would add a for: 3m
as well
a short unavailability (restart?) seems ok for me
same thing in the query below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this could help avoid false positives as well
_data/rules.yml
Outdated
severity: critical | ||
- name: ClickHouse High Network Traffic | ||
description: Network traffic is unusually high, may affect cluster performance. | ||
query: "ClickHouseMetrics_NetworkSend > 1000 or ClickHouseMetrics_NetworkReceive > 1000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this default value, seems very high to me, but my own cluster is not in production right now, so I cannot compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The number could be deemed reasonable for a mid-to-large sized production level cluster but i stated in a comment to others would adjust this value based on their specifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can decrease it so it would be useful for almost all clusters right out of the box and in case of a larger cluster, the value could be raised.
_data/rules.yml
Outdated
- name: ClickHouse Authentication Failures | ||
description: Authentication failures detected, indicating potential security issues or misconfiguration. | ||
query: "increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0" | ||
severity: critical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would use "warning" or "info" instead
A critical alert must be checked early, this rule is "just" a security notification.
WDYT ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, critical might sound a little excessive here :)
Thanks for your first contribution! I made some comments ;) |
Added reasonable time periods for each query to avoid false positives and in some cased give the system a short window to try to solve the issue. Also changed the severity level of authentication alerts from critical to info which seems more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made excellent points.
Cheers
I made a few adjustments in time periods. See if they seem reasonable or not
IMHO, replication alerts must be sent right away.
I made some changes to the alerts. |
Added alerts for Clickhosue