Added Clickhouse #412

Merged
merged 4 commits into from
May 13, 2024
68 changes: 68 additions & 0 deletions _data/rules.yml
@@ -1300,6 +1300,74 @@ groups:
                severity: critical
                for: 2m

      - name: Clickhouse
        exporters:
          - name: Embedded Exporter
            slug: embedded-exporter
            doc_url: https://clickhouse.com/docs/en/operations/system-tables/metrics
            rules:
              - name: ClickHouse Memory Usage Critical
                description: Memory usage is critically high, over 90%.
                query: "ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90"
                severity: critical
              - name: ClickHouse Memory Usage Warning
                description: Memory usage is over 80%.
                query: "ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80"
                severity: warning
              - name: ClickHouse Disk Space Low on Default
                description: Disk space on default is below 20%.
                query: "ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20"
                severity: warning
              - name: ClickHouse Disk Space Critical on Default
                description: Disk space on default disk is critically low, below 10%.
                query: "ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10"
                severity: critical
              - name: ClickHouse Disk Space Low on Backups
                description: Disk space on backups is below 20%.
                query: "ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20"
                severity: warning
              - name: ClickHouse Replica Errors
                description: Critical replica errors detected, either all replicas are stale or lost.
                query: "ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1"
                severity: critical
              - name: ClickHouse No Available Replicas
                description: No available replicas in ClickHouse.
                query: "ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1"
Owner

@samber samber May 12, 2024

I would add a for: 3m as well.

A short unavailability (a restart?) seems OK to me.

Same thing for the query below.

Contributor Author

Yes, this could help avoid false positives as well.
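
The suggestion in this thread can be sketched as the rule below. It is only an illustration, assuming the generated rule keeps the alert name and expression from this PR and takes the 3m value from the review comment; it is not necessarily the form that was merged.

```yaml
groups:
  - name: EmbeddedExporter
    rules:
      - alert: ClickHouseNoAvailableReplicas
        expr: 'ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1'
        # Assumption: 3m comes from the review suggestion above.
        # The alert stays "pending" and only fires once the expression
        # has been true at every evaluation for 3 minutes, so a short
        # restart should not trigger a notification.
        for: 3m
        labels:
          severity: critical
```

With for: 0m, as in the diff, the alert fires on the first evaluation where the expression is true.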

                severity: critical
              - name: ClickHouse No Live Replicas
                description: There are too few live replicas available, risking data loss and service disruption.
                query: "ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1"
                severity: critical
              - name: ClickHouse High Network Traffic
                description: Network traffic is unusually high, may affect cluster performance.
                query: "ClickHouseMetrics_NetworkSend > 1000 or ClickHouseMetrics_NetworkReceive > 1000"
Owner

This default value seems very high to me, but my own cluster is not in production right now, so I cannot compare.

Contributor Author

The number could be deemed reasonable for a mid-to-large production cluster, but I stated in a comment that others should adjust this value based on their own specifications.

Contributor Author

I can decrease it so that it is useful for almost all clusters right out of the box; for a larger cluster, the value could be raised.
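
A lowered default along the lines discussed above might look like the sketch below. The value 250 is a hypothetical placeholder and does not come from this thread; the metric names and the rest of the rule are copied from this PR. As far as the ClickHouse system.metrics documentation describes them, NetworkSend and NetworkReceive count threads currently sending or receiving data rather than bytes, so the threshold is a count, not a bandwidth.

```yaml
groups:
  - name: EmbeddedExporter
    rules:
      - alert: ClickHouseNetworkUsageHigh
        # Assumption: 250 is an illustrative placeholder threshold.
        # Tune it to the size of the cluster being monitored.
        expr: 'ClickHouseMetrics_NetworkSend > 250 or ClickHouseMetrics_NetworkReceive > 250'
        for: 5m
        labels:
          severity: warning
```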

                severity: warning
                comments: |
                  Please replace the threshold with an appropriate value
              - name: ClickHouse High TCP Connections
                description: High number of TCP connections, indicating heavy client or inter-cluster communication.
                query: "ClickHouseMetrics_TCPConnection > 1500"
                severity: warning
                comments: |
                  Please replace the threshold with an appropriate value
              - name: ClickHouse Interserver Connection Issues
                description: An increase in interserver connections may indicate replication or distributed query handling issues.
                query: "increase(ClickHouseMetrics_InterserverConnection[5m]) > 0"
                severity: warning
              - name: ClickHouse ZooKeeper Connection Issues
                description: ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.
                query: "avg(ClickHouseMetrics_ZooKeeperSession) != 1"
                severity: warning
              - name: ClickHouse Authentication Failures
                description: Authentication failures detected, indicating potential security issues or misconfiguration.
                query: "increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0"
                severity: critical
Owner

I would use "warning" or "info" instead.

A critical alert must be checked early; this rule is "just" a security notification.

WDYT?

Contributor Author

Sure, critical might sound a little excessive here :)
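
The severity change agreed on here could be sketched as follows. This assumes the "warning" level from the review suggestion ("info" was the other option raised) and keeps the expression and for value from this PR; the rule as finally merged may differ.

```yaml
groups:
  - name: EmbeddedExporter
    rules:
      - alert: ClickHouseAuthenticationFailures
        expr: 'increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0'
        for: 0m
        labels:
          # Assumption: "warning" is chosen from the two levels proposed
          # in the review; this is a security notification, not an outage.
          severity: warning
```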

              - name: ClickHouse Access Denied Errors
                description: Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.
                query: "increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0"
                severity: critical


      - name: Zookeeper
        exporters:
          - name: cloudflare/kafka_zookeeper_exporter
131 changes: 131 additions & 0 deletions dist/rules/clickhouse/embedded-exporter.yml
@@ -0,0 +1,131 @@
groups:
  - name: EmbeddedExporter
    rules:
      - alert: ClickHouseMemoryUsageCritical
        expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 90'
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: ClickHouse Memory Usage Critical (instance {{ $labels.instance }})
          description: "Memory usage is critically high, over 90%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseMemoryUsageWarning
        expr: 'ClickHouseAsyncMetrics_CGroupMemoryUsed / ClickHouseAsyncMetrics_CGroupMemoryTotal * 100 > 80'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: ClickHouse Memory Usage Warning (instance {{ $labels.instance }})
          description: "Memory usage is over 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseDiskSpaceLowDefault
        expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 20'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: ClickHouse Disk Space Low on Default (instance {{ $labels.instance }})
          description: "Disk space on default is below 20%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseDiskSpaceCriticalDefault
        expr: 'ClickHouseAsyncMetrics_DiskAvailable_default / (ClickHouseAsyncMetrics_DiskAvailable_default + ClickHouseAsyncMetrics_DiskUsed_default) * 100 < 10'
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: ClickHouse Disk Space Critical on Default Disk (instance {{ $labels.instance }})
          description: "Disk space on default disk is critically low, below 10%.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseDiskSpaceLowBackups
        expr: 'ClickHouseAsyncMetrics_DiskAvailable_backups / (ClickHouseAsyncMetrics_DiskAvailable_backups + ClickHouseAsyncMetrics_DiskUsed_backups) * 100 < 20'
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: ClickHouse Disk Space Low on Backups (instance {{ $labels.instance }})
          description: "Disk space on backups is below 20%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseReplicaErrors
        expr: 'ClickHouseErrorMetric_ALL_REPLICAS_ARE_STALE == 1 or ClickHouseErrorMetric_ALL_REPLICAS_LOST == 1'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: ClickHouse Replica Errors Detected (instance {{ $labels.instance }})
          description: "Critical replica errors detected, either all replicas are stale or lost.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseNoAvailableReplicas
        expr: 'ClickHouseErrorMetric_NO_AVAILABLE_REPLICA == 1'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: No Available Replicas in ClickHouse (instance {{ $labels.instance }})
          description: "No available replicas in ClickHouse.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseNoLiveReplicas
        expr: 'ClickHouseErrorMetric_TOO_FEW_LIVE_REPLICAS == 1'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: No Live Replicas in ClickHouse (instance {{ $labels.instance }})
          description: "There are too few live replicas available, risking data loss and service disruption.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"


      - alert: ClickHouseNetworkUsageHigh
        expr: 'ClickHouseMetrics_NetworkSend > 1000 or ClickHouseMetrics_NetworkReceive > 1000'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High Network Traffic in ClickHouse (instance {{ $labels.instance }})
          description: "Network traffic is unusually high, may affect cluster performance.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseHighTCPConnections
        expr: 'ClickHouseMetrics_TCPConnection > 1500'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High TCP Connections in ClickHouse (instance {{ $labels.instance }})
          description: "High number of TCP connections, indicating heavy client or inter-cluster communication.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseInterserverConnectionIssues
        expr: 'increase(ClickHouseMetrics_InterserverConnection[5m]) > 0'
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Interserver Connection Issues in ClickHouse (instance {{ $labels.instance }})
          description: "An increase in interserver connections may indicate replication or distributed query handling issues.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseZooKeeperConnectionIssues
        expr: 'avg(ClickHouseMetrics_ZooKeeperSession) != 1'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: ZooKeeper Connection Issues in ClickHouse (instance {{ $labels.instance }})
          description: "ClickHouse is experiencing issues with ZooKeeper connections, which may affect cluster state and coordination.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseAuthenticationFailures
        expr: 'increase(ClickHouseErrorMetric_AUTHENTICATION_FAILED[5m]) > 0'
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Authentication Failures in ClickHouse (instance {{ $labels.instance }})
          description: "Authentication failures detected, indicating potential security issues or misconfiguration.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

      - alert: ClickHouseAccessDeniedErrors
        expr: 'increase(ClickHouseErrorMetric_RESOURCE_ACCESS_DENIED[5m]) > 0'
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Access Denied Errors in ClickHouse (instance {{ $labels.instance }})
          description: "Access denied errors have been logged, which could indicate permission issues or unauthorized access attempts.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"