This repository has been archived by the owner on Nov 7, 2023. It is now read-only.

# Axon alert table

## axon-node

| Panel | Expression | Level | Thresholds | Description |
| --- | --- | --- | --- | --- |
| $job:Overall total 5m load & average CPU used% | `avg(1 - avg(irate(node_cpu_seconds_total{job=~"node_exporter",mode="idle"}[5m])) by (instance)) * 100` | p0 | >= 90% | CPU utilization |
| | `sum(node_load5{job=~"node_exporter"}) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'})` | p0 | >= 0.90 | CPU load5 |
| $job:Overall total memory & average memory used% | `(sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) / sum(node_memory_MemTotal_bytes{job=~"node_exporter"})) * 100` | p0 | >= 90% | Memory utilization |
| $job:Overall total disk & average disk used% | `(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance))) * 100 / (sum(avg(node_filesystem_avail_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) + (sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance))))` | p0 | >= 90% | Over 90% utilization of disk |
| CPU% Basic | `(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100` | p0 | >= 90% | Node CPU utilization |
| Memory Basic | `(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100` | p0 | >= 90% | Node memory utilization |
| System Load | `sum(node_load5) by (instance) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) by (instance)` | p0 | >= 0.90 | Node CPU load5 |
| Disk Space Used% Basic | `(node_filesystem_size_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"}))` | p0 | >= 90% | Node disk utilization |
| Time Spent Doing I/Os | `irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m])` | p0 | 90% | Node I/O utilization |
| Axon Status | `up{job="axon_exporter"} == 0` | p0 | == 1 | Axon service is down |
| Node Status | `up{job="node_exporter"} == 0` | p0 | == 1 | node_exporter service is down |
| Prometheus Status | `up{job="prometheus"} == 0` | p0 | == 1 | Prometheus service is down |
| Jaeger Status | `up{instance=~"(.*):16687"} == 0` | p0 | == 1 | jaeger-query service is down |
| | `up{instance=~"(.*):14269"} == 0` | p0 | == 1 | jaeger-collector service is down |
| Jaeger Agent Status | `up{job="jaeger_agent"} == 0` | p0 | == 1 | jaeger-agent service is down |
| Loki Status | `up{job="loki"} == 0` | p0 | == 1 | loki service is down |
| Promtail Status | `count(count_over_time({job="axon"}[5m])) by (hostip)` | p0 | == 1 | Promtail service is down |
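
The two disk rows divide used space by `avail + used` rather than by the raw filesystem size. A likely reason is that `node_filesystem_free_bytes` includes blocks reserved for root while `node_filesystem_avail_bytes` does not, so this form tracks what an unprivileged process can actually fill. The arithmetic can be sketched in Python as below; the function names and sample byte counts are illustrative assumptions, not part of the dashboards.

```python
def disk_used_percent(size_bytes: float, free_bytes: float, avail_bytes: float) -> float:
    """Mirror the table's disk expression: (size - free) * 100 / (avail + (size - free))."""
    used = size_bytes - free_bytes
    return used * 100 / (avail_bytes + used)


def breaches_p0(used_percent: float, threshold: float = 90.0) -> bool:
    """Apply the table's '>= 90%' threshold to a utilization value."""
    return used_percent >= threshold


# Hypothetical filesystem: 1000 bytes total, 200 free, 150 available to non-root users.
healthy = disk_used_percent(1000, 200, 150)   # 800 * 100 / 950
# Same filesystem after more writes: 100 free, 50 available.
full = disk_used_percent(1000, 100, 50)       # 900 * 100 / 950

print(round(healthy, 1), breaches_p0(healthy))
print(round(full, 1), breaches_p0(full))
```

With these sample values the first filesystem stays under the threshold and the second crosses it, even though its raw `used / size` ratio is exactly 90%.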

## axon-benchmark

| Panel | Expression | Level | Thresholds | Description |
| --- | --- | --- | --- | --- |
| TPS | `avg(rate(axon_consensus_committed_tx_total[5m]))` | p2 | 0 | TPS |
| exec_p90 | `avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))` | p2 | >= 2.4 | exec_90 |
| consensus_round_cost | `(axon_consensus_round > 0)` | p1 | >= 5 | Rounds of consensus |
| consensus_p90 | `avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance))) / avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))` | p1 | 1.1 | exec time is greater than consensus time |
| Liveness | `increase(axon_consensus_height{job="axon_exporter"}[1m])` | p0 | 0 | Loss of liveness, no increase in height |
| | `up{job="axon_exporter"} == 1` | | 1 | |
| synced_block | `changes(axon_consensus_sync_block_total[10m]) / changes(axon_consensus_height[10m])` | p1 | 1/1000? 10 min | Proportion of sync blocks |
| Connected Consensus Peers | `(sum(axon_network_tagged_consensus_peers) by (instance) - 1) - sum(axon_network_connected_consensus_peers) by (instance)` | p0 | 1 | Consensus network disconnect |
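
The Connected Consensus Peers expression subtracts the connected peer count from the tagged peer count minus one. A minimal Python sketch of that arithmetic follows. It rests on two assumptions the table does not state: that the `- 1` excludes the node itself from the tagged set, and that the threshold `1` means "alert when at least one peer is missing". The function names are hypothetical.

```python
def missing_consensus_peers(tagged_peers: int, connected_peers: int) -> int:
    """Mirror the table's expression: (tagged - 1) - connected, per instance."""
    return (tagged_peers - 1) - connected_peers


def consensus_disconnected(tagged_peers: int, connected_peers: int, threshold: int = 1) -> bool:
    """Assumed reading of the threshold column: fire when missing peers >= 1."""
    return missing_consensus_peers(tagged_peers, connected_peers) >= threshold


# Hypothetical 4-validator network: each node tags 4 consensus peers (itself included).
print(missing_consensus_peers(4, 3), consensus_disconnected(4, 3))  # all 3 remote peers connected
print(missing_consensus_peers(4, 2), consensus_disconnected(4, 2))  # one remote peer lost
```

Under these assumptions a fully connected node yields 0 and does not alert, while losing a single remote peer yields 1 and does.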