Panel | Expression | Level | Thresholds | Description |
---|---|---|---|---|
$job:Overall total 5m load & average CPU used% | avg(1 - avg(irate(node_cpu_seconds_total{job=~"node_exporter",mode="idle"}[5m])) by (instance)) * 100 | p0 | >= 90% | CPU utilization |
$job:Overall total 5m load & average CPU used% | sum(node_load5{job=~"node_exporter"}) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) | p0 | >= 0.90 | CPU load5 |
$job:Overall total memory & average memory used% | (sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) / sum(node_memory_MemTotal_bytes{job=~"node_exporter"})) * 100 | p0 | >= 90% | Memory utilization |
$job:Overall total disk & average disk used% | (sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance))) * 100 / (sum(avg(node_filesystem_avail_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) + (sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs\|ext.*"}) by (device,instance)))) | p0 | >= 90% | Disk utilization over 90% |
CPU% Basic | (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) *100 | p0 | >= 90% | Node CPU utilization |
Memory Basic | (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))* 100 | p0 | >= 90% | Node memory utilization |
System Load | sum(node_load5) by (instance) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) by (instance) | p0 | >= 0.90 | Node CPU load5 |
Disk Space Used% Basic | (node_filesystem_size_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*\|xfs",mountpoint!~".*pod.*"})) | p0 | >= 90% | Node disk utilization |
Time Spent Doing I/Os | irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m]) | p0 | 90% | Node I/O utilization |
Axon Status | up{job="axon_exporter"} == 0 | p0 | == 1 | AXON service status is down |
Node Status | up{job="node_exporter"} == 0 | p0 | == 1 | node_exporter service status is down |
Prometheus Status | up{job="prometheus"} == 0 | p0 | == 1 | Prometheus service status is down |
Jaeger Status | up{instance=~"(.*):16687"} == 0 | p0 | == 1 | jaeger-query service status is down |
Jaeger Status | up{instance=~"(.*):14269"} == 0 | p0 | == 1 | jaeger-collector service status is down |
Jaeger Agent Status | up{job="jaeger_agent"} == 0 | p0 | == 1 | jaeger-agent service is down |
Loki Status | up{job="loki"} == 0 | p0 | == 1 | loki service is down |
Promtail Status | count(count_over_time({job="axon"}[5m])) by (hostip) | p0 | == 1 | Promtail service status is down |
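
As an illustration of how rows from the table above could be turned into alerts, here is a minimal sketch of a Prometheus rule file. It assumes the thresholds are evaluated by Prometheus alerting rules; the group name, alert names, `for` durations, and the `severity` label are hypothetical and not taken from this document. Only the expressions and thresholds come from the "CPU% Basic" and "Node Status" rows.

```yaml
groups:
  - name: node-resource-alerts          # hypothetical group name
    rules:
      # "CPU% Basic" row: fire when per-instance CPU utilization is >= 90%.
      - alert: NodeCpuUtilizationHigh   # hypothetical alert name
        expr: '(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 >= 90'
        for: 5m                         # assumed hold time, not from the table
        labels:
          severity: p0                  # maps the table's Level column (assumed label name)
        annotations:
          summary: "Node CPU utilization is at or above 90%"

      # "Node Status" row: fire when the node_exporter scrape target is down.
      - alert: NodeExporterDown         # hypothetical alert name
        expr: 'up{job="node_exporter"} == 0'
        for: 1m                         # assumed hold time, not from the table
        labels:
          severity: p0
        annotations:
          summary: "node_exporter service status is down"
```

The threshold is folded into `expr` because a Prometheus alerting rule fires for every series its expression returns; if the thresholds are instead configured on dashboard panels, the expression would stay as listed in the table and the comparison would live in the panel's alert condition.
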

Panel | Expression | Level | Thresholds | Description |
---|---|---|---|---|
TPS | avg(rate(axon_consensus_committed_tx_total[5m])) | p2 | 0 | TPS |
exec_p90 | avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance))) | p2 | >= 2.4 | exec_90 |
consensus_round_cost | (axon_consensus_round > 0) | p1 | >= 5 | Rounds of consensus |
consensus_p90 | avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance))) / avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance))) | p1 | 1.1 | exec time is greater than consensus time |
Liveness | increase(axon_consensus_height{job="axon_exporter"}[1m]) | p0 | 0 | Loss of liveness: no increase in height |
Liveness | up{job="axon_exporter"} == 1 | | 1 | |
synced_block | changes(axon_consensus_sync_block_total[10m]) / changes(axon_consensus_height [10m]) | p1 | 1/1000? 10 min | Proportion of sync blocks |
Connected Consensus Peers | (sum(axon_network_tagged_consensus_peers) by (instance) - 1) - sum(axon_network_connected_consensus_peers) by (instance) | p0 | 1 | Consensus network disconnect |
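
The two Liveness expressions above read naturally as one condition: the block height has stopped increasing while the exporter is still reachable. A minimal sketch of that combination as a Prometheus alerting rule follows. Joining the two expressions with `and on (instance)` is an assumption about the intent, and the group name, alert name, `for` duration, and `severity` label are hypothetical.

```yaml
groups:
  - name: axon-consensus-alerts         # hypothetical group name
    rules:
      - alert: AxonLivenessLost         # hypothetical alert name
        # Height did not increase over the last minute (threshold 0)
        # AND the axon_exporter target on the same instance is up (threshold 1).
        expr: >-
          increase(axon_consensus_height{job="axon_exporter"}[1m]) == 0
          and on (instance)
          up{job="axon_exporter"} == 1
        for: 1m                         # assumed hold time, not from the table
        labels:
          severity: p0                  # maps the table's Level column (assumed label name)
        annotations:
          summary: "Loss of liveness: no increase in height"
```

Requiring `up == 1` keeps this rule from firing when the exporter itself is down, a case the separate `up{job="axon_exporter"} == 0` status check already covers.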