This document describes the monitoring metrics. All headings correspond to the dashboard in Grafana.
- type: CPU
- description: Monitor overall CPU usage
Legend details
Number of cores for all CPUs
count(node_cpu_seconds_total{job=~"node_exporter", mode='system'})
load5 for all CPUs
sum(node_load5{job=~"node_exporter"})
Average utilization of all CPUs
avg(1 - avg(irate(node_cpu_seconds_total{job=~"node_exporter",mode="idle"}[5m])) by (instance)) * 100
Alert threshold: Utilization rate over 60%
load5 Avg for all CPUs
sum(node_load5{job=~"node_exporter"}) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'})
Alert threshold: Load5 Avg greater than 0.7
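The load5 Avg expression above divides the summed 5-minute load by the total number of cores, so a value of 0.7 means the run queue averages about 70% of the available cores. A minimal Python sketch of that arithmetic follows; the function name and the sample values are hypothetical and only illustrate the calculation.

```python
def load_per_core(load5_by_instance, cores_by_instance):
    """Mirror sum(node_load5) / count(node_cpu_seconds_total{mode='system'})."""
    total_cores = sum(cores_by_instance.values())
    if total_cores == 0:
        raise ValueError("no CPU cores reported")
    return sum(load5_by_instance.values()) / total_cores


# Hypothetical sample: two nodes with 4 cores each.
load5 = {"node-a:9100": 3.2, "node-b:9100": 2.8}
cores = {"node-a:9100": 4, "node-b:9100": 4}

ratio = load_per_core(load5, cores)
alert = ratio > 0.7  # same comparison as the alert threshold
print(round(ratio, 2), alert)
```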
- type: Memory
- description: Monitor overall memory usage
Legend details
Total memory
sum(node_memory_MemTotal_bytes{job=~"node_exporter"})
Overall used memory
sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"})
Utilization of all memory
(sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) / sum(node_memory_MemTotal_bytes{job=~"node_exporter"}))*100
Alert threshold: Utilization rate over 70%
- type: Disk
- description: Monitor overall disk usage
Legend details
Total disk
sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))
Overall used disk
sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))
Utilization of all disk
(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))))
Alert threshold: Utilization rate over 70%
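The disk utilization expression above computes used space as size minus free, then divides by avail plus used rather than by size. On Linux, free bytes include blocks reserved for root while avail bytes do not, so this denominator is the space an unprivileged user can actually fill. A small Python sketch of the same formula, with hypothetical sample values:

```python
def disk_utilization_percent(size_bytes, free_bytes, avail_bytes):
    """used * 100 / (avail + used), where used = size - free."""
    used = size_bytes - free_bytes
    usable = avail_bytes + used
    if usable <= 0:
        raise ValueError("filesystem reports no usable space")
    return used * 100 / usable


GIB = 1024 ** 3
# Hypothetical filesystem: 100 GiB total, 40 GiB free, 35 GiB available to non-root.
pct = disk_utilization_percent(100 * GIB, 40 * GIB, 35 * GIB)
print(round(pct, 1))
```

In this sample the result is slightly above the 60% that a plain used / size calculation would give, because the reserved blocks are excluded from the denominator.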
- type: Network
- description: Traffic statistics
Legend details
Receive statistics
increase(node_network_receive_bytes_total{instance=~"$node",device=~"$device"}[60m])
Transmit statistics
increase(node_network_transmit_bytes_total{instance=~"$node",device=~"$device"}[60m])
- type: CPU
- description: Node CPU usage
Legend details
Average sy ratio
avg(irate(node_cpu_seconds_total{instance=~"$node",mode="system"}[5m])) by (instance) *100
Average us ratio
avg(irate(node_cpu_seconds_total{instance=~"$node",mode="user"}[5m])) by (instance) *100
Average wa ratio
avg(irate(node_cpu_seconds_total{instance=~"$node",mode="iowait"}[5m])) by (instance) *100
Average CPU usage
(1 - avg(irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[5m])) by (instance))*100
Not shown, for alert
(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) *100
Alert threshold: Utilization rate over 60%
- type: Memory
- description: Node memory usage
Legend details
Total memory
node_memory_MemTotal_bytes{instance=~"$node"}
Used memory
node_memory_MemTotal_bytes{instance=~"$node"} - node_memory_MemAvailable_bytes{instance=~"$node"}
Available memory size
node_memory_MemAvailable_bytes{instance=~"$node"}
Utilization of all memory
(1 - (node_memory_MemAvailable_bytes{instance=~"$node"} / (node_memory_MemTotal_bytes{instance=~"$node"})))* 100
Not shown, for alert
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))* 100
Alert threshold: Utilization rate over 60%
- type: Network
- description: Network bandwidth
Legend details
Receive statistics per second
irate(node_network_receive_bytes_total{instance=~'$node',device=~"$device"}[5m])*8
Transmit statistics per second
irate(node_network_transmit_bytes_total{instance=~'$node',device=~"$device"}[5m])*8
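The bandwidth expressions above take irate of a byte counter and multiply by 8 to convert bytes per second into bits per second. irate uses only the last two samples inside the range window. The sketch below reproduces that calculation in Python under simplified assumptions; the function and the sample data are illustrative and are not part of Prometheus.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, value) samples.

    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    # After a counter reset the last value itself is taken as the increase.
    delta = v1 if v1 < v0 else v1 - v0
    return delta / (t1 - t0)


# Hypothetical scrapes of node_network_receive_bytes_total, 15 s apart.
samples = [(0, 1000), (15, 4000), (30, 11500)]
bits_per_second = irate(samples) * 8
print(bits_per_second)
```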
- type: CPU
- description: System Load
Legend details
Load 1
node_load1{instance=~"$node"}
Load 5
node_load5{instance=~"$node"}
Load 15
node_load15{instance=~"$node"}
Number of cores for CPU
sum(count(node_cpu_seconds_total{instance=~"$node", mode='system'}) by (cpu,instance)) by(instance)
load5 Avg for CPU
avg(node_load5{instance=~"$node"}) / count(node_cpu_seconds_total{instance=~"$node", mode='system'})
Not shown, for alert
sum(node_load5) by (instance) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) by (instance)
Alert threshold: Load5 Avg greater than 0.7
- type: Disk
- description: Disk throughput
Legend details
Read bytes
irate(node_disk_read_bytes_total{instance=~"$node"}[5m])
Written bytes
irate(node_disk_written_bytes_total{instance=~"$node"}[5m])
- type: Disk
- description: Disk space utilization
Legend details
Disk space utilization
(node_filesystem_size_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}))
- type: Disk
- description: IOPS
Legend details
Read IOPS
irate(node_disk_reads_completed_total{instance=~"$node"}[5m])
Write IOPS
irate(node_disk_writes_completed_total{instance=~"$node"}[5m])
- type: Disk
- description: I/O Utilization
Legend details
I/O Utilization
irate(node_disk_io_time_seconds_total{instance=~"$node"}[5m])
Not shown, for alert
irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m])
Alert threshold: Utilization rate over 80%
- type: Disk
- description: Average response time
Legend details
Read time
irate(node_disk_read_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_reads_completed_total{instance=~"$node"}[5m])
Write time
irate(node_disk_write_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_writes_completed_total{instance=~"$node"}[5m])
- type: Network
- description: Socket State
Legend details
Number of ESTABLISHED state connections
node_netstat_Tcp_CurrEstab{instance=~'$node'}
Number of TIME_WAIT state connections
node_sockstat_TCP_tw{instance=~'$node'}
Total number of all protocol sockets used
node_sockstat_sockets_used{instance=~'$node'}
Number of UDP sockets in use
node_sockstat_UDP_inuse{instance=~'$node'}
Number of TCP sockets (ESTABLISHED, sk_buff)
node_sockstat_TCP_alloc{instance=~'$node'}
Number of passively opened TCP connections
irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])
Number of actively opened TCP connections
irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])
Number of TCP messages received
irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])
Number of TCP messages transmitted
irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])
Number of TCP messages retransmitted
irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])
- type: Disk
- description: Open file descriptors and context switches
Legend details
Number of open file descriptors
node_filefd_allocated{instance=~"$node"}
Context switches
irate(node_context_switches_total{instance=~"$node"}[5m])
- type: Axon
- description: Axon service status
Legend details
Number of Axon services in up status
count(up{job="axon_exporter"} == 1)
Number of Axon services in down status
count(up{job="axon_exporter"} == 0)
Not shown, for alert
up{job="axon_exporter"} == 0
Alert threshold: the value of the up metric is zero
- type: Node_exporter
- description: Node_exporter service status
Legend details
Number of Node_exporter services in up status
count(up{job="node_exporter"} == 1)
Number of Node_exporter services in down status
count(up{job="node_exporter"} == 0)
Not shown, for alert
up{job="node_exporter"} == 0
Alert threshold: the value of the up metric is zero
- type: Prometheus
- description: Prometheus service status
Legend details
Number of Prometheus services in up status
count(up{job="prometheus"} == 1)
Number of Prometheus services in down status
count(up{job="prometheus"} == 0)
Not shown, for alert
up{job="prometheus"} == 0
Alert threshold: the value of the up metric is zero
- type: Jaeger
- description: Jaeger service status
Legend details
Number of Jaeger-query services in up status
count(up{instance=~"(.*):16687"} == 1)
Number of Jaeger-collector services in up status
count(up{instance=~"(.*):14269"} == 1)
Number of Jaeger-query services in down status
count(up{instance=~"(.*):16687"} == 0)
Number of Jaeger-collector services in down status
count(up{instance=~"(.*):14269"} == 0)
Not shown, for alert
up{instance=~"(.*):16687"} == 0
Alert threshold: the value of the up metric is zero
Not shown, for alert
up{instance=~"(.*):14269"} == 0
Alert threshold: the value of the up metric is zero
- type: Jaeger
- description: Jaeger agent status
Legend details
Number of Jaeger-agent services in up status
count(up{job="jaeger_agent"} == 1)
Number of Jaeger-agent services in down status
count(up{job="jaeger_agent"} == 0)
Not shown, for alert
up{job="jaeger_agent"} == 0
Alert threshold: the value of the up metric is zero
- type: Loki
- description: Loki service status
Legend details
Number of Loki services in up status
count(up{job="loki"} == 1)
Number of Loki services in down status
count(up{job="loki"} == 0)
Not shown, for alert
up{job="loki"} == 0
Alert threshold: the value of the up metric is zero
- type: Promtail
- description: Promtail service status
Legend details
Number of Promtail services in up status
count(up{job="promtail_agent"} == 1)
Number of Promtail services in down status
count(up{job="promtail_agent"} == 0)
Not shown, for alert
up{job="promtail_agent"} == 0
Alert threshold: the value of the up metric is zero
- description: TPS for consensus
Legend details
TPS for consensus
avg(rate(axon_consensus_committed_tx_total[5m]))
Alert threshold: TPS is zero
- description: Consensus time for P90
Legend details
Consensus time for P90
avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance)))
Not shown, for alert
avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance))) / avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))
Alert threshold: More than three rounds of consensus
- description: Consensus exec time for P90
Legend details
Consensus exec time for P90
avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))
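The P90 expressions above rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The Python sketch below is a simplified re-implementation of that idea for classic histograms, not the Prometheus source; the bucket boundaries and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) buckets.

    The bucket list must end with a +Inf bucket, as classic histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    if idx == 0:
        if upper <= 0:
            return upper
        lower, below = 0.0, 0.0
    else:
        lower, below = buckets[idx - 1]
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical consensus durations in seconds (cumulative counts per bucket).
buckets = [(0.5, 10), (1.0, 60), (2.0, 80), (4.0, 100), (math.inf, 100)]
p90 = histogram_quantile(0.90, buckets)
print(p90)
```

Because the estimate is interpolated, its precision depends on how finely the buckets are spaced around the quantile of interest.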
- description: Average time per block for rocksdb running put_cf
Legend details
Average time per block for rocksdb running put_cf
avg (sum by (instance) (increase(axon_storage_put_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m]))
- description: Average time per block for rocksdb running get_cf
Legend details
Average time per block for rocksdb running get_cf
avg (sum by (instance) (increase(axon_storage_get_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m]))
- description: Received transaction request count in the last 5 minutes (the unit is count/second)
Legend details
Total number of transaction requests
sum(rate(axon_api_request_result_total{type="send_transaction"}[5m]))
Total number of successful transaction requests
sum(rate(axon_api_request_result_total{result="success",type="send_transaction"}[5m]))
Processed transaction request count in the last 5 minutes (the unit is count/second)
rate(axon_api_request_result_total{result="success", type="send_transaction"}[5m])
- description: Chain current height
- description: Liveness
Legend details
Growth in node height
increase(axon_consensus_height{job="axon_exporter"}[1m])
Loss of Liveness
Not shown, for alert
up{job="axon_exporter"} == 1
Loss of Liveness
- description: Number of blocks synchronized by nodes
- description: Estimate the network message arrival rate in the last five minutes
Legend details
Estimate the network message arrival rate in the last five minutes
(
# broadcast_count * (instance_count - 1)
sum(increase(axon_network_message_total{target="all", direction="sent"}[5m])) * (count(count by (instance) (axon_network_message_total)) - 1)
# unicast_count
+ sum(increase(axon_network_message_total{target="single", direction="sent"}[5m]))
)
/
# received_count
(sum(increase(axon_network_message_total{direction="received"}[5m])))
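As the expression above is written, the numerator is the number of deliveries one would expect (each broadcast reaches every other instance, each unicast reaches one) and the denominator is the number of messages actually received. A value near 1 suggests that most messages arrived, and a value above 1 suggests that some were not received. A short Python sketch of the same arithmetic, using hypothetical counts over a 5-minute window:

```python
def expected_over_received(broadcast_sent, unicast_sent, received, instance_count):
    """Mirror (broadcast * (instances - 1) + unicast) / received."""
    expected = broadcast_sent * (instance_count - 1) + unicast_sent
    if received == 0:
        return None  # the PromQL division would yield NaN or Inf here
    return expected / received


# Hypothetical 4-node network.
ratio_ok = expected_over_received(100, 50, 350, 4)
ratio_lossy = expected_over_received(100, 50, 340, 4)
print(ratio_ok, round(ratio_lossy, 3))
```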
- description: Number of rounds needed to reach consensus
Legend details
Number of rounds needed to reach consensus
(axon_consensus_round > 0 )
Alert threshold: More than three rounds of consensus
- description: Number of transactions in the current mempool
- description: Number of nodes on the current connection
Legend details
Total number of peers
max(axon_network_saved_peer_count)
Number of nodes on the current connection
axon_network_connected_peers
- description: Number of consensus nodes
Legend details
Total number of consensus peers
max(axon_network_tagged_consensus_peers)
Number of consensus nodes
axon_network_connected_consensus_peers
Number of consensus peers not connected
(sum(axon_network_tagged_consensus_peers) by (instance) - 1) - sum(axon_network_connected_consensus_peers) by (instance)
Alert on loss of connection to a consensus node
- description: Number of nodes saved peers
- description: The number of connections in the handshake, requiring verification of the chain id
Legend details
The number of connections in the handshake, requiring verification of the chain id
axon_network_unidentified_connections
- description: Number of active initiations to establish connections with other machines
Legend details
Number of active initiations to establish connections with other machines
axon_network_outbound_connecting_peers
- description: Disconnected count
- description: Number of messages being processed
Legend details
Number of messages being processed
axon_network_received_message_in_processing_guage
- description: Number of messages being processed (based on IP of received messages)
Legend details
Number of messages being processed (based on IP of received messages)
axon_network_received_ip_message_in_processing_guage{instance=~"$node"}
- description: p90 for P2p Ping
Legend details
p90 for P2p Ping
avg(histogram_quantile(0.90, sum(rate(axon_network_ping_in_ms_bucket[5m])) by (le, instance)))
- description: Peer give up warnings
link axon-node (Network bandwidth usage per second all)
link axon-node (Internet traffic per hour)
link axon-benchmark (mempool_cached_tx)
link axon-benchmark (consensus_round_cost)
link axon-benchmark (current_height)
link axon-benchmark (synced_block)
link axon-benchmark (processed_tx_request)
- description: Height of consensus and rounds of consensus
Legend details
Height of consensus
axon_consensus_height{instance=~"$node"}
Rounds of consensus
(axon_consensus_round{instance=~"$node"} > 0 )
- description: Network transmission size statistics
Legend details
Send statistics per node
sum(url:axon_network_message_size:sum5m{direction="send"}) by (instance)
Receive statistics per node
sum(url:axon_network_message_size:sum5m{direction="received"}) by (instance)
Total send
sum(url:axon_network_message_size:sum5m{direction="send"})
Total received
sum(url:axon_network_message_size:sum5m{direction="received"})
- description: Statistics on the number and size of network send
Legend details
Size of each url send
sum(url:axon_network_message_size:sum5m{direction="send",instance=~"$node"}) by (url)
Count of each url send
sum(rate(axon_network_message_total{direction="sent"}[5m])) by (action)
- description: Statistics on the number and size of network send