Monitor metrics description

This document describes the monitoring metrics. Each heading corresponds to a dashboard or panel in Grafana.

axon-node

Resource Overview

Overall total 5m load & average CPU used

type: CPU
description: Monitor overall CPU usage

Legend details

CPU Cores

Number of cores for all CPUs

count(node_cpu_seconds_total{job=~"node_exporter", mode='system'})

Total 5m load

load5 for all CPUs

sum(node_load5{job=~"node_exporter"})

Overall average used%

Average utilization of all CPUs

avg(1 - avg(irate(node_cpu_seconds_total{job=~"node_exporter",mode="idle"}[5m])) by (instance)) * 100

Alert threshold:

Utilization rate over 60%

Load5 Avg

load5 Avg for all CPUs

sum(node_load5{job=~"node_exporter"}) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'})

Alert threshold:

Load5 Avg greater than 0.7
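The Load5 Avg expression above divides the summed 5-minute load by the total number of CPU cores. A minimal Python sketch of that arithmetic follows; the node names and sample values are hypothetical and only illustrate how the 0.7 threshold is applied.

```python
def load5_avg(load5_by_node, cores_by_node):
    """Summed 5-minute load divided by the total core count."""
    total_cores = sum(cores_by_node.values())
    if total_cores == 0:
        raise ValueError("no CPU cores reported")
    return sum(load5_by_node.values()) / total_cores

ALERT_THRESHOLD = 0.7  # taken from the alert threshold in this section

# Hypothetical sample values for three nodes.
load5 = {"node-a": 2.0, "node-b": 3.5, "node-c": 2.5}
cores = {"node-a": 4, "node-b": 4, "node-c": 2}

ratio = load5_avg(load5, cores)
alerting = ratio > ALERT_THRESHOLD
print(f"Load5 Avg = {ratio:.2f}, alert = {alerting}")
```

A value near 1.0 means roughly one runnable task per core on average, which is why the threshold sits below 1.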

Overall total memory & average memory used

type: Memory
description: Monitor overall memory usage

Legend details

Total

Total memory

sum(node_memory_MemTotal_bytes{job=~"node_exporter"})

Total Used

Overall used memory

sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"})

Overall Average Used%

Utilization of all memory

(sum(node_memory_MemTotal_bytes{job=~"node_exporter"} - node_memory_MemAvailable_bytes{job=~"node_exporter"}) / sum(node_memory_MemTotal_bytes{job=~"node_exporter"}))*100

Alert threshold:

Utilization rate over 70%

Overall total disk & average disk used%

type: Disk
description: Monitor overall disk usage

Legend details

Total

Total disk size

sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))

Total Used

Overall used disk

sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))

Overall Average Used%

Utilization of all disks

(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~"node_exporter",fstype=~"xfs|ext.*"})by(device,instance))))

Alert threshold:

Utilization rate over 70%
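The disk utilization expression above computes used space as size minus free, then divides by available plus used rather than by size. This excludes space that is free but not available to unprivileged users (for example reserved blocks on ext filesystems), which matches how `df` reports Use%. A minimal sketch of the same arithmetic, with hypothetical byte counts:

```python
GIB = 1024 ** 3

def disk_used_percent(size_bytes, free_bytes, avail_bytes):
    """Used% as in the dashboard query: used * 100 / (avail + used)."""
    used = size_bytes - free_bytes
    return used * 100 / (avail_bytes + used)

# Hypothetical filesystem: 100 GiB total, 30 GiB free, 25 GiB available to non-root users.
pct = disk_used_percent(100 * GIB, 30 * GIB, 25 * GIB)
naive_pct = (100 * GIB - 30 * GIB) * 100 / (100 * GIB)

print(f"dashboard used% = {pct:.2f}, naive used/size% = {naive_pct:.2f}")
```

With these sample numbers the dashboard formula crosses the 70% threshold while a plain used/size ratio does not.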

Resource Details

Internet traffic per hour

type: Network
description: Traffic statistics

Legend details

receive

Receive statistics

increase(node_network_receive_bytes_total{instance=~"$node",device=~"$device"}[60m])

transmit

Transmit statistics

increase(node_network_transmit_bytes_total{instance=~"$node",device=~"$device"}[60m])

CPU% Basic

type: CPU
description: Node CPU usage

Legend details

System

Average sy ratio

avg(irate(node_cpu_seconds_total{instance=~"$node",mode="system"}[5m])) by (instance) *100

User

Average user ratio

avg(irate(node_cpu_seconds_total{instance=~"$node",mode="user"}[5m])) by (instance) *100

Iowait

Average iowait ratio

avg(irate(node_cpu_seconds_total{instance=~"$node",mode="iowait"}[5m])) by (instance) *100

Total

Average CPU usage

(1 - avg(irate(node_cpu_seconds_total{instance=~"$node",mode="idle"}[5m])) by (instance))*100

Average used%

Not shown; used for alerting

(1 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) *100

Alert threshold:

Utilization rate over 60%

Memory Basic

type: Memory
description: Node memory usage

Legend details

Total

Total memory

node_memory_MemTotal_bytes{instance=~"$node"}

Used

Used memory

node_memory_MemTotal_bytes{instance=~"$node"} - node_memory_MemAvailable_bytes{instance=~"$node"}

Available

Available memory size

node_memory_MemAvailable_bytes{instance=~"$node"}

Used%

Utilization of all memory

(1 - (node_memory_MemAvailable_bytes{instance=~"$node"} / (node_memory_MemTotal_bytes{instance=~"$node"})))* 100

{{instance}}-Used%

Not shown; used for alerting

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))* 100

Alert threshold:

Utilization rate over 60%

Network bandwidth usage per second all

type: Network
description: Network bandwidth

Legend details

receive

Receive statistics per second

irate(node_network_receive_bytes_total{instance=~'$node',device=~"$device"}[5m])*8

transmit

Transmit statistics per second

irate(node_network_transmit_bytes_total{instance=~'$node',device=~"$device"}[5m])*8
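The two bandwidth queries above apply `irate` to a byte counter and multiply by 8 to get bits per second. `irate` uses only the last two samples in the range window. The sketch below mirrors that calculation on hypothetical (timestamp, counter value) pairs; the reset handling is a simplified assumption about how a counter restart is treated.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    # Assumption: on a counter reset the counter restarted from zero.
    delta = v2 - v1 if v2 >= v1 else v2
    return delta / (t2 - t1)

# Hypothetical scrapes of a received-bytes counter, 15 seconds apart.
samples = [(0, 1_000_000), (15, 1_150_000), (30, 1_450_000)]

bytes_per_second = irate(samples)
bits_per_second = bytes_per_second * 8
print(f"{bits_per_second:.0f} bit/s")
```

Only the last two scrapes contribute, so the result reacts quickly to bursts but ignores the earlier part of the 5m window.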

System Load

type: CPU
description: System Load

Legend details

1m

Load 1

node_load1{instance=~"$node"}

5m

Load 5

node_load5{instance=~"$node"}

15m

Load 15

node_load15{instance=~"$node"}

CPU cores

Number of cores for CPU

sum(count(node_cpu_seconds_total{instance=~"$node", mode='system'}) by (cpu,instance)) by(instance)

Load5 Avg

load5 Avg for CPU

avg(node_load5{instance=~"$node"}) / count(node_cpu_seconds_total{instance=~"$node", mode='system'})

Load5 Avg-{{instance}}

Not shown; used for alerting

sum(node_load5) by (instance) / count(node_cpu_seconds_total{job=~"node_exporter", mode='system'}) by (instance)

Alert threshold: Load5 Avg greater than 0.7

Disk R/W Data

type: Disk
description: Disk throughput

Legend details

Read bytes

irate(node_disk_read_bytes_total{instance=~"$node"}[5m])

Written bytes

irate(node_disk_written_bytes_total{instance=~"$node"}[5m])

Disk Space Used% Basic

type: Disk
description: Disk space utilization

Legend details

mountpoint

Disk space utilization

(node_filesystem_size_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}-node_filesystem_free_bytes{instance=~'$node',fstype=~"ext.*|xfs",mountpoint !~".*pod.*"}))

Disk IOps Completed（IOPS）

type: Disk
description: IOPS

Legend details

Reads completed

Read IOPS

irate(node_disk_reads_completed_total{instance=~"$node"}[5m])

Writes completed

Write IOPS

irate(node_disk_writes_completed_total{instance=~"$node"}[5m])

Time Spent Doing I/Os

type: Disk
description: I/O Utilization

Legend details

IO time

I/O Utilization

irate(node_disk_io_time_seconds_total{instance=~"$node"}[5m])

{{instance}}-%util

Not shown; used for alerting

irate(node_disk_io_time_seconds_total{instance=~"(.*):9100"}[5m])

Alert threshold: Utilization rate over 80%

Disk R/W Time(Reference: less than 100ms)(beta)

type: Disk
description: Average response time

Legend details

Read time

irate(node_disk_read_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_reads_completed_total{instance=~"$node"}[5m])

Write time

irate(node_disk_write_time_seconds_total{instance=~"$node"}[5m]) / irate(node_disk_writes_completed_total{instance=~"$node"}[5m])
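Both queries above divide the rate of time spent on I/O by the rate of completed operations, which gives the average time per operation. A minimal sketch with hypothetical counter readings, compared against the 100 ms reference from the panel title:

```python
def avg_io_time_seconds(time_before, time_after, ops_before, ops_after):
    """Average seconds per completed I/O between two counter readings."""
    ops = ops_after - ops_before
    if ops == 0:
        return None  # no operations completed; the PromQL division yields NaN here
    return (time_after - time_before) / ops

REFERENCE_SECONDS = 0.1  # "less than 100ms" from the panel title

# Hypothetical readings of the read-time and reads-completed counters for one device.
avg_read = avg_io_time_seconds(12.0, 12.6, 4000, 4030)
within_reference = avg_read < REFERENCE_SECONDS
print(f"average read time = {avg_read * 1000:.1f} ms, within reference = {within_reference}")
```

The scrape interval cancels out of the ratio, so only the two counter deltas matter.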

Network Sockstat

type: Network
description: Socket State

Legend details

CurrEstab

Number of ESTABLISHED state connections

node_netstat_Tcp_CurrEstab{instance=~'$node'}

TCP_tw status

Number of time_wait state connections

node_sockstat_TCP_tw{instance=~'$node'}

Sockets_used

Total number of all protocol sockets used

node_sockstat_sockets_used{instance=~'$node'}

UDP_inuse

Number of UDP sockets in use

node_sockstat_UDP_inuse{instance=~'$node'}

TCP_alloc

Number of allocated TCP sockets (ESTABLISHED, sk_buff)

node_sockstat_TCP_alloc{instance=~'$node'}

Tcp_PassiveOpens

Number of passively opened TCP connections per second

irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])

Tcp_ActiveOpens

Number of actively opened TCP connections per second

irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])

Tcp_InSegs

Number of TCP segments received per second

irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])

Tcp_OutSegs

Number of TCP segments transmitted per second

irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])

Tcp_RetransSegs

Number of TCP segments retransmitted per second

irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])

Open File Descriptor(left)/Context switches(right)

type: System
description: Open file descriptors and context switches

Legend details

used filefd

Number of open file descriptors

node_filefd_allocated{instance=~"$node"}

switches

Context switches

irate(node_context_switches_total{instance=~"$node"}[5m])

Actuator Health

Axon Status

type: Axon
description: Axon service status

Legend details

active

Number of Axon services in up status

count(up{job="axon_exporter"} == 1)

down

Number of Axon services in down status

count(up{job="axon_exporter"} == 0)

/

Not shown; used for alerting

up{job="axon_exporter"} == 0

Alert threshold:

The value of the Metric variable up is zero

Node Status

type: Node_exporter
description: Node_exporter service status

Legend details

active

Number of Node_exporter services in up status

count(up{job="node_exporter"} == 1)

down

Number of Node_exporter services in down status

count(up{job="node_exporter"} == 0)

down

Not shown; used for alerting

up{job="node_exporter"} == 0

Alert threshold:

The value of the Metric variable up is zero

Prometheus Status

type: Prometheus
description: Prometheus service status

Legend details

active

Number of Prometheus services in up status

count(up{job="prometheus"} == 1)

down

Number of Prometheus services in down status

count(up{job="prometheus"} == 0)

down

Not shown; used for alerting

up{job="prometheus"} == 0

Alert threshold:

The value of the Metric variable up is zero

Jaeger Status

type: Jaeger
description: Jaeger service status

Legend details

jaeger-query-active

Number of Jaeger-query services in up status

count(up{instance=~"(.*):16687"} == 1)

jaeger-collector-active

Number of Jaeger-collector services in up status

count(up{instance=~"(.*):14269"} == 1)

jaeger-query-down

Number of Jaeger-query services in down status

count(up{instance=~"(.*):16687"} == 0)

jaeger-collector-down

Number of Jaeger-collector services in down status

count(up{instance=~"(.*):14269"} == 0)

/

Not shown; used for alerting

up{instance=~"(.*):16687"} == 0

Alert threshold:

The value of the Metric variable up is zero

/

Not shown; used for alerting

up{instance=~"(.*):14269"} == 0

Alert threshold:

The value of the Metric variable up is zero

Jaeger Agent Status

type: Jaeger
description: Jaeger agent status

Legend details

active

Number of Jaeger-agent services in up status

count(up{job="jaeger_agent"} == 1)

down

Number of Jaeger-agent services in down status

count(up{job="jaeger_agent"} == 0)

/

Not shown; used for alerting

up{job="jaeger_agent"} == 0

Alert threshold:

The value of the Metric variable up is zero

Loki Status

type: Loki
description: Loki service status

Legend details

active

Number of Loki services in up status

count(up{job="loki"} == 1)

down

Number of Loki services in down status

count(up{job="loki"} == 0)

/

Not shown; used for alerting

up{job="loki"} == 0

Alert threshold:

The value of the Metric variable up is zero

Promtail Status

type: Promtail
description: Promtail service status

Legend details

active

Number of Promtail services in up status

count(up{job="promtail_agent"} == 1)

down

Number of Promtail services in down status

count(up{job="promtail_agent"} == 0)

/

Not shown; used for alerting

up{job="promtail_agent"} == 0

Alert threshold:

The value of the Metric variable up is zero

axon-benchmark

TPS

description: TPS for consensus

Legend details

TPS

TPS for consensus

avg(rate(axon_consensus_committed_tx_total[5m]))

Alert threshold:

TPS is zero

consensus_p90

description: Consensus time for P90

Legend details

time_usage(s)

Consensus time for P90

avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance)))
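The P90 query above relies on `histogram_quantile`, which estimates a quantile from cumulative `le` buckets by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea for classic histograms under simplifying assumptions; the bucket bounds and counts are hypothetical and stand in for the per-second bucket rates the query feeds in.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes `buckets` is sorted by upper bound, has at least one finite
    bucket, and ends with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Target rank lies in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical cumulative buckets for consensus duration, in seconds.
buckets = [(0.5, 10.0), (1.0, 40.0), (2.0, 80.0), (5.0, 100.0), (math.inf, 100.0)]
p90 = histogram_quantile(0.90, buckets)
print(f"P90 = {p90:.2f} s")
```

Because the estimate is interpolated, its precision is bounded by the bucket widths; a wide bucket around the 90th percentile gives a coarse P90.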

/

Not shown; used for alerting

avg(histogram_quantile(0.90, sum(rate(axon_consensus_duration_seconds_bucket[5m])) by (le, instance))) / avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))

Alert threshold:

More than three rounds of consensus

exec_p90

description: Consensus exec time for P90

Legend details

/

Consensus exec time for P90

avg(histogram_quantile(0.90, sum(rate(axon_consensus_time_cost_seconds_bucket{type="exec"}[5m])) by (le, instance)))

put_cf_each_block_time_usage

description: Average time per block for rocksdb running put_cf

Legend details

/

Average time per block for rocksdb running put_cf

avg (sum by (instance) (increase(axon_storage_put_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m]))

get_cf_each_block_time_usage

description: Average time per block for rocksdb running get_cf

Legend details

/

Average time per block for rocksdb running get_cf

avg (sum by (instance) (increase(axon_storage_get_cf_seconds[5m]))) / avg(increase(axon_consensus_height[5m]))

processed_tx_request

description: Rate of received transaction requests over the last 5 minutes (requests per second)

Legend details

Total

Total number of transaction requests

sum(rate(axon_api_request_result_total{type="send_transaction"}[5m]))

Success Total

Total number of successful transaction requests

sum(rate(axon_api_request_result_total{result="success",type="send_transaction"}[5m]))

instance

Per-instance rate of successfully processed transaction requests over the last 5 minutes (requests per second)

rate(axon_api_request_result_total{result="success", type="send_transaction"}[5m])

current_height

description: Chain current height

Legend details

Node current height

sort_desc(axon_consensus_height)

Liveness

description: Liveness

Legend details

Liveness

Growth in node height

increase(axon_consensus_height{job="axon_exporter"}[1m])

Alert threshold:

Loss of Liveness

/

Not shown; used for alerting

up{job="axon_exporter"} == 1

Alert threshold:

Loss of Liveness

synced_block

description: Number of blocks synchronized by nodes

Legend details

Number of blocks synchronized by nodes

axon_consensus_sync_block_total

network_message_arrival_rate

description: Estimate the network message arrival rate in the last five minutes

Legend details

/

Estimate the network message arrival rate in the last five minutes

(
  # broadcast_count * (instance_count - 1)
  sum(increase(axon_network_message_total{target="all", direction="sent"}[5m])) * (count(count by (instance) (axon_network_message_total)) - 1)
  # unicast_count
  + sum(increase(axon_network_message_total{target="single", direction="sent"}[5m]))
) 
/
# received_count
(sum(increase(axon_network_message_total{direction="received"}[5m])))
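The expression above compares the number of deliveries expected from sent messages with the number of messages actually received. Each broadcast is expected to reach every other instance, and each unicast reaches one. A minimal sketch of the same arithmetic as written in the query, using hypothetical counts:

```python
def expected_to_received_ratio(broadcast_sent, unicast_sent, received, instance_count):
    """(broadcast_sent * (instance_count - 1) + unicast_sent) / received."""
    if received == 0:
        return None  # nothing received in the window
    expected = broadcast_sent * (instance_count - 1) + unicast_sent
    return expected / received

# Hypothetical 5-minute increases on a 4-node network.
all_arrived = expected_to_received_ratio(100, 50, 350, 4)
some_lost = expected_to_received_ratio(100, 50, 340, 4)
print(f"all arrived: {all_arrived:.3f}, some lost: {some_lost:.3f}")
```

As the query is written, a value of 1.0 means every expected delivery was counted as received, and the value rises above 1.0 when fewer messages are received than expected.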

consensus_round_cost

description: Number of rounds needed to reach consensus

Legend details

Number of rounds needed to reach consensus

(axon_consensus_round > 0 )

Alert threshold:

More than three rounds of consensus

mempool_cached_tx

description: Number of transactions in the current mempool

Legend details

Number of transactions in the current mempool

axon_mempool_tx_count

Connected Peers(Gauge)

description: Number of currently connected peers

Legend details

Number of currently connected peers

axon_network_connected_peers

Connected Peers(Graph)

description: Number of currently connected peers

Legend details

Saved peers

Total number of peers

max(axon_network_saved_peer_count)

Connected Peers

Number of currently connected peers

axon_network_connected_peers

Consensus peers(gauge)

description: Number of consensus nodes

Legend details

Number of consensus nodes

axon_network_tagged_consensus_peers

Consensus peers(Graph)

description: Number of consensus nodes

Legend details

Consensus peers

Total number of consensus peers

max(axon_network_tagged_consensus_peers)

{{instance}}-Connected Consensus Peers (Minus itself)

Number of consensus nodes

axon_network_connected_consensus_peers

/

Not shown; used for alerting. Number of consensus peers that are tagged but not currently connected

(sum(axon_network_tagged_consensus_peers) by (instance) - 1) - sum(axon_network_connected_consensus_peers) by (instance)

Alert threshold:

Alert on loss of connection to a consensus node
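The alert expression above subtracts the connected consensus peers from the tagged consensus peers minus one, since a node does not connect to itself. The sketch below walks through that subtraction per instance; the instance names and values are hypothetical, and treating any positive result as an alert is an assumption based on the threshold description.

```python
def missing_consensus_peers(tagged, connected):
    """(tagged - 1) - connected, evaluated per instance."""
    return {instance: (tagged[instance] - 1) - connected[instance] for instance in tagged}

# Hypothetical 4-validator network in which node-c has lost one connection.
tagged = {"node-a": 4, "node-b": 4, "node-c": 4}
connected = {"node-a": 3, "node-b": 3, "node-c": 2}

missing = missing_consensus_peers(tagged, connected)
# Assumption: any instance with at least one missing consensus peer should alert.
alerting = sorted(name for name, count in missing.items() if count > 0)
print(missing, alerting)
```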

Saved peers

description: Number of peers saved by each node

Legend details

Number of peers saved by each node

axon_network_saved_peer_count

Unidentified Connections

description: Number of connections still in the handshake stage, pending chain ID verification

Legend details

Number of connections still in the handshake stage, pending chain ID verification

axon_network_unidentified_connections

Connecting Peers

description: Number of outbound connection attempts to other peers currently in progress

Legend details

Number of outbound connection attempts to other peers currently in progress

axon_network_outbound_connecting_peers

Disconnected count(To other peers)

description: Disconnected count

Legend details

Disconnected count

axon_network_ip_disconnected_count

Received messages in processing

description: Number of messages being processed

Legend details

Number of messages being processed

axon_network_received_message_in_processing_guage

Received messages in processing by ip

description: Number of messages being processed (based on IP of received messages)

Legend details

Number of messages being processed (based on IP of received messages)

axon_network_received_ip_message_in_processing_guage{instance=~"$node"}

Ping (ms)_p90

description: P90 of P2P ping latency

Legend details

P90 of P2P ping latency

avg(histogram_quantile(0.90, sum(rate(axon_network_ping_in_ms_bucket[5m])) by (le, instance)))

Peer give up warnings

description: Peer give up warnings

Legend details

Log labels

Peer give up warnings

{filename="/opt/axon.log"} |~ "WARN" |~ "give up"

axon-network

height and round

description: Height of consensus and rounds of consensus

Legend details

height

Height of consensus

axon_consensus_height{instance=~"$node"}

round

Rounds of consensus

(axon_consensus_round{instance=~"$node"} > 0 )

axon_network_message_size

description: Network transmission size statistics

Legend details

send-total-{{instance}}

Send statistics for each node

sum(url:axon_network_message_size:sum5m{direction="send"}) by (instance)

received-total-{{instance}}

Receive statistics for each node

sum(url:axon_network_message_size:sum5m{direction="received"}) by (instance)

send-total

Total send

sum(url:axon_network_message_size:sum5m{direction="send"})

received-total

Total received

sum(url:axon_network_message_size:sum5m{direction="received"})

network_send_by_url_size_and_count

description: Statistics on the number and size of messages sent over the network

Legend details

Size of each url send

sum(url:axon_network_message_size:sum5m{direction="send",instance=~"$node"}) by (url)

count_{{action}}

Count of each url send

sum(rate(axon_network_message_total{direction="sent"}[5m])) by (action)

network_received_by_url_size_and_count

description: Statistics on the number and size of messages received over the network

Legend details

Size of each url received

sum(url:axon_network_message_size:sum5m{direction="received",instance=~"$node"}) by (url)

count_{{action}}

Count of each url received

sum(rate(axon_network_message_total{direction="received"}[5m])) by (action)
