[feature] add Apache Hbase RegionServer monitoring (#1833)
Co-authored-by: zhangshenghang <shenghang.zhang@avrisdigital.com>
Co-authored-by: zhangshenghang <admin@hadoop.wiki>
Co-authored-by: tomsun28 <tomsun28@outlook.com>
4 people authored Apr 25, 2024
1 parent 2fa3b5a commit 4a3e273
Showing 5 changed files with 774 additions and 1 deletion.
96 changes: 96 additions & 0 deletions home/docs/help/hbase_regionserver.md
@@ -0,0 +1,96 @@
---
id: hbase_regionserver
title: Monitoring HBase RegionServer
sidebar_label: HBase RegionServer Monitoring
keywords: [Open-source monitoring system, Open-source database monitoring, RegionServer monitoring]
---
> Collect and monitor common performance metrics for HBase RegionServer.

**Protocol:** HTTP

## Pre-Monitoring Operations

Check the `hbase-site.xml` file for the value of the `hbase.regionserver.info.port` configuration item. This is the port the monitor connects to.
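
As a rough illustration of this lookup, the sketch below reads the port from the text of an `hbase-site.xml` file. The function name `read_info_port` and the sample XML are hypothetical; the fallback of 16030 is the default given in the parameter table, and the sketch assumes the usual `<property><name>…</name><value>…</value></property>` layout.

```python
import xml.etree.ElementTree as ET

DEFAULT_INFO_PORT = 16030  # default listed in the configuration parameter table
PORT_KEY = "hbase.regionserver.info.port"

def read_info_port(xml_text: str) -> int:
    """Return the info port found in hbase-site.xml content, or the default."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name == PORT_KEY and value:
            return int(value)
    # Property absent or empty: fall back to the documented default.
    return DEFAULT_INFO_PORT

# Hypothetical sample content, not taken from a real cluster.
SAMPLE = """<configuration>
  <property>
    <name>hbase.regionserver.info.port</name>
    <value>16031</value>
  </property>
</configuration>"""

print(read_info_port(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`.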

## Configuration Parameters


| Parameter Name | Parameter Description |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| Target Host | The IPv4 address, IPv6 address, or domain name of the monitored host. Note ⚠️ Do not include the protocol header (e.g., https://, http://). |
| Port | The port number of the HBase RegionServer, i.e., the value of the `hbase.regionserver.info.port` parameter. Default is 16030. |
| Task Name | A unique name that identifies this monitoring task. |
| Query Timeout | The timeout for queries to the HBase RegionServer, in milliseconds. Default is 3000 ms. |
| Collection Interval | The interval between periodic data collections, in seconds. The minimum interval is 30 seconds. |
| Probe Before Adding | Whether to probe the target for availability before adding the monitoring task. The task is added only if the probe succeeds. |
| Description Note | Additional notes that identify and describe this monitoring task. |
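
The constraints in the table above can be checked before a task is submitted. The sketch below is only an illustration: `MonitorParams` and `validate` are hypothetical names, not part of HertzBeat, and the port-range and positive-timeout checks are assumptions added for completeness.

```python
from dataclasses import dataclass

MIN_INTERVAL_S = 30  # minimum collection interval from the table above

@dataclass
class MonitorParams:
    host: str
    port: int = 16030        # documented default port
    timeout_ms: int = 3000   # documented default query timeout
    interval_s: int = 30

def validate(p: MonitorParams) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if "://" in p.host:
        errors.append("host must not include a protocol header")
    if not 1 <= p.port <= 65535:  # assumption: ordinary TCP port range
        errors.append("port must be between 1 and 65535")
    if p.timeout_ms <= 0:         # assumption: timeout must be positive
        errors.append("timeout must be positive")
    if p.interval_s < MIN_INTERVAL_S:
        errors.append("interval must be at least 30 seconds")
    return errors

print(validate(MonitorParams(host="http://10.0.0.5", interval_s=10)))
```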

### Collection Metrics

> All metric names are taken directly from the official HBase fields, so some names do not follow a consistent convention.

#### Metric Set: server


| Metric Name | Unit | Metric Description |
| --------------------------------- | ----- | ------------------------------------------------------------------------- |
| regionCount | None | Number of Regions |
| readRequestCount | None | Number of read requests since cluster restart |
| writeRequestCount | None | Number of write requests since cluster restart |
| averageRegionSize | MB | Average size of a Region |
| totalRequestCount | None | Total number of requests |
| ScanTime_num_ops | None | Total number of Scan requests |
| Append_num_ops | None | Total number of Append requests |
| Increment_num_ops | None | Total number of Increment requests |
| Get_num_ops | None | Total number of Get requests |
| Delete_num_ops | None | Total number of Delete requests |
| Put_num_ops | None | Total number of Put requests |
| ScanTime_mean | None | Average time of a Scan request |
| ScanTime_min | None | Minimum time of a Scan request |
| ScanTime_max | None | Maximum time of a Scan request |
| ScanSize_mean | bytes | Average size of a Scan request |
| ScanSize_min | None | Minimum size of a Scan request |
| ScanSize_max | None | Maximum size of a Scan request |
| slowPutCount | None | Number of slow Put operations |
| slowGetCount | None | Number of slow Get operations |
| slowAppendCount | None | Number of slow Append operations |
| slowIncrementCount | None | Number of slow Increment operations |
| slowDeleteCount | None | Number of slow Delete operations |
| blockCacheSize | None | Size of memory used by block cache |
| blockCacheCount | None | Number of blocks in Block Cache |
| blockCacheExpressHitPercent | None | Block cache hit ratio |
| memStoreSize | None | Size of Memstore |
| FlushTime_num_ops                 | None  | Number of MemStore flushes to disk performed by the RegionServer          |
| flushQueueLength | None | Length of Region Flush queue |
| flushedCellsSize | None | Size flushed to disk |
| storeFileCount | None | Number of Storefiles |
| storeCount | None | Number of Stores |
| storeFileSize | None | Size of Storefiles |
| compactionQueueLength | None | Length of Compaction queue |
| percentFilesLocal                 | None  | Percentage of HFiles stored on the local HDFS DataNode                    |
| percentFilesLocalSecondaryRegions | None  | Percentage of HFiles for secondary region replicas stored on the local HDFS DataNode |
| hlogFileCount | None | Number of WAL files |
| hlogFileSize | None | Size of WAL files |

#### Metric Set: IPC


| Metric Name | Unit | Metric Description |
| ------------------------- | ---- | -------------------------------------- |
| numActiveHandler          | None | Number of RPC handlers actively serving requests |
| NotServingRegionException | None | Number of NotServingRegionException errors thrown |
| RegionMovedException      | None | Number of RegionMovedException errors thrown |
| RegionTooBusyException    | None | Number of RegionTooBusyException errors thrown |

#### Metric Set: JVM


| Metric Name | Unit | Metric Description |
| -------------------- | ---- | --------------------------------- |
| MemNonHeapUsedM      | MB   | Non-heap memory currently used by the JVM |
| MemNonHeapCommittedM | MB   | Non-heap memory committed to the JVM |
| MemHeapUsedM         | MB   | Heap memory currently used by the JVM |
| MemHeapCommittedM    | MB   | Heap memory committed to the JVM |
| MemHeapMaxM          | MB   | Maximum heap memory |
| MemMaxM              | MB   | Maximum memory available to the JVM |
| GcCount              | None | Total number of garbage collections |
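
To see where these fields come from, the sketch below pulls a few of them out of a JMX-style JSON document. It rests on assumptions that should be verified against a real cluster: that the RegionServer info port serves JSON at `/jmx`, and that the three metric sets map to the bean names in `BEAN_NAMES`. The function names and the sample payload are hypothetical, and field names in real output may differ (for example, exception counters may carry a prefix).

```python
import json
from urllib.request import urlopen

# Assumed JMX bean names for the three metric sets; check your own /jmx output.
BEAN_NAMES = {
    "server": "Hadoop:service=HBase,name=RegionServer,sub=Server",
    "IPC": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
    "JVM": "Hadoop:service=HBase,name=JvmMetrics",
}

def fetch_jmx(host: str, port: int = 16030, timeout_ms: int = 3000) -> dict:
    """Fetch the JSON document from the info server (assumed path: /jmx)."""
    url = f"http://{host}:{port}/jmx"
    with urlopen(url, timeout=timeout_ms / 1000) as resp:
        return json.load(resp)

def pick_metrics(payload: dict, metric_set: str, fields: list) -> dict:
    """Return the requested fields from the bean backing one metric set."""
    wanted = BEAN_NAMES[metric_set]
    for bean in payload.get("beans", []):
        if bean.get("name") == wanted:
            # Fields missing from the bean are reported as None.
            return {field: bean.get(field) for field in fields}
    return {}

# Made-up sample payload so the parsing step can be tried without a cluster.
SAMPLE = json.loads("""
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "regionCount": 12, "storeFileCount": 40, "compactionQueueLength": 0},
  {"name": "Hadoop:service=HBase,name=JvmMetrics",
   "MemHeapUsedM": 512.5, "GcCount": 87}
]}
""")

print(pick_metrics(SAMPLE, "server", ["regionCount", "storeFileCount"]))
print(pick_metrics(SAMPLE, "JVM", ["MemHeapUsedM", "GcCount"]))
```

Against a live RegionServer, `pick_metrics(fetch_jmx("rs-host"), "server", ["regionCount"])` would be the equivalent call, where `rs-host` is a placeholder host name.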
@@ -1,7 +1,7 @@
---
id: hbase_master
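title: "Monitoring: Hbase Master Monitoring"
- sidebar_label: HbaseMaster Monitoring
+ sidebar_label: Apache Hbase Master
keywords: [Open-source monitoring system, Open-source database monitoring, HbaseMaster monitoring]
---
> Collect and monitor common performance metrics for Hbase Master.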
@@ -0,0 +1,97 @@
---
id: hbase_regionserver
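title: Monitoring Hbase RegionServer
sidebar_label: Apache Hbase RegionServer
keywords: [Open-source monitoring system, Open-source database monitoring, RegionServer monitoring]
---
> Collect and monitor common performance metrics for Hbase RegionServer.

**Protocol: HTTP**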

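## Pre-Monitoring Operations

Check the `hbase-site.xml` file for the value of the `hbase.regionserver.info.port` configuration item. This value is used for monitoring.

## Configuration Parameters


| Parameter Name | Parameter Description |
| ------------ |---------------------------------------------------------------------|
| Target Host | The IPv4 address, IPv6 address, or domain name of the monitored host. Note ⚠️ Do not include the protocol header (e.g., https://, http://). |
| Port | The port number of the HBase RegionServer, i.e., the value of the `hbase.regionserver.info.port` parameter. Default is 16030. |
| Task Name | The name that identifies this monitoring task. The name must be unique. |
| Query Timeout | The timeout for queries to the HBase RegionServer, in milliseconds. Default is 3000 ms. |
| Collection Interval | The interval between periodic data collections, in seconds. The minimum interval is 30 seconds. |
| Probe Before Adding | Whether to probe the target for availability before adding the monitoring task. The add or modify operation proceeds only if the probe succeeds. |
| Description Note | Additional notes that identify and describe this monitoring task. |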

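### Collection Metrics

> All metric names are taken directly from the official fields, so some names do not follow a consistent convention.

#### Metric Set: server


| Metric Name | Unit | Metric Description |
| -------------------- |-------|------------------------------------------|
| regionCount | None | Number of Regions |
| readRequestCount | None | Number of read requests since cluster restart |
| writeRequestCount | None | Number of write requests since cluster restart |
| averageRegionSize | MB | Average size of a Region |
| totalRequestCount | None | Total number of requests |
| ScanTime_num_ops | None | Total number of Scan requests |
| Append_num_ops | None | Number of Append requests |
| Increment_num_ops | None | Number of Increment requests |
| Get_num_ops | None | Number of Get requests |
| Delete_num_ops | None | Number of Delete requests |
| Put_num_ops | None | Number of Put requests |
| ScanTime_mean | None | Average time of a Scan request |
| ScanTime_min | None | Minimum time of a Scan request |
| ScanTime_max | None | Maximum time of a Scan request |
| ScanSize_mean | bytes | Average size of a Scan request |
| ScanSize_min | None | Minimum size of a Scan request |
| ScanSize_max | None | Maximum size of a Scan request |
| slowPutCount | None | Number of slow Put operations |
| slowGetCount | None | Number of slow Get operations |
| slowAppendCount | None | Number of slow Append operations |
| slowIncrementCount | None | Number of slow Increment operations |
| slowDeleteCount | None | Number of slow Delete operations |
| blockCacheSize | None | Size of memory used by the block cache |
| blockCacheCount | None | Number of blocks in the Block Cache |
| blockCacheExpressHitPercent | None | Read cache hit ratio |
| memStoreSize | None | Size of the Memstore |
| FlushTime_num_ops | None | Number of RegionServer writes to disk (Memstore flushes) |
| flushQueueLength | None | Length of the Region Flush queue |
| flushedCellsSize | None | Size flushed to disk |
| storeFileCount | None | Number of Storefiles |
| storeCount | None | Number of Stores |
| storeFileSize | None | Size of Storefiles |
| compactionQueueLength | None | Length of the Compaction queue |
| percentFilesLocal | None | Percentage of a Region's HFiles located on the local HDFS Data Node |
| percentFilesLocalSecondaryRegions | None | Percentage of Region replicas' HFiles located on the local HDFS Data Node |
| hlogFileCount | None | Number of WAL files |
| hlogFileSize | None | Size of WAL files |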

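#### Metric Set: IPC


| Metric Name | Unit | Metric Description |
| --------------------- | ------ | ------------------- |
| numActiveHandler | None | Number of RPC handlers actively serving requests |
| NotServingRegionException | None | Number of NotServingRegionException errors thrown |
| RegionMovedException | None | Number of RegionMovedException errors thrown |
| RegionTooBusyException | None | Number of RegionTooBusyException errors thrown |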

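#### Metric Set: JVM


| Metric Name | Unit | Metric Description |
| ----------------------- | ----- | ------------------------ |
| MemNonHeapUsedM | MB | Non-heap memory currently used by the JVM |
| MemNonHeapCommittedM | MB | Non-heap memory committed to the JVM |
| MemHeapUsedM | MB | Heap memory currently used by the JVM |
| MemHeapCommittedM | MB | Heap memory committed to the JVM |
| MemHeapMaxM | MB | Maximum heap memory |
| MemMaxM | MB | Maximum memory available to the JVM |
| GcCount | None | Total number of garbage collections |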

2 changes: 2 additions & 0 deletions home/sidebars.json
@@ -210,6 +210,8 @@
"help/doris_be",
"help/doris_fe",
"help/hadoop",
"help/hbase_master",
"help/hbase_regionserver",
"help/iotdb",
"help/hive",
"help/airflow",