[feature] add apache yarn monitor (#1937)
Co-authored-by: zhangshenghang <shenghang.zhang@avrisdigital.com>
Co-authored-by: zhangshenghang <admin@hadoop.wiki>
Co-authored-by: crossoverJie <crossoverJie@gmail.com>
Co-authored-by: yqxxgh <42080876+yqxxgh@users.noreply.github.com>
Co-authored-by: Ceilzcx <48920254+Ceilzcx@users.noreply.github.com>
Co-authored-by: aias00 <rokkki@163.com>
Co-authored-by: tomsun28 <tomsun28@outlook.com>
Co-authored-by: Logic <zqr10159@dromara.org>
9 people authored May 8, 2024
1 parent b2acd53 commit afec4bf
Showing 4 changed files with 636 additions and 0 deletions.
83 changes: 83 additions & 0 deletions home/docs/help/yarn.md
@@ -0,0 +1,83 @@
---
id: yarn
title: Monitoring Apache Yarn
sidebar_label: Apache Yarn
keywords: [Big Data Monitoring System, Apache Yarn Monitoring, ResourceManager Monitoring]
---

> HertzBeat collects and monitors Apache Yarn metrics.

**Protocol Used: HTTP**

## Pre-monitoring Actions

Find the HTTP port of the Apache Yarn ResourceManager web interface. It is the port part of the `yarn.resourcemanager.webapp.address` configuration value.
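
As a rough sketch of this step, the port can be read from the Yarn configuration file. The helper name `webapp_port` and the sample XML below are illustrative assumptions, not part of HertzBeat; the fallback value follows the default port of 8088 stated in this document.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 8088  # default Apache Yarn monitoring port per this document
KEY = "yarn.resourcemanager.webapp.address"


def webapp_port(yarn_site_xml: str) -> int:
    """Return the port of yarn.resourcemanager.webapp.address, or the default."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == KEY:
            value = prop.findtext("value", "").strip()
            # The value has the form host:port; keep what follows the last colon.
            _, sep, port = value.rpartition(":")
            if sep and port.isdigit():
                return int(port)
    return DEFAULT_PORT


# Hypothetical configuration content, for illustration only.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rm-host.example.com:8188</value>
  </property>
</configuration>
"""

print(webapp_port(SAMPLE))
```

If the property is absent, the sketch falls back to the default port rather than failing.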

## Configuration Parameters

| Parameter Name | Parameter Description |
| ---------------- |----------------------------------------------------|
| Target Host | IPv4 address, IPv6 address, or domain name of the monitored endpoint, without a protocol prefix (for example, no `http://`). |
| Port | HTTP monitoring port of Apache Yarn. Default: 8088. |
| Query Timeout | Timeout for querying Apache Yarn, in milliseconds. Default: 6000. |
| Metrics Interval | Interval between metric collections, in seconds. Minimum: 30. |
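
The following sketch shows how these parameters could combine into a request URL. It is an illustration only: the function name `build_jmx_url` is hypothetical, and the `/jmx` path is an assumption about the endpoint being queried, which this document does not state.

```python
from urllib.parse import urlunsplit


def build_jmx_url(host: str, port: int = 8088, path: str = "/jmx") -> str:
    """Build an HTTP URL from Target Host and Port (path is an assumption)."""
    if "://" in host:
        raise ValueError("Target Host must not include a protocol prefix")
    if not 1 <= port <= 65535:
        raise ValueError("Port must be between 1 and 65535")
    # A bare IPv6 literal must be wrapped in brackets inside a URL.
    # Target Host is expected to carry no port, so any colon means IPv6.
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


# Query Timeout is configured in milliseconds; most HTTP clients expect seconds.
timeout_seconds = 6000 / 1000

print(build_jmx_url("10.0.0.5"))
print(build_jmx_url("::1", 8188))
```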

### Collected Metrics

#### Metric Set: ClusterMetrics

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | -----------------------------------------|
| NumActiveNMs | | Number of currently active NodeManagers |
| NumDecommissionedNMs    |      | Number of decommissioned NodeManagers |
| NumDecommissioningNMs   |      | Number of NodeManagers currently being decommissioned |
| NumLostNMs              |      | Number of lost NodeManagers in the cluster |
| NumUnhealthyNMs         |      | Number of unhealthy NodeManagers in the cluster |
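
A minimal sketch of picking these fields out of a response, assuming the endpoint returns Hadoop-style JMX JSON (a top-level `beans` array) and that the relevant bean name ends in `name=ClusterMetrics`. Both the payload shape and the sample values are assumptions made for illustration.

```python
import json

CLUSTER_FIELDS = (
    "NumActiveNMs",
    "NumDecommissionedNMs",
    "NumDecommissioningNMs",
    "NumLostNMs",
    "NumUnhealthyNMs",
)


def cluster_metrics(payload: str) -> dict:
    """Extract the ClusterMetrics fields from a JMX-style JSON payload."""
    for bean in json.loads(payload).get("beans", []):
        if bean.get("name", "").endswith("name=ClusterMetrics"):
            return {field: bean.get(field) for field in CLUSTER_FIELDS}
    return {}


# Hypothetical payload; real responses contain many more beans and fields.
SAMPLE = json.dumps({
    "beans": [
        {"name": "java.lang:type=Runtime", "StartTime": 1715126400000},
        {
            "name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
            "NumActiveNMs": 3,
            "NumDecommissionedNMs": 0,
            "NumDecommissioningNMs": 0,
            "NumLostNMs": 1,
            "NumUnhealthyNMs": 0,
        },
    ]
})

metrics = cluster_metrics(SAMPLE)
# Lost or unhealthy NodeManagers are a natural alert condition.
needs_attention = (metrics["NumLostNMs"] + metrics["NumUnhealthyNMs"]) > 0
print(metrics, needs_attention)
```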

#### Metric Set: JvmMetrics

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | -------------------------------------------- |
| MemNonHeapCommittedM    | MB   | Non-heap memory currently committed by the JVM |
| MemNonHeapMaxM          | MB   | Maximum non-heap memory available to the JVM |
| MemNonHeapUsedM         | MB   | Non-heap memory currently used by the JVM |
| MemHeapCommittedM       | MB   | Heap memory currently committed by the JVM |
| MemHeapMaxM             | MB   | Maximum heap memory available to the JVM |
| MemHeapUsedM            | MB   | Heap memory currently used by the JVM |
| GcTimeMillis            | ms   | Total JVM GC time |
| GcCount                 |      | Number of JVM GC occurrences |
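
These raw values are often more useful as ratios. The sketch below derives a heap usage percentage and an average GC duration; the function names are hypothetical, and treating a non-positive maximum as "no limit reported" is an assumption of this example.

```python
from typing import Optional


def usage_percent(used_mb: float, max_mb: float) -> Optional[float]:
    """Return used/max as a percentage, or None when no usable maximum exists."""
    if max_mb <= 0:
        return None
    return round(100.0 * used_mb / max_mb, 2)


def avg_gc_millis(gc_time_millis: float, gc_count: int) -> Optional[float]:
    """Return the mean duration of one GC in milliseconds, or None if no GC ran."""
    if gc_count <= 0:
        return None
    return gc_time_millis / gc_count


# Illustrative numbers: MemHeapUsedM=512, MemHeapMaxM=2048,
# GcTimeMillis=1200, GcCount=40.
print(usage_percent(512.0, 2048.0))
print(avg_gc_millis(1200, 40))
```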

#### Metric Set: QueueMetrics

| Metric Name | Unit | Metric Description |
| --------------------------- | ---- | -------------------------------------------- |
| queue | | Queue name |
| AllocatedVCores             |      | Number of allocated virtual cores |
| ReservedVCores              |      | Number of reserved virtual cores |
| AvailableVCores             |      | Number of available (unallocated) virtual cores |
| PendingVCores               |      | Number of virtual cores pending scheduling |
| AllocatedMB                 | MB   | Allocated (in-use) memory |
| AvailableMB                 | MB   | Available (unallocated) memory |
| PendingMB                   | MB   | Memory pending scheduling |
| ReservedMB                  | MB   | Reserved memory |
| AllocatedContainers         |      | Number of allocated (in-use) containers |
| PendingContainers           |      | Number of containers pending scheduling |
| ReservedContainers          |      | Number of reserved containers |
| AggregateContainersAllocated|      | Cumulative number of containers allocated |
| AggregateContainersReleased |      | Cumulative number of containers released |
| AppsCompleted | | Number of completed applications |
| AppsKilled | | Number of killed applications |
| AppsFailed | | Number of failed applications |
| AppsPending | | Number of pending applications |
| AppsRunning | | Number of currently running applications |
| AppsSubmitted | | Number of submitted applications |
| running_0 | | Number of jobs running for less than 60 minutes |
| running_60 | | Number of jobs running between 60 and 300 minutes |
| running_300 | | Number of jobs running between 300 and 1440 minutes |
| running_1440 | | Number of jobs running for more than 1440 minutes |
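
The four `running_*` metrics partition running jobs by elapsed time. The sketch below shows that partition; the function name is hypothetical, and treating each lower bound as inclusive (so exactly 60 minutes falls into `running_60`) is an assumption, since the table does not say which side the boundaries belong to.

```python
from collections import Counter


def running_bucket(elapsed_minutes: float) -> str:
    """Map a job's elapsed running time in minutes to its running_* metric."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed time cannot be negative")
    if elapsed_minutes < 60:
        return "running_0"
    if elapsed_minutes < 300:
        return "running_60"
    if elapsed_minutes < 1440:
        return "running_300"
    return "running_1440"


# Tally a hypothetical set of running jobs by bucket.
elapsed = [5, 59, 60, 299, 300, 1439, 1440, 5000]
counts = Counter(running_bucket(m) for m in elapsed)
print(dict(counts))
```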

#### Metric Set: runtime

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | --------------------------|
| StartTime | | Startup timestamp |
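
A short sketch of turning `StartTime` into a readable date and an uptime. It assumes the value is milliseconds since the Unix epoch, as the JVM runtime reports it; the document itself does not state the unit, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def start_time_utc(start_time_ms: int) -> datetime:
    """Convert a millisecond epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(start_time_ms / 1000, tz=timezone.utc)


def uptime_seconds(start_time_ms: int, now_ms: int) -> float:
    """Return seconds elapsed since startup, never negative."""
    return max(0.0, (now_ms - start_time_ms) / 1000)


# Illustrative values only.
started = 1715126400000
print(start_time_utc(started).isoformat())
print(uptime_seconds(started, started + 90_000))
```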
@@ -0,0 +1,83 @@
---
id: yarn
title: Monitoring Apache Yarn
sidebar_label: Apache Yarn
keywords: [Big Data Monitoring System, Apache Yarn Monitoring, ResourceManager Monitoring]
---

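> HertzBeat collects and monitors Apache Yarn metrics.

**Protocol Used: HTTP**

## Pre-monitoring Actions

Find the HTTP monitoring port of Apache Yarn. Value: `yarn.resourcemanager.webapp.address`

## Configuration Parameters

| Parameter Name   | Parameter Description                 |
| ---------------- |---------------------------------------|
| Target Host      | IPv4 address, IPv6 address, or domain name of the monitored endpoint, without a protocol prefix. |
| Port             | Monitoring port of Apache Yarn. Default: 8088. |
| Query Timeout    | Timeout for querying Apache Yarn, in milliseconds. Default: 6000. |
| Metrics Interval | Interval between metric collections, in seconds. Minimum: 30. |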

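### Collected Metrics

#### Metric Set: ClusterMetrics

| Metric Name          | Unit     | Metric Description                 |
| -------------------- | -------- | ---------------------------------- |
| NumActiveNMs         |          | Number of currently active NodeManagers |
| NumDecommissionedNMs |          | Number of decommissioned NodeManagers |
| NumDecommissioningNMs|          | Number of nodes in the cluster currently being decommissioned |
| NumLostNMs           |          | Number of lost nodes in the cluster |
| NumUnhealthyNMs      |          | Number of unhealthy nodes in the cluster |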

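#### Metric Set: JvmMetrics

| Metric Name          | Unit     | Metric Description                   |
| -------------------- | -------- | ------------------------------------ |
| MemNonHeapCommittedM | MB       | Non-heap memory currently committed by the JVM |
| MemNonHeapMaxM       | MB       | Maximum non-heap memory available to the JVM |
| MemNonHeapUsedM      | MB       | Non-heap memory currently used by the JVM |
| MemHeapCommittedM    | MB       | Heap memory currently committed by the JVM |
| MemHeapMaxM          | MB       | Maximum heap memory available to the JVM |
| MemHeapUsedM         | MB       | Heap memory currently used by the JVM |
| GcTimeMillis         |          | JVM GC time                          |
| GcCount              |          | Number of JVM GC occurrences         |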

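#### Metric Set: QueueMetrics

| Metric Name              | Unit     | Metric Description                   |
| ------------------------ | -------- | ------------------------------------ |
| queue                    |          | Queue name                           |
| AllocatedVCores          |          | Number of allocated virtual cores    |
| ReservedVCores           |          | Number of reserved cores             |
| AvailableVCores          |          | Number of available (unallocated) cores |
| PendingVCores            |          | Number of cores pending scheduling   |
| AllocatedMB              | MB       | Allocated (in-use) memory            |
| AvailableMB              | MB       | Available (unallocated) memory       |
| PendingMB                | MB       | Memory pending scheduling            |
| ReservedMB               | MB       | Reserved memory                      |
| AllocatedContainers      |          | Number of allocated (in-use) containers |
| PendingContainers        |          | Number of containers pending scheduling |
| ReservedContainers       |          | Number of reserved containers        |
| AggregateContainersAllocated |      | Cumulative number of containers allocated |
| AggregateContainersReleased  |      | Cumulative number of containers released |
| AppsCompleted            |          | Number of completed applications     |
| AppsKilled               |          | Number of killed applications        |
| AppsFailed               |          | Number of failed applications        |
| AppsPending              |          | Number of pending applications       |
| AppsRunning              |          | Number of currently running applications |
| AppsSubmitted            |          | Number of submitted applications     |
| running_0                |          | Number of jobs running for less than 60 minutes |
| running_60               |          | Number of jobs running between 60 and 300 minutes |
| running_300              |          | Number of jobs running between 300 and 1440 minutes |
| running_1440             |          | Number of jobs running for more than 1440 minutes |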

#### Metric Set: runtime

| Metric Name          | Unit     | Metric Description           |
| -------------------- | -------- | ---------------------------- |
| StartTime            |          | Startup timestamp            |
1 change: 1 addition & 0 deletions home/sidebars.json
@@ -229,6 +229,7 @@
"help/doris_be",
"help/doris_fe",
"help/hadoop",
"help/yarn",
"help/hbase_master",
"help/hbase_regionserver",
"help/hdfs_namenode",