[feature] add apache yarn monitor #1937

Merged
merged 16 commits into from
May 8, 2024
83 changes: 83 additions & 0 deletions home/docs/help/yarn.md
@@ -0,0 +1,83 @@
---
id: yarn
title: Monitoring Apache Yarn
sidebar_label: Apache Yarn
keywords: [Big Data Monitoring System, Apache Yarn Monitoring, ResourceManager Monitoring]
---

> HertzBeat collects and monitors metrics from Apache Yarn nodes.

**Protocol Used: HTTP**

## Pre-monitoring Actions

Obtain the HTTP monitoring port of Apache Yarn from the `yarn.resourcemanager.webapp.address` property in `yarn-site.xml`.
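
As a rough illustration of the step above, the following sketch reads the port out of a `yarn-site.xml` document. The helper name `webapp_port` and the sample host are hypothetical, and the sketch assumes the property value has the usual `host:port` form; it falls back to 8088, the default listed in the parameter table on this page.

```python
import xml.etree.ElementTree as ET

DEFAULT_WEBAPP_PORT = 8088  # default port listed in the parameter table on this page
PROPERTY_NAME = "yarn.resourcemanager.webapp.address"


def webapp_port(yarn_site_xml: str) -> int:
    """Return the port from yarn.resourcemanager.webapp.address, or the default."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == PROPERTY_NAME:
            value = prop.findtext("value", "").strip()
            # Assumed form is host:port; take whatever follows the last colon.
            _, sep, port = value.rpartition(":")
            if sep and port.isdigit():
                return int(port)
    return DEFAULT_WEBAPP_PORT


# Hypothetical configuration fragment, for illustration only.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>rm1.example.com:8188</value>
  </property>
</configuration>
"""

print(webapp_port(SAMPLE))
```

If the property is absent, the function returns the default port rather than raising, which matches how the Port parameter is described in the table that follows.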

## Configuration Parameters

| Parameter Name | Parameter Description |
| ---------------- |----------------------------------------------------|
| Target Host | IPv4 address, IPv6 address, or domain name of the monitored endpoint, without the protocol prefix (for example `http://`). |
| Port | Monitoring port of Apache Yarn. Default: 8088. |
| Query Timeout | Timeout for querying Apache Yarn, in milliseconds. Default: 6000. |
| Metrics Interval | Interval between metric collections, in seconds. Minimum: 30. |
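
The constraints in the table can be sketched as a small validation class. This is only an illustration: `YarnMonitorParams` and its field names are hypothetical, not HertzBeat's actual configuration model, and the `/jmx` path is an assumption, since this page does not state which endpoint the monitor queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YarnMonitorParams:
    """Hypothetical container for the four parameters in the table above."""
    host: str
    port: int = 8088        # documented default
    timeout_ms: int = 6000  # documented default, in milliseconds
    interval_s: int = 30    # documented minimum, in seconds

    def __post_init__(self) -> None:
        if "://" in self.host:
            raise ValueError("host must not include a protocol prefix")
        if not 0 < self.port < 65536:
            raise ValueError("port must be between 1 and 65535")
        if self.timeout_ms <= 0:
            raise ValueError("timeout must be positive")
        if self.interval_s < 30:
            raise ValueError("interval must be at least 30 seconds")

    def jmx_url(self) -> str:
        # Assumption: metrics are read from the ResourceManager's /jmx servlet.
        # IPv6 literals need brackets when placed in a URL.
        host = f"[{self.host}]" if ":" in self.host else self.host
        return f"http://{host}:{self.port}/jmx"


print(YarnMonitorParams("rm1.example.com").jmx_url())
```

Rejecting an interval below 30 seconds at construction time mirrors the minimum stated in the table, so a misconfiguration fails early instead of at collection time.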

### Collected Metrics

#### Metric Set: ClusterMetrics

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | -----------------------------------------|
| NumActiveNMs | | Number of currently active NodeManagers |
| NumDecommissionedNMs | | Number of currently decommissioned NodeManagers |
| NumDecommissioningNMs | | Number of nodes currently decommissioning |
| NumLostNMs | | Number of lost nodes in the cluster |
| NumUnhealthyNMs | | Number of unhealthy nodes in the cluster |
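
A minimal sketch of how these fields might be picked out of a JSON response follows. It assumes the metrics arrive as a Hadoop JMX-style payload with a top-level `beans` array and that the bean is named `Hadoop:service=ResourceManager,name=ClusterMetrics`; neither detail is stated on this page. The function names and sample values are illustrative only.

```python
import json

# Assumed MBean name; this page only names the metric set "ClusterMetrics".
CLUSTER_BEAN = "Hadoop:service=ResourceManager,name=ClusterMetrics"
NODE_FIELDS = (
    "NumActiveNMs",
    "NumDecommissionedNMs",
    "NumDecommissioningNMs",
    "NumLostNMs",
    "NumUnhealthyNMs",
)


def cluster_metrics(jmx_payload: str) -> dict:
    """Pick the ClusterMetrics fields out of a JMX-style JSON payload."""
    for bean in json.loads(jmx_payload).get("beans", []):
        if bean.get("name") == CLUSTER_BEAN:
            return {field: bean.get(field) for field in NODE_FIELDS}
    return {}


def degraded_nodes(metrics: dict) -> int:
    """Count NodeManagers reported as lost or unhealthy."""
    return (metrics.get("NumLostNMs") or 0) + (metrics.get("NumUnhealthyNMs") or 0)


# Illustrative payload with made-up values, not real cluster output.
SAMPLE_PAYLOAD = json.dumps({"beans": [{
    "name": CLUSTER_BEAN,
    "NumActiveNMs": 4,
    "NumDecommissionedNMs": 0,
    "NumDecommissioningNMs": 0,
    "NumLostNMs": 1,
    "NumUnhealthyNMs": 0,
}]})

metrics = cluster_metrics(SAMPLE_PAYLOAD)
print(metrics["NumActiveNMs"], degraded_nodes(metrics))
```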

#### Metric Set: JvmMetrics

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | -------------------------------------------- |
| MemNonHeapCommittedM | MB | Current committed size of non-heap memory in JVM |
| MemNonHeapMaxM | MB | Maximum available non-heap memory in JVM |
| MemNonHeapUsedM | MB | Current used size of non-heap memory in JVM |
| MemHeapCommittedM | MB | Current committed size of heap memory in JVM |
| MemHeapMaxM | MB | Maximum available heap memory in JVM |
| MemHeapUsedM | MB | Current used size of heap memory in JVM |
| GcTimeMillis | | Total JVM GC time, in milliseconds |
| GcCount | | Number of JVM GC occurrences |
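
The memory fields above are often combined into a usage percentage for alerting. The sketch below shows one way to do that; `usage_percent` is a hypothetical helper, and the guard for a non-positive maximum is an assumption that covers the case where the JVM reports no defined limit.

```python
from typing import Optional


def usage_percent(used_mb: float, max_mb: Optional[float]) -> Optional[float]:
    """Return used/max as a percentage, or None when no usable maximum is reported."""
    # Assumption: a missing or non-positive maximum means "no defined limit".
    if max_mb is None or max_mb <= 0:
        return None
    return round(100.0 * used_mb / max_mb, 1)


# Made-up values standing in for MemHeapUsedM and MemHeapMaxM.
print(usage_percent(512.0, 2048.0))
# Made-up values standing in for MemNonHeapUsedM and an undefined MemNonHeapMaxM.
print(usage_percent(100.0, -1.0))
```

Returning `None` instead of a number keeps an undefined limit from being mistaken for 0% or an absurdly large percentage.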

#### Metric Set: QueueMetrics

| Metric Name | Unit | Metric Description |
| --------------------------- | ---- | -------------------------------------------- |
| queue | | Queue name |
| AllocatedVCores | | Number of allocated virtual cores |
| ReservedVCores | | Number of reserved virtual cores |
| AvailableVCores | | Number of available (unallocated) virtual cores |
| PendingVCores | | Number of virtual cores waiting to be scheduled |
| AllocatedMB | MB | Allocated (used) memory |
| AvailableMB | MB | Available (unallocated) memory |
| PendingMB | MB | Memory waiting to be scheduled |
| ReservedMB | MB | Reserved memory |
| AllocatedContainers | | Number of allocated (used) containers |
| PendingContainers | | Number of containers waiting to be scheduled |
| ReservedContainers | | Number of reserved containers |
| AggregateContainersAllocated| | Cumulative number of containers allocated |
| AggregateContainersReleased| | Cumulative number of containers released |
| AppsCompleted | | Number of completed applications |
| AppsKilled | | Number of killed applications |
| AppsFailed | | Number of failed applications |
| AppsPending | | Number of pending applications |
| AppsRunning | | Number of currently running applications |
| AppsSubmitted | | Number of submitted applications |
| running_0 | | Number of jobs running for less than 60 minutes |
| running_60 | | Number of jobs running between 60 and 300 minutes |
| running_300 | | Number of jobs running between 300 and 1440 minutes |
| running_1440 | | Number of jobs running for more than 1440 minutes |
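
As an example of how the queue fields can be used together, the sketch below derives two figures per queue: memory utilization and the number of long-running jobs. The helper `queue_summary` and the sample row are hypothetical. Treating `AllocatedMB + AvailableMB` as the queue's total memory is a simplifying assumption that ignores reserved memory.

```python
from typing import Optional


def queue_summary(row: dict) -> dict:
    """Summarize one QueueMetrics row using the field names from the table above."""
    allocated = row["AllocatedMB"]
    # Simplifying assumption: total memory = allocated + available.
    total = allocated + row["AvailableMB"]
    used_pct: Optional[float] = round(100.0 * allocated / total, 1) if total > 0 else None
    return {
        "queue": row["queue"],
        "memory_used_pct": used_pct,
        # Jobs that have been running for 300 minutes or longer.
        "long_running_jobs": row["running_300"] + row["running_1440"],
    }


# Made-up sample row, for illustration only.
SAMPLE_ROW = {
    "queue": "root",
    "AllocatedMB": 6144,
    "AvailableMB": 2048,
    "running_0": 3,
    "running_60": 2,
    "running_300": 1,
    "running_1440": 0,
}

print(queue_summary(SAMPLE_ROW))
```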

#### Metric Set: runtime

| Metric Name | Unit | Metric Description |
| ----------------------- | ---- | --------------------------|
| StartTime | | Startup timestamp |
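
A startup timestamp is usually more useful once converted to a date or an uptime. The sketch below assumes `StartTime` is expressed in milliseconds since the Unix epoch, which this page does not state; both helper names are hypothetical.

```python
from datetime import datetime, timezone


def start_time_utc(start_time_ms: int) -> datetime:
    """Convert a StartTime value (assumed epoch milliseconds) to a UTC datetime."""
    return datetime.fromtimestamp(start_time_ms / 1000.0, tz=timezone.utc)


def uptime_seconds(start_time_ms: int, now_ms: int) -> float:
    """Seconds elapsed between startup and now_ms, never negative."""
    return max(0.0, (now_ms - start_time_ms) / 1000.0)


# Made-up values: a process started at 1 s and observed at 61 s after the epoch.
print(start_time_utc(1_000).isoformat())
print(uptime_seconds(1_000, 61_000))
```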
@@ -0,0 +1,83 @@
---
id: yarn
title: Monitoring Apache Yarn
sidebar_label: Apache Yarn
keywords: [Big Data Monitoring System, Apache Yarn Monitoring, ResourceManager Monitoring]
---

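> HertzBeat monitors the metrics of Apache Yarn nodes.

**Protocol Used: HTTP**

## Pre-monitoring Actions

Obtain the HTTP monitoring port of Apache Yarn. Value: `yarn.resourcemanager.webapp.address`

## Configuration Parameters

| Parameter Name   | Parameter Description                 |
| ---------------- |---------------------------------------|
| Target Host      | IPv4 address, IPv6 address, or domain name of the monitored peer, without the protocol prefix. |
| Port             | Monitoring port of Apache Yarn. Default: 8088. |
| Query Timeout    | Timeout for querying Apache Yarn, in milliseconds. Default: 6000. |
| Metrics Interval | Interval between metric collections, in seconds. Minimum: 30. |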

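### Collected Metrics

#### Metric Set: ClusterMetrics

| Metric Name          | Unit     | Metric Description                 |
| -------------------- | -------- | ---------------------------------- |
| NumActiveNMs         |          | Number of currently active NodeManagers |
| NumDecommissionedNMs |          | Number of currently decommissioned NodeManagers |
| NumDecommissioningNMs|          | Number of nodes in the cluster that are being decommissioned |
| NumLostNMs           |          | Number of lost nodes in the cluster |
| NumUnhealthyNMs      |          | Number of unhealthy nodes in the cluster |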

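#### Metric Set: JvmMetrics

| Metric Name          | Unit     | Metric Description                   |
| -------------------- | -------- | ------------------------------------ |
| MemNonHeapCommittedM | MB       | Non-heap memory currently committed by the JVM |
| MemNonHeapMaxM       | MB       | Maximum non-heap memory available to the JVM |
| MemNonHeapUsedM      | MB       | Non-heap memory currently used by the JVM |
| MemHeapCommittedM    | MB       | Heap memory currently committed by the JVM |
| MemHeapMaxM          | MB       | Maximum heap memory available to the JVM |
| MemHeapUsedM         | MB       | Heap memory currently used by the JVM |
| GcTimeMillis         |          | JVM GC time                          |
| GcCount              |          | Number of JVM GC occurrences         |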

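#### Metric Set: QueueMetrics

| Metric Name              | Unit     | Metric Description                   |
| ------------------------ | -------- | ------------------------------------ |
| queue                    |          | Queue name                           |
| AllocatedVCores          |          | Number of allocated virtual cores    |
| ReservedVCores           |          | Number of reserved cores             |
| AvailableVCores          |          | Number of available cores (not yet allocated) |
| PendingVCores            |          | Number of cores blocked on scheduling |
| AllocatedMB              | MB       | Allocated (used) memory              |
| AvailableMB              | MB       | Available memory (not yet allocated) |
| PendingMB                | MB       | Memory blocked on scheduling         |
| ReservedMB               | MB       | Reserved memory                      |
| AllocatedContainers      |          | Number of allocated (used) containers |
| PendingContainers        |          | Number of containers blocked on scheduling |
| ReservedContainers       |          | Number of reserved containers        |
| AggregateContainersAllocated |      | Cumulative number of containers allocated |
| AggregateContainersReleased |       | Cumulative number of containers released |
| AppsCompleted            |          | Number of completed applications     |
| AppsKilled               |          | Number of killed applications        |
| AppsFailed               |          | Number of failed applications        |
| AppsPending              |          | Number of pending applications       |
| AppsRunning              |          | Number of currently running applications |
| AppsSubmitted            |          | Number of submitted applications     |
| running_0                |          | Number of jobs running for less than 60 minutes |
| running_60               |          | Number of jobs running between 60 and 300 minutes |
| running_300              |          | Number of jobs running between 300 and 1440 minutes |
| running_1440             |          | Number of jobs running for more than 1440 minutes |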

#### Metric Set: runtime

| Metric Name          | Unit     | Metric Description           |
| -------------------- | -------- | ---------------------------- |
| StartTime            |          | Startup timestamp            |
1 change: 1 addition & 0 deletions home/sidebars.json
@@ -229,6 +229,7 @@
"help/doris_be",
"help/doris_fe",
"help/hadoop",
"help/yarn",
"help/hbase_master",
"help/hbase_regionserver",
"help/hdfs_namenode",