Koordinator support for YARN ecosystem and K8s co-location #1297
-
0524
Agenda for this meeting
-
0531
-
0607
Expect that next week the modules can be debugged individually and run end to end on bare metal.
-
0614
2. Node side
3. Environment
Plan for next week
-
0620
Optimization items:
-
0628
Eviction logic still to be tested @suozengzeng
To be confirmed:
-
0705
Node side:
Environment issues:
milestone 1.1
-
0712
Node side:
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. To be discussed.
milestone 1.1 (7-9)
-
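0719
Node side:
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: use reverse synchronization on the offline path, without intruding on the scheduler.
milestone 1.1: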
-
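0726
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: use reverse synchronization on the offline path, without intruding on the scheduler.
milestone 1.1 (July-August)
Node side:
The controller reverse-syncs the amount of resources requested by YARN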
-
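0808
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: use reverse synchronization on the offline path, without intruding on the scheduler.
milestone 1.1 (July-August)
The controller reverse-syncs the amount of resources requested by YARN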
-
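0816
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: use reverse synchronization on the offline path, without intruding on the scheduler.
milestone 1.1 (July-August)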
-
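0823
Environment issues: when EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: use reverse synchronization on the offline path, without intruding on the scheduler. (Under verification)
milestone 1.1 (July-August)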
-
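0906
Environment issues: the resource diff issue has been reproduced and is being investigated.
When EMR goes offline during the day, batch pods need to be brought up; when EMR comes online at night, batch pods need to be taken down. Conclusion: reverse synchronization on the offline path, without intruding on the scheduler. This is now live, and resource conflicts are resolved by eviction.
milestone 1.1 (July-August)
===============================================================
September plan: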
-
0913
Environment issues: the resource diff issue has been reproduced and is being investigated, keep doing
milestone 1.1 (July-August)
===============================================================
September-October plan:
Onboarding scale stays unchanged
-
0920
Environment issues: waiting for the resource diff issue to reproduce
milestone 1.1 (July-August)
===============================================================
September-October plan:
Onboarding scale stays unchanged
Resource operations issues (long-term)
-
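1011
Environment issues: waiting for the resource diff issue to reproduce
milestone 1.1 (July-August)
===============================================================
September-October plan:
Adapt the other node-level QoS strategies to YARN tasks: tiered memory reclamation (including batch pods) (the community koordlet version still needs to be adapted); feature breakdown pending
Resource operations issues (long-term)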
-
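1018
Environment issues:
September-October plan:
Adapt the other node-level QoS strategies to YARN tasks: tiered memory reclamation (including batch pods) (the community koordlet version still needs to be adapted); feature breakdown pending
Resource operations issues (long-term)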
-
@zwzhang0107 Sorry to bother, where is the source code for this discussion? I couldn't find it in the repository.
-
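1025
Environment issues: waiting for the resource diff issue to reproduce
September-October plan:
Adapt the other node-level QoS strategies to YARN tasks: tiered memory reclamation (including batch pods) (the community koordlet version still needs to be adapted); feature breakdown pending
Resource operations issues (long-term)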
-
1025
Bare-metal onboarding issue: YARN tasks hang
Environment issues: waiting for the resource diff issue to reproduce
September-October plan:
release-v0.1 @zwzhang0107
Resource operations issues (long-term)
-
1115
Environment issues: waiting for the resource diff issue to reproduce
September-October plan:
release-v0.1 @zwzhang0107
release-v0.2
Resource operations issues (long-term)
-
1129
Environment issues: waiting for the resource diff issue to reproduce
release-v0.1 has been released
release-v0.2
Resource operations issues (long-term)
-
1129
Environment issues: waiting for the resource diff issue to reproduce
release-v0.2
Resource operations issues (long-term)
-
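12/20
Memory bandwidth limiting. Solution: EMR gets an exclusive 3% of memory bandwidth (each CCD set to 60), and oversold offline workloads are evicted.
Environment issues: waiting for the resource diff issue to reproduce
release-v0.2
Resource operations issues (long-term)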
-
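January 3
cpu.shares on the hadoop-yarn cgroup directory is 1024; Copilot: SUM(all container request), done
Contribute the code to the community (todo)
Environment issues: waiting for the resource diff issue to reproduce. Now live, pending verification.
release-v0.2
Resource operations issues (long-term)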
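To illustrate the SUM(all container request) item in the January 3 notes, here is a minimal sketch that sums per-container CPU requests and converts the total into a cpu.shares value for the hadoop-yarn cgroup. It assumes requests are expressed in millicores and borrows the 1024-shares-per-core conversion that kubelet applies to pods; whether the Copilot uses the same conversion is an assumption, and both function names are hypothetical.

```python
from typing import Iterable

SHARES_PER_CPU = 1024      # cgroup v1 weight conventionally given to one full core
MILLI_CPU_PER_CPU = 1000
MIN_SHARES = 2             # lowest cpu.shares value accepted by cgroup v1


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a millicore amount into a cgroup v1 cpu.shares value."""
    if milli_cpu <= 0:
        return MIN_SHARES
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU)


def yarn_group_cpu_shares(container_requests_milli: Iterable[int]) -> int:
    """cpu.shares for the parent group, derived from the sum of all requests."""
    return milli_cpu_to_shares(sum(container_requests_milli))
```

With three containers requesting 500, 1500 and 2000 millicores, the group would get 4096 shares instead of the default 1024.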
-
Contact: 1196118994@qq.com
-
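Updating the node manager image may affect running services. To be safe, the node manager node should be decommissioned first, but the decommission operation and the pod update cannot currently be coordinated with each other. The Hadoop NodeManager is controlled by a Deployment today; if it were switched to an OpenKruise CloneSet, selective pod deletion could be used to coordinate with the decommission action, deleting the corresponding pod once decommissioning completes. If Koordinator implements a CRD (for example NodeManagerSet), it would presumably need a similar capability.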
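A rough sketch of the CloneSet idea in the comment above: once decommissioning has finished for a set of NodeManager pods, a patch can lower `replicas` and name those pods in `spec.scaleStrategy.podsToDelete`, the CloneSet field for selective pod deletion. The helper name and the pod names are hypothetical, and sending the patch to the API server is left out.

```python
import json
from typing import Dict, List


def build_scale_in_patch(current_replicas: int, decommissioned_pods: List[str]) -> Dict:
    """Build a merge-patch body that removes exactly the decommissioned pods."""
    if len(decommissioned_pods) > current_replicas:
        raise ValueError("cannot delete more pods than the CloneSet has replicas")
    return {
        "spec": {
            # Lower replicas so the deleted pods are not recreated.
            "replicas": current_replicas - len(decommissioned_pods),
            # Tell the controller which pods to remove when scaling in.
            "scaleStrategy": {"podsToDelete": list(decommissioned_pods)},
        }
    }


if __name__ == "__main__":
    # Hypothetical pod name; in practice it comes from the decommission step.
    print(json.dumps(build_scale_in_patch(3, ["yarn-nm-abc12"]), indent=2))
```

The point of the design is ordering: the patch is built only after the decommission step reports completion, so no running containers are on the pod when it is deleted.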
-
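Background
Hadoop YARN architecture
Koordinator co-location and overcommitment model
Co-location scenarios
General principles
Key design
Resource overcommitment and allocation
Node runtime management
Other cgroup management mechanisms
Node-level QoS guarantee strategies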
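Because offline tasks start and stop frequently, koord-yarn-copilot must handle the case where a task it is asked to kill no longer exists. For example, koordlet sends koord-yarn-copilot the amount of memory to reclaim together with a sorted task list.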
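One way to read this is that the copilot walks the sorted list until enough memory has been reclaimed and simply skips tasks that have already exited. The sketch below shows that loop; the `Task` type, the `TaskNotFoundError` exception and the `kill` callback are all hypothetical stand-ins, not the real koordlet or koord-yarn-copilot interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


class TaskNotFoundError(Exception):
    """Raised by the kill callback when the task has already exited."""


@dataclass
class Task:
    container_id: str
    memory_bytes: int


def reclaim_memory(bytes_to_reclaim: int,
                   sorted_tasks: Iterable[Task],
                   kill: Callable[[str], None]) -> int:
    """Kill tasks in the given order until the target is met.

    Tasks that no longer exist are skipped and do not count toward the
    reclaimed total. Returns the number of bytes actually reclaimed.
    """
    reclaimed = 0
    for task in sorted_tasks:
        if reclaimed >= bytes_to_reclaim:
            break
        try:
            kill(task.container_id)
        except TaskNotFoundError:
            continue  # task finished between sorting and killing
        reclaimed += task.memory_bytes
    return reclaimed
```

Sending the target amount along with the list, rather than a fixed set of victims, lets the copilot move on to the next candidate when one has disappeared.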
Consider setting memory.limit on the entire hadoop-yarn cgroup to bound the scope of OOM kills and reduce stability risk.
Option 2: write the node manager's process ID directly into the resctrl BE group. Whether containers inherit the group automatically still needs to be confirmed.
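As a minimal sketch of Option 2, the helper below writes one ID into the `tasks` file of a resctrl group. The group name `BE` and the default mount point `/sys/fs/resctrl` are assumptions, and the function name is hypothetical. It moves only the ID it is given, so it does not answer the inheritance question; a multi-threaded process may also need each thread ID written separately.

```python
from pathlib import Path


def add_pid_to_resctrl_group(pid: int,
                             group: str = "BE",
                             root: Path = Path("/sys/fs/resctrl")) -> Path:
    """Write a single task ID into <root>/<group>/tasks and return that path."""
    tasks_file = root / group / "tasks"
    if not tasks_file.exists():
        # Refuse to create a plain file if the group directory is missing.
        raise FileNotFoundError(f"resctrl group not found: {tasks_file}")
    # One ID per write, the same as `echo <pid> > tasks`.
    with open(tasks_file, "w") as f:
        f.write(f"{pid}\n")
    return tasks_file
```

On a real node this requires root and a mounted resctrl filesystem; passing a different `root` makes the helper testable against a temporary directory.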
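YARN cluster operations
Component management
Alternative approach
The YARN RM continues to be hosted in ECS mode, and the Koordinator tool suite must be deployed on the same node as the RM to handle operations that modify local configuration, such as node removal. Comparing the two approaches: hosting the RM as a Pod gives unified maintenance and easier management, with the YARN components fully managed by the K8s ecosystem, at the cost of larger changes to the existing operations pipeline. Hosting the RM on ECS requires fewer changes to the existing operations pipeline, but operations become scattered and maintenance complexity is high: part of the operation and configuration workflow moves into the YARN Spec while the rest stays in the original process, which makes management conflicts likely.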
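Routine operations
These typically cover the following scenarios:
Business-related operations management stays unchanged: the operations component is deployed in the same container as the YARN RM/NM and operates on the YARN RM/NM.
Operations related to node scale-out and scale-in follow the component management workflow and scale through the CRD defined by Koordinator.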
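Evolution plan
Phase 1: Koordinator hosts only the Node Manager, without the operations control pipeline
Phase 2: hosted deployment of the YARN RM is brought into Koordinator, with support for integrating the operations pipeline
Phase 0: minimum viable version, without automatic deployment of YARN components or operations actions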
Milestones
milestone 1.0 (feature development complete by June 10)
Feature breakdown
milestone 1.1 (fine-grained strategies, mid-August)
milestone 1.2 (NM management)
milestone 2.0
TBD
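Open topics for discussion
a. kilo: taken from the signal flag code used at sea. When two ships meet, hoisting kilo means "I wish to communicate with you." The name expresses the hope that the K8s and YARN ecosystems can communicate promptly and run stably in co-location scenarios.
b. conductor: as in an orchestra conductor
a. pkg/kilo-controller: the controller that synchronizes resources between Koordinator and YARN
b. pkg/yarn-operator: manages the YARN components
c. pkg/koord-copilot-yarn: handles node-side interaction with the YARN NM, including task information management and eviction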