Containers marked lost after a network or machine failure cannot recover automatically #26

zmberg · 2019-05-17T02:17:12Z

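Current Mesos mechanism: mesos-master and mesos-slave keep a long-lived TCP connection, and the master uses heartbeats to determine whether the mesos-slave is alive. On a network failure, a mesos-slave exit, or a machine failure, this TCP connection is broken. mesos-master then considers the mesos-slave lost and reports this to bcs-scheduler.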

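Once the network or the mesos-slave recovers, mesos-master re-establishes the TCP connection to the slave and resumes heartbeating.
For a slave that reconnects after being marked lost, however, Mesos shuts it down directly and kills all task containers on it. This is unfriendly to the workloads running there: if the cause was only a network failure, management of the containers should be restored rather than the containers killed.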

zmberg · 2019-05-17T02:18:55Z

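Proposed solution:
When the scheduler registers as a Mesos framework, if it sets FrameworkInfo_Capability_PARTITION_AWARE in its Capabilities field, mesos-master no longer proactively kills containers that were lost in this way and instead leaves them to the scheduler to handle.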
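As a rough illustration of the registration step described above, the sketch below adds the PARTITION_AWARE capability to a framework description before it would be sent to the master. The `FrameworkInfo`, `Capability`, and `CapabilityType` types and the `EnablePartitionAware` helper are hypothetical local stand-ins, not the generated Mesos protobuf types or bcs-scheduler code; a real scheduler would set `FrameworkInfo_Capability_PARTITION_AWARE` on the protobuf message from its Mesos bindings.

```go
package main

import "fmt"

// CapabilityType is a local stand-in for the enum that Mesos protobuf
// bindings generate. The numeric values here are illustrative only.
type CapabilityType int

const (
	CapabilityUnknown CapabilityType = iota
	// Stand-in for FrameworkInfo_Capability_PARTITION_AWARE named in the issue.
	CapabilityPartitionAware
)

// Capability and FrameworkInfo are simplified stand-ins for the
// corresponding protobuf messages (assumption, not the real types).
type Capability struct {
	Type CapabilityType
}

type FrameworkInfo struct {
	Name         string
	Capabilities []Capability
}

// HasCapability reports whether the framework already declares capability c.
func (f *FrameworkInfo) HasCapability(c CapabilityType) bool {
	for _, have := range f.Capabilities {
		if have.Type == c {
			return true
		}
	}
	return false
}

// EnablePartitionAware adds PARTITION_AWARE at most once, so calling it
// again (for example on re-registration) does not duplicate the entry.
func EnablePartitionAware(f *FrameworkInfo) {
	if !f.HasCapability(CapabilityPartitionAware) {
		f.Capabilities = append(f.Capabilities, Capability{Type: CapabilityPartitionAware})
	}
}

func main() {
	info := &FrameworkInfo{Name: "bcs-scheduler"}
	EnablePartitionAware(info)
	EnablePartitionAware(info) // second call is a no-op
	fmt.Println(len(info.Capabilities), info.HasCapability(CapabilityPartitionAware))
}
```

Making the helper idempotent is a design choice for this sketch: the capability list stays stable however many times the registration path runs.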

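Containers in this situation are also no longer reported as lost. Instead, the state is split into three more specific ones, UNREACHABLE, GONE, and OPERATOR (TASK_UNREACHABLE, TASK_GONE, and TASK_GONE_BY_OPERATOR in Mesos), which indicate the different causes and symptoms behind the loss.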
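One way a scheduler might react to the finer-grained states is sketched below. The state names follow the Mesos task states TASK_UNREACHABLE, TASK_GONE, and TASK_GONE_BY_OPERATOR, which is how the three states mentioned above are interpreted here. The `Action` type, the `Decide` function, and the policy itself (wait on unreachable, reschedule otherwise) are assumptions made for illustration, not the behavior bcs-scheduler actually implements.

```go
package main

import "fmt"

// TaskState is a local stand-in for the Mesos task state enum.
type TaskState string

const (
	TaskRunning        TaskState = "TASK_RUNNING"
	TaskLost           TaskState = "TASK_LOST"
	TaskUnreachable    TaskState = "TASK_UNREACHABLE"
	TaskGone           TaskState = "TASK_GONE"
	TaskGoneByOperator TaskState = "TASK_GONE_BY_OPERATOR"
)

// Action is a hypothetical scheduler decision for a status update.
type Action string

const (
	ActionNone       Action = "none"
	ActionWait       Action = "wait-for-agent"
	ActionReschedule Action = "reschedule"
)

// Decide maps a reported task state to an action under an assumed policy:
// an unreachable task may come back once the partition heals, so it is
// kept; tasks reported gone or lost are replaced.
func Decide(s TaskState) Action {
	switch s {
	case TaskUnreachable:
		return ActionWait
	case TaskGone, TaskGoneByOperator, TaskLost:
		return ActionReschedule
	default:
		return ActionNone
	}
}

func main() {
	states := []TaskState{TaskRunning, TaskUnreachable, TaskGone, TaskGoneByOperator}
	for _, s := range states {
		fmt.Printf("%s -> %s\n", s, Decide(s))
	}
}
```

A production policy would likely also bound how long an unreachable task is kept before a replacement is launched; that timeout is left out of this sketch.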


DeveloperJim · 2019-07-01T04:06:35Z

Merged; closing the issue.


zmberg self-assigned this May 17, 2019

zmberg added this to the "1.13.x feature iteration" milestone May 17, 2019

zmberg added the enhancement (New feature or request) label May 17, 2019

zmberg pushed a commit that referenced this issue May 17, 2019

feature: when registering the framework, set Capabilities to PARTITION_AWARE so that tasks in the lost state are handled correctly; …

db5f57c

…issue #26

DeveloperJim added the confirmed (issue is confirmed), inner (issue comes from Tencent side), and planning (issue is under planning) labels May 29, 2019

zmberg mentioned this issue May 29, 2019

Dev berg #31

Merged

tencent-adm unassigned zmberg Jun 4, 2019

DeveloperJim assigned zmberg Jul 1, 2019

DeveloperJim closed this as completed Jul 1, 2019

DeveloperJim mentioned this issue Jul 12, 2019

Release 1.13.3 #94

Merged

DeveloperJim pushed a commit that referenced this issue Nov 4, 2019

feature: when registering the framework, set Capabilities to PARTITION_AWARE so that tasks in the lost state are handled correctly; …

c2a0f03

…issue #26

DeveloperJim pushed a commit that referenced this issue Feb 4, 2021

Merge pull request #26 from Tencent/master

eb0af44

Merge from main repo

ifooth pushed a commit to ifooth/bk-bcs that referenced this issue Apr 10, 2024

Add documentation for traditional deployment (TencentBlueKing#26)

e70ee28
