Containers marked lost after a network or machine failure cannot recover automatically #26

Closed
zmberg opened this issue May 17, 2019 · 2 comments
Assignees
Labels
confirmed issue is confirmed enhancement New feature or request inner issue comes from Tencent side planning issue is under planning

Comments

@zmberg
Contributor

zmberg commented May 17, 2019

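How Mesos currently works: mesos-master and mesos-slave keep a long-lived TCP connection, and the master uses heartbeats to determine whether mesos-slave is alive. When the network fails, mesos-slave exits, or the machine fails, this TCP connection breaks. mesos-master then marks the mesos-slave as lost and reports this to bcs-scheduler.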

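Once the network or mesos-slave recovers, mesos-master re-establishes the TCP connection to the slave and heartbeats resume.
For a slave that reconnects after being marked lost, Mesos shuts it down directly and kills all task containers running on it. This behaviour is unfriendly to workloads: if the cause was only a network failure, the right response is to resume managing the containers, not to kill them.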

@zmberg
Contributor Author

zmberg commented May 17, 2019

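Proposed solution:
If the scheduler includes FrameworkInfo_Capability_PARTITION_AWARE in its Capabilities field when it registers the Mesos framework, mesos-master no longer actively kills containers that were lost in this way and instead leaves the decision to the scheduler.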
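
A minimal sketch of declaring the capability at registration time. The types here are simplified local stand-ins, not the generated Mesos protobuf types and not the actual bcs-scheduler code; `FrameworkInfo`, `CapabilityType`, `WithCapability`, and `HasCapability` are hypothetical names used only for illustration.

```go
package main

// CapabilityType is a local stand-in for the capability enum that Mesos
// defines on FrameworkInfo; it is not the generated protobuf type.
type CapabilityType string

// PartitionAware mirrors the PARTITION_AWARE capability named above.
const PartitionAware CapabilityType = "PARTITION_AWARE"

// FrameworkInfo is a simplified, hypothetical registration message.
type FrameworkInfo struct {
	Name         string
	Capabilities []CapabilityType
}

// HasCapability reports whether info already declares c.
func HasCapability(info FrameworkInfo, c CapabilityType) bool {
	for _, existing := range info.Capabilities {
		if existing == c {
			return true
		}
	}
	return false
}

// WithCapability returns a copy of info that declares c exactly once.
// The caller's capability slice is never modified.
func WithCapability(info FrameworkInfo, c CapabilityType) FrameworkInfo {
	if HasCapability(info, c) {
		return info
	}
	caps := make([]CapabilityType, len(info.Capabilities), len(info.Capabilities)+1)
	copy(caps, info.Capabilities)
	info.Capabilities = append(caps, c)
	return info
}
```

A scheduler would call something like `WithCapability(FrameworkInfo{Name: "bcs-scheduler"}, PartitionAware)` before sending its registration request. Calling it twice leaves a single entry, so a re-registration path can reuse the same helper.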

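In addition, containers in this situation are no longer reported as lost. Instead, the state is split into UNREACHABLE, GONE, and OPERATOR (in Mesos these task states are named TASK_UNREACHABLE, TASK_GONE, and TASK_GONE_BY_OPERATOR), which indicate the different causes and symptoms of the loss.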

@zmberg zmberg self-assigned this May 17, 2019
@zmberg zmberg added this to the 1.13.x功能迭代 (1.13.x feature iteration) milestone May 17, 2019
@zmberg zmberg added the enhancement New feature or request label May 17, 2019
@DeveloperJim DeveloperJim added confirmed issue is confirmed inner issue comes from Tencent side planning issue is under planning labels May 29, 2019
@zmberg zmberg mentioned this issue May 29, 2019
@DeveloperJim
Collaborator

Merged; closing this issue.


2 participants