mesos executor ids now consist of the slave id and a counter to fix duplicate id problems #1358
Conversation
Can one of the admins verify this patch?
Hey Dario - do you mind describing a bit more the problem this fixes (ideally create a JIRA for it) and what the symptoms are?
Hi Patrick, the problem is described in this mailing list entry […]. If I understand the documentation on run modes and the code correctly, in fine grained mode it starts a separate instance of […]. On this line in the […]
Created a JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2445
Jenkins, test this please
QA tests have started for PR 1358. This patch merges cleanly.
```
@@ -250,7 +252,7 @@ private[spark] class MesosSchedulerBackend(
       MesosTaskInfo.newBuilder()
         .setTaskId(taskId)
         .setSlaveId(SlaveID.newBuilder().setValue(slaveId).build())
-        .setExecutor(createExecutorInfo(slaveId))
+        .setExecutor(createExecutorInfo(nextExecutorId(slaveId)))
```
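For context, the approach in the diff above can be sketched roughly as follows. This is an illustrative sketch only, not the actual Spark code: `nextExecutorId` and the `ExecutorIds` object are hypothetical names, assuming a per-slave counter keyed by the Mesos slave id.

```scala
import scala.collection.mutable

// Hypothetical sketch of the PR's idea: make each executor id unique by
// combining the slave id with a monotonically increasing counter.
object ExecutorIds {
  // Per-slave counter; missing keys default to 0.
  private val counters = mutable.Map[String, Int]().withDefaultValue(0)

  def nextExecutorId(slaveId: String): String = synchronized {
    val n = counters(slaveId)
    counters(slaveId) = n + 1
    s"$slaveId/$n"
  }
}
```

As the review comments below point out, generating a fresh id on every `createExecutorInfo` call would defeat Mesos executor reuse, which is exactly the objection raised next.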
Won't this change keep launching a new executor for each task? We want to reuse our Mesos executors.
So I don't quite understand, how can multiple executors be launched for the same Spark application on the same node right now? I thought we always reuse our executor across tasks.
QA results for PR 1358:
@mateiz: You are right. I don't see how an executor could be started more than once per slave, but it seems to happen sometimes (see the mailing list entry). I will close this PR and try to further investigate this. Thanks!
Sure, if you find it, let me know.
We've run into this issue a handful of times including once today - is it possible the bug is in Mesos? |
I have encountered this issue. Here are 2 lines from my Driver log file:
2014-09-10 18:50:44,510 ERROR [spark-akka.actor.default-dispatcher-46] TaskSchedulerImpl - Lost executor 201408311047-3690990090-5050-30951-12 on spark106.us.taboolasyndication.com: remote Akka client disassociated
2014-09-10 18:51:46,062 ERROR [spark-akka.actor.default-dispatcher-15] BlockManagerMasterActor - Got two different block manager registrations on 201408311047-3690990090-5050-30951-12
Looks like the Driver is disassociated from the Spark worker.
Yep, also hitting this same problem. We're running Spark 1.0.2 and Mesos 0.20.0. From a quick analysis, it looks like a bug in Spark. |
It seems that this is a symptom of the following issue: |
I see, so maybe the problem is that an executor dies, and another is launched on the same Mesos machine with the same executor ID, which then breaks assumptions elsewhere in the code. In that case, our executor ID would need to be something like (Mesos executor ID) + (our attempt # on this executor). But you'd need to look throughout the MesosScheduler code and make sure this works -- in particular we have to send back the right ID when we launch tasks. |
BTW the delta from the original pull request would be that we only increment our counter when the old executor fails. If you want to implement that, please create a JIRA for it and send a new PR. |
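The scheme suggested above, an id of the form (Mesos executor ID) + (attempt # on this executor), where the counter is bumped only when the old executor fails, could be sketched like this. All names here are illustrative assumptions, not actual Spark or Mesos APIs:

```scala
import scala.collection.mutable

// Hypothetical sketch of the suggested delta: reuse the same executor id
// across tasks, and only increment the attempt counter when the executor
// on that Mesos id is reported lost or failed.
object ExecutorAttempts {
  // Attempt number per Mesos executor id; missing keys default to 0.
  private val attempts = mutable.Map[String, Int]().withDefaultValue(0)

  // Stable id while the executor is alive, so Mesos executors get reused.
  def executorId(mesosExecutorId: String): String =
    s"$mesosExecutorId-${attempts(mesosExecutorId)}"

  // Call only when the executor fails; the next id becomes unique again,
  // avoiding the duplicate BlockManager registration described above.
  def onExecutorFailed(mesosExecutorId: String): Unit =
    attempts(mesosExecutorId) += 1
}
```

The key design point is that the id is stable across tasks (unlike the original patch), so the scheduler would still need to report the current id consistently when launching tasks on a live executor.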
@mateiz - @KashiErez and I took a different route. The killer issue was that there is a System.exit(1) in BlockManagerMasterActor, which was a huge robustness issue for us. At Taboola we run some pretty large clusters (processing many terabytes of data per day) which do real-time calculations and are mission critical. So we fixed the issue, and it has been running successfully in our production for a while now. I opened a new ticket: https://issues.apache.org/jira/browse/SPARK-4006 What do you think about our fix?
@tsliwowicz your fix seems good -- thanks for getting to the bottom of this! |