
Mesos executor ids now consist of the slave id and a counter to fix duplicate id problems #1358

Closed

Conversation

@drexin

drexin commented Jul 10, 2014

Mesos executor ids now consist of the slave id and a counter to fix duplicate id problems.

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

Hey Dario - do you mind describing in a bit more detail the problem this fixes (ideally create a JIRA for it) and what the symptoms are?

@drexin
Author

drexin commented Jul 11, 2014

Hi Patrick,

The problem is described in this mailing list entry.

If I understand the documentation on run modes and the code correctly, in fine-grained mode a separate instance of MesosExecutorBackend is started for each Spark task. If that is correct, then as soon as two tasks run concurrently on the same machine we should hit this problem.

On this line in the BlockManagerMasterActor there is a check on the BlockManagerId, which will always differ per Executor instance because the port in it is randomly assigned. The executorId, however, is always set to the Mesos slaveId. This means we run into this case as soon as two Executor instances are started on the same slave. This PR fixes that by appending a counter to the executorId. Please tell me if I overlooked something.
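A minimal sketch of the shape of the check being described (illustrative only, not Spark's actual BlockManagerMasterActor; the class and field names are assumptions): the master remembers one BlockManagerId per executor ID, so a second executor that registers under the same executor ID but a different port is rejected.

```scala
// Illustrative sketch only, not the real Spark source: the master keeps the
// BlockManagerId each executor registered with, and a second registration
// under the same executor ID but a different BlockManagerId (e.g. a different
// random port) is treated as a fatal conflict.
case class BlockManagerId(executorId: String, host: String, port: Int)

class BlockManagerRegistry {
  private val blockManagerIdByExecutor =
    scala.collection.mutable.Map.empty[String, BlockManagerId]

  def register(id: BlockManagerId): Unit =
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(existing) if existing != id =>
        // With the executor ID fixed to the Mesos slave ID, any second
        // executor on the same slave lands here.
        sys.error(s"Got two different block manager registrations on ${id.executorId}")
      case _ =>
        blockManagerIdByExecutor(id.executorId) = id
    }
}
```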

@drexin
Author

drexin commented Jul 11, 2014

Created a JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2445

@mateiz
Contributor

mateiz commented Jul 26, 2014

Jenkins, test this please

@SparkQA

SparkQA commented Jul 26, 2014

QA tests have started for PR 1358. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull

@@ -250,7 +252,7 @@ private[spark] class MesosSchedulerBackend(
       MesosTaskInfo.newBuilder()
         .setTaskId(taskId)
         .setSlaveId(SlaveID.newBuilder().setValue(slaveId).build())
-        .setExecutor(createExecutorInfo(slaveId))
+        .setExecutor(createExecutorInfo(nextExecutorId(slaveId)))
Contributor

Won't this change keep launching a new executor for each task? We want to reuse our Mesos executors.
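The nextExecutorId helper called on the added line is not shown in this hunk. Going by the PR title (slave id plus a counter), a plausible shape, purely as a hypothetical illustration, would be:

```scala
// Hypothetical sketch of the helper referenced in the diff; the real PR may
// differ. It derives executor IDs from the slave ID plus a per-slave counter,
// which is exactly what the review comment above is questioning: a fresh ID
// per task would also mean a fresh Mesos executor per task.
object ExecutorIds {
  private val countersBySlave =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  def nextExecutorId(slaveId: String): String = synchronized {
    val n = countersBySlave(slaveId)
    countersBySlave(slaveId) = n + 1
    s"$slaveId.$n" // e.g. "<slaveId>.0", "<slaveId>.1", ...
  }
}
```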

@mateiz
Contributor

mateiz commented Jul 26, 2014

So I don't quite understand: how can multiple executors be launched for the same Spark application on the same node right now? I thought we always reuse our executor across tasks.

@SparkQA

SparkQA commented Jul 26, 2014

QA results for PR 1358:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull

@drexin
Author

drexin commented Jul 28, 2014

@mateiz: You are right. I don't see how an executor could be started more than once per slave, but it seems to happen sometimes (see the mailing list entry). I will close this PR and try to further investigate this. Thanks!

drexin closed this Jul 28, 2014
@mateiz
Contributor

mateiz commented Jul 28, 2014

Sure, if you find it, let me know.

@gmalouf

gmalouf commented Aug 20, 2014

We've run into this issue a handful of times including once today - is it possible the bug is in Mesos?

@KashiErez

I have encountered this issue:
We have a Spark job running 24/7 on Mesos, and this happens every 1-3 days.

Here are 2 lines from my Driver log file:

2014-09-10 18:50:44,510 ERROR [spark-akka.actor.default-dispatcher-46] TaskSchedulerImpl - Lost executor 201408311047-3690990090-5050-30951-12 on spark106.us.taboolasyndication.com: remote Akka client disassociated

2014-09-10 18:51:46,062 ERROR [spark-akka.actor.default-dispatcher-15] BlockManagerMasterActor - Got two different block manager registrations on 201408311047-3690990090-5050-30951-12

It looks like the driver is disassociated from the Spark worker.
One minute later, the duplicate block manager registration happens.

@brndnmtthws
Member

Yep, also hitting this same problem. We're running Spark 1.0.2 and Mesos 0.20.0.

From a quick analysis, it looks like a bug in Spark.

@brndnmtthws
Member

It seems that this is a symptom of the following issue:

https://issues.apache.org/jira/browse/SPARK-3535

@mateiz
Contributor

mateiz commented Sep 18, 2014

I see, so maybe the problem is that an executor dies, and another is launched on the same Mesos machine with the same executor ID, which then breaks assumptions elsewhere in the code. In that case, our executor ID would need to be something like (Mesos executor ID) + (our attempt # on this executor). But you'd need to look throughout the MesosScheduler code and make sure this works -- in particular we have to send back the right ID when we launch tasks.

@mateiz
Contributor

mateiz commented Sep 18, 2014

BTW the delta from the original pull request would be that we only increment our counter when the old executor fails. If you want to implement that, please create a JIRA for it and send a new PR.
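A hedged sketch of that variant (names are assumptions, not code from any Spark release): keep the counter per slave but bump it only when the executor on that slave is reported lost, so a healthy executor keeps being reused across tasks.

```scala
// Sketch of the "increment only on failure" idea described above.
object ExecutorAttempts {
  private val attemptBySlave =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  // Reused for every task launched on this slave while its executor is healthy.
  def currentExecutorId(slaveId: String): String =
    s"$slaveId.${attemptBySlave(slaveId)}"

  // Called when Mesos reports the executor on this slave as lost; the next
  // launch on the slave then registers under a fresh, non-conflicting ID.
  def executorLost(slaveId: String): Unit = synchronized {
    attemptBySlave(slaveId) += 1
  }
}
```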

@tsliwowicz
Contributor

@mateiz - @KashiErez and I went a different route. The killer issue for us was the System.exit(1) in BlockManagerMasterActor, which was a huge robustness problem. At @taboola we run some pretty large clusters (processing many terabytes of data per day) that do real-time calculations and are mission critical. So we fixed the issue, and it has been running successfully in our production for a while now.

I opened a new ticket - https://issues.apache.org/jira/browse/SPARK-4006
And a pull request - #2854

What do you think about our fix?
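For context, a rough sketch of the direction described above (illustrative only, not the actual SPARK-4006 patch, which lives in PR #2854): instead of treating a re-registration under an existing executor ID as fatal, drop the stale entry and accept the new block manager.

```scala
// Illustrative only; see PR #2854 for the real fix. The idea: a duplicate
// registration from the same executor ID usually means the old executor died
// and a replacement came up, so evict the stale entry instead of exiting.
case class BlockManagerId(executorId: String, host: String, port: Int)

class TolerantRegistry {
  private val byExecutor = scala.collection.mutable.Map.empty[String, BlockManagerId]

  def register(id: BlockManagerId): Unit = {
    byExecutor.get(id.executorId).filter(_ != id).foreach { stale =>
      // Forget the dead executor's registration rather than killing the driver.
      byExecutor.remove(stale.executorId)
    }
    byExecutor(id.executorId) = id
  }
}
```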

@mateiz
Contributor

mateiz commented Oct 23, 2014

@tsliwowicz your fix seems good -- thanks for getting to the bottom of this!

9 participants