
Mesos executor ids now consist of the slave id and a counter to fix duplicate id problems #1358

Closed

Conversation

@drexin

drexin commented Jul 10, 2014

Mesos executor ids now consist of the slave id and a counter to fix duplicate id problems.

@AmplabJenkins

Can one of the admins verify this patch?

@pwendell
Contributor

Hey Dario - do you mind describing in a bit more detail the problem this fixes (ideally create a JIRA for it) and what the symptoms are?

@drexin
Author

drexin commented Jul 11, 2014

Hi Patrick,

The problem is described in this mailing list entry.

If I understand the documentation on run modes and the code correctly, in fine-grained mode a separate instance of MesosExecutorBackend is started for each Spark task. If that is correct, then as soon as two tasks run concurrently on the same machine we should hit this problem.

On this line in the BlockManagerMasterActor there is a check on the BlockManagerId, which will always differ per Executor instance because the port in it is randomly assigned. The executorId, however, is always set to the Mesos slaveId. This means we run into this case as soon as two Executor instances are started on the same slave. This PR fixes that by appending a counter to the executorId. Please tell me if I overlooked something.
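A minimal sketch of the shape of the check being described (illustrative only, not Spark's actual BlockManagerMasterActor; the class and field names are assumptions): the master remembers one BlockManagerId per executor ID, so a second executor that registers under the same executor ID but a different port is rejected.

```scala
// Illustrative sketch only, not the real Spark source: the master keeps the
// BlockManagerId each executor registered with, and a second registration
// under the same executor ID but a different BlockManagerId (e.g. a different
// random port) is treated as a fatal conflict.
case class BlockManagerId(executorId: String, host: String, port: Int)

class BlockManagerRegistry {
  private val blockManagerIdByExecutor =
    scala.collection.mutable.Map.empty[String, BlockManagerId]

  def register(id: BlockManagerId): Unit =
    blockManagerIdByExecutor.get(id.executorId) match {
      case Some(existing) if existing != id =>
        // With the executor ID fixed to the Mesos slave ID, any second
        // executor on the same slave lands here.
        sys.error(s"Got two different block manager registrations on ${id.executorId}")
      case _ =>
        blockManagerIdByExecutor(id.executorId) = id
    }
}
```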

@drexin
Author

drexin commented Jul 11, 2014

Created a JIRA issue here: https://issues.apache.org/jira/browse/SPARK-2445

@mateiz
Contributor

mateiz commented Jul 26, 2014

Jenkins, test this please

@SparkQA

SparkQA commented Jul 26, 2014

QA tests have started for PR 1358. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull

@@ -250,7 +252,7 @@ private[spark] class MesosSchedulerBackend(
       MesosTaskInfo.newBuilder()
         .setTaskId(taskId)
         .setSlaveId(SlaveID.newBuilder().setValue(slaveId).build())
-        .setExecutor(createExecutorInfo(slaveId))
+        .setExecutor(createExecutorInfo(nextExecutorId(slaveId)))
Contributor

Won't this change keep launching a new executor for each task? We want to reuse our Mesos executors.
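The nextExecutorId helper called on the added line is not shown in this hunk. Going by the PR title (slave id plus a counter), a plausible shape, purely as a hypothetical illustration, would be:

```scala
// Hypothetical sketch of the helper referenced in the diff; the real PR may
// differ. It derives executor IDs from the slave ID plus a per-slave counter,
// which is exactly what the review comment above is questioning: a fresh ID
// per task would also mean a fresh Mesos executor per task.
object ExecutorIds {
  private val countersBySlave =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  def nextExecutorId(slaveId: String): String = synchronized {
    val n = countersBySlave(slaveId)
    countersBySlave(slaveId) = n + 1
    s"$slaveId.$n" // e.g. "<slaveId>.0", "<slaveId>.1", ...
  }
}
```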

@mateiz
Contributor

mateiz commented Jul 26, 2014

So I don't quite understand: how can multiple executors be launched for the same Spark application on the same node right now? I thought we always reuse our executor across tasks.

@SparkQA

SparkQA commented Jul 26, 2014

QA results for PR 1358:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17229/consoleFull

@drexin
Author

drexin commented Jul 28, 2014

@mateiz: You are right. I don't see how an executor could be started more than once per slave, but it seems to happen sometimes (see the mailing list entry). I will close this PR and try to further investigate this. Thanks!

drexin closed this Jul 28, 2014
@mateiz
Contributor

mateiz commented Jul 28, 2014

Sure, if you find it, let me know.

@gmalouf

gmalouf commented Aug 20, 2014

We've run into this issue a handful of times including once today - is it possible the bug is in Mesos?

@KashiErez

I have encountered this issue:
We have a Spark job running 24/7 on Mesos, and this happens every 1-3 days.

Here are 2 lines from my Driver log file:

2014-09-10 18:50:44,510 ERROR [spark-akka.actor.default-dispatcher-46] TaskSchedulerImpl - Lost executor 201408311047-3690990090-5050-30951-12 on spark106.us.taboolasyndication.com: remote Akka client disassociated

2014-09-10 18:51:46,062 ERROR [spark-akka.actor.default-dispatcher-15] BlockManagerMasterActor - Got two different block manager registrations on 201408311047-3690990090-5050-30951-12

It looks like the driver is disassociated from the Spark worker.
One minute later, the duplicate block manager registration happens.

@brndnmtthws
Member

Yep, also hitting this same problem. We're running Spark 1.0.2 and Mesos 0.20.0.

From a quick analysis, it looks like a bug in Spark.

@brndnmtthws
Member

It seems that this is a symptom of the following issue:

https://issues.apache.org/jira/browse/SPARK-3535

@mateiz
Contributor

mateiz commented Sep 18, 2014

I see, so maybe the problem is that an executor dies, and another is launched on the same Mesos machine with the same executor ID, which then breaks assumptions elsewhere in the code. In that case, our executor ID would need to be something like (Mesos executor ID) + (our attempt # on this executor). But you'd need to look throughout the MesosScheduler code and make sure this works -- in particular we have to send back the right ID when we launch tasks.

@mateiz
Contributor

mateiz commented Sep 18, 2014

BTW the delta from the original pull request would be that we only increment our counter when the old executor fails. If you want to implement that, please create a JIRA for it and send a new PR.
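A hedged sketch of that variant (names are assumptions, not code from any Spark release): keep the counter per slave but bump it only when the executor on that slave is reported lost, so a healthy executor keeps being reused across tasks.

```scala
// Sketch of the "increment only on failure" idea described above.
object ExecutorAttempts {
  private val attemptBySlave =
    scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

  // Reused for every task launched on this slave while its executor is healthy.
  def currentExecutorId(slaveId: String): String =
    s"$slaveId.${attemptBySlave(slaveId)}"

  // Called when Mesos reports the executor on this slave as lost; the next
  // launch on the slave then registers under a fresh, non-conflicting ID.
  def executorLost(slaveId: String): Unit = synchronized {
    attemptBySlave(slaveId) += 1
  }
}
```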

@tsliwowicz
Contributor

@mateiz - @KashiErez and I went a different route. The killer issue for us was the System.exit(1) in BlockManagerMasterActor, which was a huge robustness problem. At @taboola we run some pretty large clusters (processing many terabytes of data per day) that do real-time calculations and are mission critical. So we fixed the issue, and it has been running successfully in our production for a while now.

I opened a new ticket - https://issues.apache.org/jira/browse/SPARK-4006
And a pull request - #2854

What do you think about our fix?
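For context, a rough sketch of the direction described above (illustrative only, not the actual SPARK-4006 patch, which lives in PR #2854): instead of treating a re-registration under an existing executor ID as fatal, drop the stale entry and accept the new block manager.

```scala
// Illustrative only; see PR #2854 for the real fix. The idea: a duplicate
// registration from the same executor ID usually means the old executor died
// and a replacement came up, so evict the stale entry instead of exiting.
case class BlockManagerId(executorId: String, host: String, port: Int)

class TolerantRegistry {
  private val byExecutor = scala.collection.mutable.Map.empty[String, BlockManagerId]

  def register(id: BlockManagerId): Unit = {
    byExecutor.get(id.executorId).filter(_ != id).foreach { stale =>
      // Forget the dead executor's registration rather than killing the driver.
      byExecutor.remove(stale.executorId)
    }
    byExecutor(id.executorId) = id
  }
}
```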

@mateiz
Contributor

mateiz commented Oct 23, 2014

@tsliwowicz your fix seems good -- thanks for getting to the bottom of this!

9 participants