Skip to content

Commit

Permalink
[SPARK-4006] In long running contexts, we encountered the situation o…
Browse files Browse the repository at this point in the history
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:

1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
  • Loading branch information
tsliwowicz authored and Andrew Or committed Dec 17, 2014
1 parent b9b6762 commit f0eed6e
Showing 1 changed file with 13 additions and 8 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,7 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
}
}
listenerBus.post(SparkListenerBlockManagerRemoved(blockManagerId))
logInfo(s"Removing block manager $blockManagerId")
}

private def expireDeadHosts() {
Expand Down Expand Up @@ -328,16 +329,20 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: SparkConf, listenerBus
private def register(id: BlockManagerId, maxMemSize: Long, slaveActor: ActorRef) {
if (!blockManagerInfo.contains(id)) {
blockManagerIdByExecutor.get(id.executorId) match {
case Some(manager) =>
// A block manager of the same executor already exists.
// This should never happen. Let's just quit.
logError("Got two different block manager registrations on " + id.executorId)
System.exit(1)
case Some(oldId) =>
// A block manager of the same executor already exists, so remove it (assumed dead)
logError("Got two different block manager registrations on same executor - "
+ s" will replace old one $oldId with new one $id")
removeExecutor(id.executorId)
case None =>
blockManagerIdByExecutor(id.executorId) = id
}
blockManagerInfo(id) =
new BlockManagerInfo(id, System.currentTimeMillis(), maxMemSize, slaveActor)
logInfo("Registering block manager %s with %s RAM, %s".format(
id.hostPort, Utils.bytesToString(maxMemSize), id))

blockManagerIdByExecutor(id.executorId) = id

blockManagerInfo(id) = new BlockManagerInfo(
id, System.currentTimeMillis(), maxMemSize, slaveActor)
}
listenerBus.post(SparkListenerBlockManagerAdded(id, maxMemSize))
}
Expand Down

0 comments on commit f0eed6e

Please sign in to comment.