[SPARK-4006] Block Manager - Double Register Crash #2854
…ster without a remove in between. The cause is unknown and is assumed to be a temporary network issue. However, because the second registration uses a BlockManagerId with a different port, blockManagerInfo.contains() returns false while blockManagerIdByExecutor returns Some. This inconsistency is caught by a conditional that calls System.exit(1), which is a serious robustness issue for us. The fix is to remove the old id from both maps during registration when this happens. This mimics the behavior of expireDeadHosts() by cleaning up the maps locally before adding the new entries. Also added some logging for register and unregister.
Can one of the admins verify this patch?
Hey @tsliwowicz thanks for fixing this inconsistency. Since this is an issue affecting the most recent version of Spark as well, would you mind opening a PR against the master branch rather than against 0.9? It will allow us to merge this more easily into branches 1.0, 1.1, and master.
logError("Got two different block manager registrations on " + id.executorId)
System.exit(1)
// A block manager of the same executor already exists so remove it (assumed dead).
logError("Got two different block manager registrations on same executor - will remove, new Id " + id+", orig id - "+manager)
We have a 100 character limit per line for Spark PRs.
@andrewor14 - thanks, and sure - I will fix your comments and do a PR against master.
Can one of the admins verify this patch?
Created another pull request - #2886 - this time on master and also fixed the comments above.
Jenkins, add to whitelist
Hey @tsliwowicz can you make the same changes I suggested in #2886 here?
QA tests have started for PR 2854 at commit
QA tests have finished for PR 2854 at commit
Test FAILed.
…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like #2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like apache/spark#2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. (cherry picked from commit 6b48522) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
…f double registe... ...r without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. This is just like apache/spark#2854 except it's on master Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits: 094d508 [Tal Sliwowicz] some more white space change undone 41a2217 [Tal Sliwowicz] some more whitspaces change undone 7bcfc3d [Tal Sliwowicz] whitspaces fix df9d98f [Tal Sliwowicz] Code review comments fixed f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. (cherry picked from commit 6b48522) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala (cherry picked from commit d122236) Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
QA tests have started for PR 2854 at commit
QA tests have finished for PR 2854 at commit
Test FAILed.
There seems to be a technical issue with the build (not a real failure in the pull request itself).
Jenkins, retest this please.
Test build #24148 has started for PR 2854 at commit
Test build #24148 has finished for PR 2854 at commit
Test FAILed.
Test build #24712 has finished for PR 2854 at commit
Test FAILed.
retest this please...
Test build #24714 has started for PR 2854 at commit
Test build #24714 has finished for PR 2854 at commit
Test FAILed.
Jenkins, retest this please.
Hmm, it's not obvious to me what failed on that last run. I guess I'll just have Jenkins retest it. Pretty sure that it's not an issue in this PR, but it doesn't cost anything to just try again.
Test build #24756 has started for PR 2854 at commit
Test build #24756 has finished for PR 2854 at commit
Test FAILed.
Since it's not obvious what's failing, I guess I'll have to log into Jenkins and look at the logs.
retest this please
Test build #25180 has started for PR 2854 at commit
Test build #25180 has finished for PR 2854 at commit
Test FAILed.
branch-0.9 in general seems to be failing tests because of port contention. I will open a PR to disable the SparkUI during tests to fix this.
Ok I just fixed the port contention and python issues so tests should pass now. Let's retest this please.
Test build #25331 has started for PR 2854 at commit
Test build #25331 has finished for PR 2854 at commit
Test PASSed.
Finally. I'm merging this. Thanks, everyone, and @davies, who fixed the Python tests. :)
This issue affects all versions since 0.7 up to (including) 1.1 In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue. However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us. The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones. Also - added some logging for register and unregister. https://issues.apache.org/jira/browse/SPARK-4006 Author: Tal Sliwowicz <tal.s@taboola.com> Closes #2854 from tsliwowicz/branch-0.9.2-block-mgr-removal and squashes the following commits: 95ae4db [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe... 81d69f0 [Tal Sliwowicz] fixed comment efd93f2 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
Hi @tsliwowicz can you close this PR now that it's merged? Thanks.
Mind closing this @tsliwowicz ? It won't auto-close since it was not opened against master.
Mind closing this PR?
This commit exists to close a pull request on github.
@tsliwowicz can you please close this pull request?
@pwendell Done
This issue affects all versions from 0.7 up to and including 1.1.
In long-running contexts, we encountered a double register without a remove in between. The cause is unknown and is assumed to be a temporary network issue.
However, because the second registration uses a BlockManagerId with a different port, blockManagerInfo.contains() returns false while blockManagerIdByExecutor returns Some. This inconsistency is caught by a conditional that calls System.exit(1), which is a serious robustness issue for us.
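To make the inconsistency concrete, here is a minimal sketch of how the two lookups can disagree. It is not the actual Spark code: the BlockManagerId case class, the MasterState class, and the record helper are simplified, hypothetical stand-ins, and only the two map names are taken from the description above.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for Spark's BlockManagerId.
final case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical holder for the two master-side maps named in the description.
final class MasterState {
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, String]
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Records a registration in both maps, with no consistency checks.
  def record(id: BlockManagerId): Unit = {
    blockManagerInfo(id) = s"info for ${id.host}:${id.port}"
    blockManagerIdByExecutor(id.executorId) = id
  }
}

object DoubleRegisterDemo {
  def main(args: Array[String]): Unit = {
    val state = new MasterState
    state.record(BlockManagerId("exec-1", "host-a", 40001))

    // The same executor shows up again on a different port, with no remove in between.
    val second = BlockManagerId("exec-1", "host-a", 40002)

    // Keyed by the full id, so the new port makes this lookup miss.
    println(state.blockManagerInfo.contains(second))
    // Keyed by executor id only, so the stale registration is still found.
    println(state.blockManagerIdByExecutor.get(second.executorId).isDefined)
  }
}
```

The first lookup misses and the second one hits, which is the combination that the old conditional treated as fatal.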
The fix is to remove the old id from both maps during registration when this happens. This mimics the behavior of expireDeadHosts() by cleaning up the maps locally before adding the new entries.
Also added some logging for register and unregister.
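The shape of the fix can be sketched roughly as follows. This is an illustration under assumptions, not the patched BlockManagerMasterActor: the BlockManagerRegistry class, its log parameter, the Long payload, and the registered and idFor accessors are all hypothetical names introduced here. Only the two map names and the idea of replacing System.exit(1) with a local cleanup come from the description.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for Spark's BlockManagerId.
final case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical registry; `log` stands in for the logging added by the patch.
final class BlockManagerRegistry(log: String => Unit = (msg: String) => println(msg)) {
  private val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, Long]
  private val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Drops an id from both maps so that they stay consistent with each other.
  def remove(id: BlockManagerId): Unit = {
    blockManagerInfo.remove(id)
    blockManagerIdByExecutor.remove(id.executorId)
    log(s"Removed block manager $id")
  }

  def register(id: BlockManagerId, maxMemSize: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // Same executor, different id: treat the old one as dead instead of exiting.
          log(s"Got two different block manager registrations on same executor, " +
            s"removing old id $oldId in favor of $id")
          remove(oldId)
        case None => // First registration for this executor; nothing to clean up.
      }
      blockManagerInfo(id) = maxMemSize
      blockManagerIdByExecutor(id.executorId) = id
      log(s"Registered block manager $id")
    }
  }

  // Read-only views used to inspect the state in this sketch.
  def registered: Set[BlockManagerId] = blockManagerInfo.keySet.toSet
  def idFor(executorId: String): Option[BlockManagerId] = blockManagerIdByExecutor.get(executorId)
}
```

Registering the same executor twice with different ports then leaves exactly one entry in each map, both pointing at the newer id, and the process keeps running.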
https://issues.apache.org/jira/browse/SPARK-4006