[SPARK-4006] In long running contexts, we encountered the situation of d... #2914
...of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
Also - added some logging for register and unregister.
This is just like apache/spark#2854 except it's on master
Author: Tal Sliwowicz <tal.s@taboola.com>
Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:
094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
(cherry picked from commit 6b48522)
Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
(cherry picked from commit d122236)
Conflicts: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
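For anyone skimming the description, here is a minimal sketch of the register-time cleanup it describes. This is not the actual patch: `BlockManagerRegistry` and the simplified `BlockManagerId` case class are hypothetical stand-ins, and the info map stores only a memory size. Only the two map names and the "remove the old id from both maps, then add the new one" order come from the description.

```scala
import scala.collection.mutable

// Simplified stand-in for Spark's BlockManagerId (illustrative only).
final case class BlockManagerId(executorId: String, host: String, port: Int)

// Hypothetical registry modelling the two maps named in the description.
class BlockManagerRegistry {
  // Keyed by the full id, so a re-register on a new port is a new key.
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, Long]
  // Keyed by executor id only, so the stale entry is still found here.
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  def register(id: BlockManagerId, maxMemSize: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      blockManagerIdByExecutor.get(id.executorId) match {
        case Some(oldId) =>
          // Double register without a remove in between: drop the old id
          // from both maps instead of exiting the process.
          println(s"Replacing stale block manager $oldId with $id")
          blockManagerInfo.remove(oldId)
          blockManagerIdByExecutor.remove(oldId.executorId)
        case None => // first registration for this executor
      }
      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = maxMemSize
    }
  }
}
```

Registering the same executor twice with different ports leaves exactly one entry in each map, pointing at the newer id, which is the consistency the old System.exit(1) branch was guarding.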
QA tests have started for PR 2914 at commit
QA tests have finished for PR 2914 at commit
Test FAILed.
There seems to be some technical issue with the build (not a real failure with the pull request itself).
(Looks like you should close this in favor of your other PR. You don't need to reopen just to update a PR. Yes the test failure looks unrelated. You can just ask Jenkins to test again.)
I was asked by @andrewor14 to open separate PRs because it does not merge cleanly. #2886 was approved and merged.
@srowen I don't have a login to Jenkins so someone else needs to restart the build. Is there a way to get a login? I would gladly do it.
You should be able to say "retest this please" and it'll trigger it.
Test build #22146 has started for PR 2914 at commit
Test build #22146 has finished for PR 2914 at commit
Test FAILed.
There's an issue with Spark running tests on PRs opened against older branches (e.g. 1.0, 0.9). I will look into this shortly...
@andrewor14 - thanks for your help!
Hi @andrewor14 - can I help somehow? I see that the PRs were not yet merged into 0.9 and 1.0
Yeah I'm a little swamped for the 1.2 release at the moment so I haven't had time to dig into the Jenkins issue for older branches. I will try to look into it later this week if possible.
Jenkins, retest this please.
Test build #24149 has started for PR 2914 at commit
Test build #24149 has finished for PR 2914 at commit
Test FAILed.
Seems like an issue with Jenkins
retest this please
I think this might be an issue with the Jenkins pull request builder and pull requests that are opened against non-master backport branches. Once this latest test run fails, I can try to dig in and help diagnose what's going on.
Test build #24260 has started for PR 2914 at commit
Test build #24260 has finished for PR 2914 at commit
Test FAILed.
@JoshRosen and I just fixed the test infra failure for older branches. Let's retest this please
Test FAILed.
jenkins, test this please
Test build #24267 has started for PR 2914 at commit
Test build #24267 has finished for PR 2914 at commit
Test FAILed.
retest this please
Test build #24280 has started for PR 2914 at commit
Test build #24280 has finished for PR 2914 at commit
Test FAILed.
Hey sorry @tsliwowicz for using your PRs as the battleground in fixing our builds against older branches. There aren't a lot of PRs opened against older branches so these tests aren't run in this context very often. So far I think all of these test failures have nothing to do with your patch so there is no action needed on your side. On our side, we'll keep investigating why the tests are failing all the time.
No problem. Glad to help :-)
The build fix PR is #3668
Jenkins, retest this please.
Test build #24475 has started for PR 2914 at commit
Test build #24475 has finished for PR 2914 at commit
Test FAILed.
jenkins, test this
Now that the necessary backports are in place, Jenkins, test this please
Test build #24552 has started for PR 2914 at commit
Test build #24552 has finished for PR 2914 at commit
Test PASSed.
Finally. I'm merging this into branch-1.0. Thanks for your patience, @tsliwowicz.
...of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.
However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.
The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
Also - added some logging for register and unregister.
This is just like #2886 except it's on branch-1.0
Author: Tal Sliwowicz <tal.s@taboola.com>
Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:
1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
hurray :-)
By the way, can you close this now that it's merged? Thanks.