[SPARK-4006] In long running contexts, we encountered the situation of d... #2914

tsliwowicz · 2014-10-23T21:17:47Z

...ouble register without a remove in between. The cause for that is unknown and is assumed to be a temporary network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.
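A minimal sketch of the cleanup described above, not the actual patch. The two map names come from the description; `BlockManagerRegistry`, `removeLocally`, and the fields of the two case classes are hypothetical stand-ins for illustration, and `println` stands in for the added logging.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's types (assumption: the real classes carry more state).
case class BlockManagerId(executorId: String, host: String, port: Int)
case class BlockManagerInfo(registeredAtMs: Long, maxMem: Long)

// Hypothetical container for the two maps named in the description.
class BlockManagerRegistry {
  val blockManagerInfo = mutable.HashMap.empty[BlockManagerId, BlockManagerInfo]
  val blockManagerIdByExecutor = mutable.HashMap.empty[String, BlockManagerId]

  // Hypothetical helper: drop an id from both maps so they stay consistent.
  private def removeLocally(id: BlockManagerId): Unit = {
    blockManagerIdByExecutor -= id.executorId
    blockManagerInfo -= id
    println(s"Removed stale block manager $id")
  }

  def register(id: BlockManagerId, maxMem: Long): Unit = {
    if (!blockManagerInfo.contains(id)) {
      // The executor is already known under a different id (for example a new port):
      // clean up the old entry instead of treating it as a fatal inconsistency.
      blockManagerIdByExecutor.get(id.executorId).foreach(removeLocally)

      blockManagerIdByExecutor(id.executorId) = id
      blockManagerInfo(id) = BlockManagerInfo(System.currentTimeMillis(), maxMem)
      println(s"Registered block manager $id")
    }
  }
}
```

Registering the same executor twice with different ports should leave exactly one entry in each map, pointing at the newer id.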

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

…f double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache/spark#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

(cherry picked from commit d122236)

Conflicts:
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

SparkQA · 2014-10-23T21:24:58Z

QA tests have started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-10-23T22:44:48Z

QA tests have finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-23T22:44:52Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22086/
Test FAILed.

tsliwowicz · 2014-10-24T13:56:32Z

There seems to be a technical issue with the build (not a real failure of the pull request itself).

srowen · 2014-10-24T14:23:56Z

(Looks like you should close this in favor of your other PR. You don't need to reopen just to update a PR. Yes the test failure looks unrelated. You can just ask Jenkins to test again.)

tsliwowicz · 2014-10-24T16:18:40Z

I was asked by @andrewor14 to open separate PRs because it does not merge cleanly. #2886 was approved and merged.

tsliwowicz · 2014-10-24T16:20:42Z

@srowen I don't have a login to Jenkins so someone else needs to restart the build. Is there a way to get a login? I would gladly do it.

andrewor14 · 2014-10-24T17:22:35Z

You should be able to say "retest this please" and it'll trigger it

SparkQA · 2014-10-24T17:29:52Z

Test build #22146 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-10-24T18:44:15Z

Test build #22146 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-24T18:44:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22146/
Test FAILed.

andrewor14 · 2014-10-24T20:54:00Z

There's an issue with Spark running tests on PRs opened against older branches (e.g. 1.0, 0.9). I will look into this shortly...

tsliwowicz · 2014-10-24T21:03:42Z

@andrewor14 - thanks for your help!

tsliwowicz · 2014-10-28T11:09:45Z

Hi @andrewor14 - can I help somehow? I see that the PRs were not yet merged into 0.9 and 1.0

andrewor14 · 2014-10-28T16:54:20Z

Yeah I'm a little swamped for the 1.2 release at the moment so I haven't had time to dig into the Jenkins issue for older branches. I will try to look into it later this week if possible.

JoshRosen · 2014-12-04T23:38:02Z

Jenkins, retest this please.

SparkQA · 2014-12-04T23:45:30Z

Test build #24149 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-05T01:03:50Z

Test build #24149 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-05T01:03:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24149/
Test FAILed.

tsliwowicz · 2014-12-05T12:31:06Z

Seems like an issue with Jenkins

andrewor14 · 2014-12-09T21:16:03Z

retest this please

JoshRosen · 2014-12-09T21:17:24Z

I think this might be an issue with the Jenkins pull request builder and pull requests that are opened against non-master backport branches. Once this latest test run fails, I can try to dig in and help diagnose what's going on.

SparkQA · 2014-12-09T21:18:11Z

Test build #24260 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-09T22:31:11Z

Test build #24260 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-09T22:31:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24260/
Test FAILed.

andrewor14 · 2014-12-09T22:49:58Z

@JoshRosen and I just fixed the test infra failure for older branches. Let's retest this please

AmplabJenkins · 2014-12-09T22:52:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24265/
Test FAILed.

shaneknapp · 2014-12-09T23:03:38Z

jenkins, test this please

SparkQA · 2014-12-09T23:10:26Z

Test build #24267 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-10T00:22:04Z

Test build #24267 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-10T00:22:11Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24267/
Test FAILed.

andrewor14 · 2014-12-10T01:12:35Z

retest this please

SparkQA · 2014-12-10T01:21:02Z

Test build #24280 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-10T02:34:56Z

Test build #24280 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-10T02:35:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24280/
Test FAILed.

andrewor14 · 2014-12-10T02:43:49Z

Hey sorry @tsliwowicz for using your PRs as the battleground in fixing our builds against older branches. There aren't a lot of PRs opened against older branches so these tests aren't run in this context very often. So far I think all of these test failures have nothing to do with your patch so there is no action needed on your side. On our side, we'll keep investigating why the tests are failing all the time.

tsliwowicz · 2014-12-10T12:44:32Z

No problem. Glad to help :-)


andrewor14 · 2014-12-10T20:07:34Z

Looks like the issue is that our tests use Python 2.6, and this version cannot unpickle arrays properly by default. @davies will backport #2365 to branch-1.0 and then we can re-run the tests after that.

andrewor14 · 2014-12-10T22:57:52Z

The build fix PR is #3668

JoshRosen · 2014-12-16T01:03:36Z

Jenkins, retest this please.

SparkQA · 2014-12-16T01:07:38Z

Test build #24475 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-16T01:29:35Z

Test build #24475 has finished for PR 2914 at commit 1014493.

This patch fails some tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-16T01:29:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24475/
Test FAILed.

shaneknapp · 2014-12-16T02:07:23Z

jenkins, test this

andrewor14 · 2014-12-17T19:59:34Z

Now that the necessary backports are in place: Jenkins, test this please

SparkQA · 2014-12-17T20:02:44Z

Test build #24552 has started for PR 2914 at commit 1014493.

This patch merges cleanly.

SparkQA · 2014-12-17T21:28:25Z

Test build #24552 has finished for PR 2914 at commit 1014493.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-12-17T21:28:30Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24552/
Test PASSed.

andrewor14 · 2014-12-17T22:13:19Z

Finally. I'm merging this into branch-1.0. Thanks for your patience, @tsliwowicz

…f double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:

1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...

tsliwowicz · 2014-12-18T07:51:39Z

hurray :-)


andrewor14 · 2014-12-18T19:50:33Z

By the way, can you close this now that it's merged? Thanks

tsliwowicz closed this Dec 19, 2014
