Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4006] In long running contexts, we encountered the situation of d... #2914

Closed

Conversation

tsliwowicz
Copy link
Contributor

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

…f double registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like apache/spark#2854 except it's on master

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2886 from tsliwowicz/master-block-mgr-removal and squashes the following commits:

094d508 [Tal Sliwowicz] some more white space change undone
41a2217 [Tal Sliwowicz] some more whitspaces change undone
7bcfc3d [Tal Sliwowicz] whitspaces fix
df9d98f [Tal Sliwowicz] Code review comments fixed
f48bce9 [Tal Sliwowicz] In long running contexts, we encountered the situation of double register without a remove in between. The cause for that is unknown, and assumed a temp network issue.

(cherry picked from commit 6b48522)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala

(cherry picked from commit d122236)

Conflicts:
	core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala
@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 23, 2014

QA tests have finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22086/
Test FAILed.

@tsliwowicz
Copy link
Contributor Author

there seems to be some technical issue with the build. (not a real failure with the pull request itself)

@srowen
Copy link
Member

srowen commented Oct 24, 2014

(Looks like you should close this in favor of your other PR. You don't need to reopen just to update a PR. Yes the test failure looks unrelated. You can just ask Jenkins to test again.)

@tsliwowicz
Copy link
Contributor Author

I was asked by @andrewor14 to open separate PRs because it does not merge cleanly. #2886 was approved and merged.

@tsliwowicz
Copy link
Contributor Author

@srowen I don't have a login to Jenkins so someone else needs to restart the build. Is there a way to get a login? I would gladly do it.

@andrewor14
Copy link
Contributor

You should be able to say "retest this please" and it'll trigger it

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22146 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 24, 2014

Test build #22146 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22146/
Test FAILed.

@andrewor14
Copy link
Contributor

There's an issue with Spark running tests on PRs opened against older branches (e.g. 1.0, 0.9). I will look into this shortly...

@tsliwowicz
Copy link
Contributor Author

@andrewor14 - thanks for your help!

@tsliwowicz
Copy link
Contributor Author

Hi @andrewor14 - can I help somehow? I see that the PRs were not yet merged into 0.9 and 1.0

@andrewor14
Copy link
Contributor

Yeah I'm a little swamped for the 1.2 release at the moment so I haven't had time to dig into the Jenkins issue for older branches. I will try to look into it later this week if possible.

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Dec 4, 2014

Test build #24149 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 5, 2014

Test build #24149 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24149/
Test FAILed.

@tsliwowicz
Copy link
Contributor Author

Seems like an issue with Jenkins

@andrewor14
Copy link
Contributor

retest this please

@JoshRosen
Copy link
Contributor

I think this might be an issue with the Jenkins pull request builder and pull requests that are opened against non-master backport branches. Once this latest test run fails, I can try to dig in and help diagnose what's going on.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24260 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24260 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24260/
Test FAILed.

@andrewor14
Copy link
Contributor

@JoshRosen and I just fixed the test infra failure for older branches. Let's retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24265/
Test FAILed.

@shaneknapp
Copy link
Contributor

jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24267 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 10, 2014

Test build #24267 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24267/
Test FAILed.

@andrewor14
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Dec 10, 2014

Test build #24280 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 10, 2014

Test build #24280 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24280/
Test FAILed.

@andrewor14
Copy link
Contributor

Hey sorry @tsliwowicz for using your PRs as the battleground in fixing our builds against older branches. There aren't a lot of PRs opened against older branches so these tests aren't run in this context very often. So far I think all of these test failures have nothing to do with your patch so there is no action needed on your side. On our side, we'll keep investigating why the tests are failing all the time.

@tsliwowicz
Copy link
Contributor Author

No problem. Glad to help :-)

On Wed, Dec 10, 2014 at 4:44 AM, andrewor14 notifications@github.com
wrote:

Hey sorry @tsliwowicz https://github.com/tsliwowicz for using your PRs
as the battleground in fixing our builds against older branches. There
aren't a lot of PRs opened against older branches so these tests aren't run
in this context very often. So far I think all of these test failures have
nothing to do with your patch so there is no action needed on your side. On
our side, we'll keep investigating why the tests are failing all the time.


Reply to this email directly or view it on GitHub
#2914 (comment).

@andrewor14
Copy link
Contributor

Looks like the issue is that in our tests we use python 2.6, and this version cannot unpickle arrays properly by default. @davies will backport #2365 to branch-1.0 and then we can re-run the tests after that.

@andrewor14
Copy link
Contributor

The build fix PR is #3668

@JoshRosen
Copy link
Contributor

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Dec 16, 2014

Test build #24475 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 16, 2014

Test build #24475 has finished for PR 2914 at commit 1014493.

  • This patch fails some tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24475/
Test FAILed.

@shaneknapp
Copy link
Contributor

jenkins, test this

@andrewor14
Copy link
Contributor

Now that the necessary back ports are in place. Jenkins, test this please

@SparkQA
Copy link

SparkQA commented Dec 17, 2014

Test build #24552 has started for PR 2914 at commit 1014493.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 17, 2014

Test build #24552 has finished for PR 2914 at commit 1014493.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24552/
Test PASSed.

@andrewor14
Copy link
Contributor

Finally. I'm merging this into branch-1.0 thanks for your patience @tsliwowicz

asfgit pushed a commit that referenced this pull request Dec 18, 2014
…f d...

...ouble registe...

...r without a remove in between. The cause for that is unknown, and assumed a temp network issue.

However, since the second register is with a BlockManagerId on a different port, blockManagerInfo.contains() returns false, while blockManagerIdByExecutor returns Some. This inconsistency is caught in a conditional statement that does System.exit(1), which is a huge robustness issue for us.

The fix - simply remove the old id from both maps during register when this happens. We are mimicking the behavior of expireDeadHosts(), by doing local cleanup of the maps before trying to add new ones.

Also - added some logging for register and unregister.

This is just like #2886 except it's on branch-1.0

Author: Tal Sliwowicz <tal.s@taboola.com>

Closes #2914 from tsliwowicz/branch-1.0-block-mgr-removal and squashes the following commits:

1014493 [Tal Sliwowicz] [SPARK-4006] In long running contexts, we encountered the situation of double registe...
@tsliwowicz
Copy link
Contributor Author

hurray :-)

On Thu, Dec 18, 2014 at 12:13 AM, andrewor14 notifications@github.com
wrote:

Finally. I'm merging this into branch-1.0 thanks for your patience
@tsliwowicz https://github.com/tsliwowicz


Reply to this email directly or view it on GitHub
#2914 (comment).

@andrewor14
Copy link
Contributor

By the way can you close this now that is' merged? thanks

@tsliwowicz tsliwowicz closed this Dec 19, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants