[SPARK-3923] Increase Akka heartbeat pause above heartbeat interval #2784

aarondav · 2014-10-13T18:21:02Z

Something about the Akka 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that the heartbeat pause should be greater than the heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores the threshold.
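For readers puzzling over the threshold remark: in a phi accrual failure detector, the suspicion level phi is usually defined as the negative base-10 logarithm of the probability that a heartbeat would arrive even later than the current silence, so a threshold of 300 corresponds to a probability of 10^-300. The sketch below illustrates that definition with a plain normal model. The function name `phi`, the mean of 1000 s, and the standard deviation of 100 s are assumptions chosen for illustration, and Akka's own implementation may compute the value differently.

```python
import math


def phi(time_since_last_s: float, mean_s: float, std_s: float) -> float:
    """Suspicion level: -log10 of the probability that a heartbeat arrives
    later than time_since_last_s, under a normal model of inter-arrival times."""
    # Survival function of the normal distribution, P(X > t).
    z = (time_since_last_s - mean_s) / (std_s * math.sqrt(2.0))
    p_later = 0.5 * math.erfc(z)
    if p_later == 0.0:
        # Underflow: the probability is smaller than a float can represent.
        return math.inf
    return -math.log10(p_later)


MEAN_S = 1000.0  # assumed mean inter-arrival time (the heartbeat interval)
STD_S = 100.0    # assumed standard deviation, for illustration only

# phi grows as the silence since the last heartbeat gets longer.
for t in (1000.0, 1300.0, 2000.0):
    print(f"silence={t:.0f}s phi={phi(t, MEAN_S, STD_S):.2f}")
```

Under this model, phi stays small for silences near the mean and only reaches very large values many standard deviations out, which is why a threshold of 300 would be expected to almost never trigger.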

AmplabJenkins · 2014-10-13T19:14:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21694/
Test FAILed.

aarondav · 2014-10-13T21:08:27Z

Jenkins, retest this please.

AmplabJenkins · 2014-10-13T21:22:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21702/
Test FAILed.

aarondav · 2014-10-13T21:42:06Z

Jenkins, retest this please.

AmplabJenkins · 2014-10-13T22:42:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21706/
Test FAILed.

SparkQA · 2014-10-14T00:46:45Z

QA tests have started for PR 2784 at commit 3639220.

This patch merges cleanly.

SparkQA · 2014-10-14T01:40:22Z

QA tests have finished for PR 2784 at commit 3639220.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

aarondav · 2014-10-14T01:41:23Z

Jenkins, retest this please.

SparkQA · 2014-10-14T01:44:52Z

QA tests have started for PR 2784 at commit 3639220.

This patch merges cleanly.

SparkQA · 2014-10-14T01:54:38Z

QA tests have started for PR 2784 at commit 9cb0372.

This patch merges cleanly.

SparkQA · 2014-10-14T02:36:40Z

QA tests have finished for PR 2784 at commit 3639220.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-14T02:36:44Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21711/
Test FAILed.

SparkQA · 2014-10-14T02:50:05Z

QA tests have finished for PR 2784 at commit 9cb0372.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-14T02:50:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21712/
Test FAILed.

SparkQA · 2014-10-14T03:03:29Z

QA tests have started for PR 2784 at commit 9cb0372.

This patch merges cleanly.

SparkQA · 2014-10-14T04:15:42Z

QA tests have finished for PR 2784 at commit 9cb0372.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

witgo · 2014-10-14T04:38:07Z

This configuration value seems to be in milliseconds.
DeadlineFailureDetector.scala
PhiAccrualFailureDetector.scala

witgo · 2014-10-14T04:47:50Z

Sorry, I made a mistake.

ScrapCodes · 2014-10-14T09:33:11Z

Hey Aaron,
I increased the interval because it is just "noise" anyway! We don't intend to use Akka's failure detector, because we have our own heartbeat tracking mechanism in place. If you reduce the time interval, the number of system messages exchanged will rise. The effect may not be evident in performance or in perf benchmarks, but these messages are unnecessary.

You can actually increase the pause instead, until Akka provides a property to turn this off completely. (I think we should log an issue?)
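The point about extra system messages can be made concrete with a little arithmetic. The sketch below assumes exactly one heartbeat per interval per connection, which is a simplification. The helper name `heartbeats_sent` is made up for illustration, and the 500 s interval is a hypothetical reduced value, not one proposed in this thread.

```python
def heartbeats_sent(duration_s: int, interval_s: int) -> int:
    """Heartbeats one peer sends over duration_s, assuming one per interval."""
    return duration_s // interval_s


DAY_S = 24 * 60 * 60

# 1000 s is the interval discussed in this thread; 500 s is a hypothetical
# reduced interval used only for comparison.
for interval_s in (1000, 500):
    count = heartbeats_sent(DAY_S, interval_s)
    print(f"interval={interval_s}s -> {count} heartbeats per day per connection")
```

Raising the pause leaves this count unchanged, whereas lowering the interval increases it in proportion.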

SparkQA · 2014-10-14T17:19:42Z

QA tests have started for PR 2784 at commit bd1151a.

This patch merges cleanly.

SparkQA · 2014-10-14T18:29:03Z

QA tests have finished for PR 2784 at commit bd1151a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-14T18:29:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21733/
Test PASSed.

aarondav · 2014-10-14T20:09:40Z

@ScrapCodes I increased the pause by an order of magnitude and reverted the change to the interval.

ScrapCodes · 2014-10-15T06:35:20Z

Thanks, LGTM.

ScrapCodes · 2014-10-15T06:40:12Z

Minor: Your PR title looks misleading! :)

aarondav · 2014-10-15T16:50:26Z

Updated, but I think we should always give PRs a name opposite to what they actually do. Keeps things interesting.

vanzin · 2014-10-15T16:52:42Z

"above below"?

andrewor14 · 2014-10-17T01:57:18Z

I see, if a heartbeat is lost there is no way to recover if the wait time is less than the interval. With these changes the default pause is 6 times the default interval. This LGTM. I'm merging this.
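The reasoning above can be sketched as a small calculation. It uses a deliberately simplified rule, which is an assumption and not a description of Akka's detector: a peer is suspected once the silence since its last received heartbeat exceeds the pause. The function name `tolerated_losses` is hypothetical. The interval and pause values are the ones quoted in this thread.

```python
def tolerated_losses(interval_s: int, pause_s: int) -> int:
    """Consecutive lost heartbeats that can be absorbed under the simplified rule.

    After k consecutive losses the silence between received heartbeats is
    (k + 1) * interval_s, which is tolerated while it does not exceed pause_s.
    A result of -1 means even a loss-free stream leaves gaps longer than the pause.
    """
    return pause_s // interval_s - 1


for label, interval_s, pause_s in (("before", 1000, 600), ("after", 1000, 6000)):
    losses = tolerated_losses(interval_s, pause_s)
    print(f"{label}: interval={interval_s}s pause={pause_s}s tolerated_losses={losses}")
```

Under this rule, the old settings leave no margin at all, while a pause of six intervals leaves room for several consecutive losses.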

aarondav force-pushed the fix-timeout branch from 3639220 to 9cb0372 October 14, 2014 01:48

Increase pause, don't decrease interval

bd1151a

aarondav changed the title ~~[SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause~~ [SPARK-3923] Increase Akka heartbeat pause above below heartbeat interval Oct 15, 2014

aarondav changed the title ~~[SPARK-3923] Increase Akka heartbeat pause above below heartbeat interval~~ [SPARK-3923] Increase Akka heartbeat pause above heartbeat interval Oct 15, 2014

asfgit closed this in 7f7b50e Oct 17, 2014
