[SPARK-3923] Increase Akka heartbeat pause above heartbeat interval #2784
Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that the heartbeat pause should be greater than the heartbeat interval, and decreasing the interval from 1000s to below the 600s of the heartbeat pause seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure! I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores the threshold.
Test FAILed.
Jenkins, retest this please.
Test FAILed.
Jenkins, retest this please.
Test FAILed.
QA tests have started for PR 2784 at commit
QA tests have finished for PR 2784 at commit
Jenkins, retest this please.
QA tests have started for PR 2784 at commit
QA tests have started for PR 2784 at commit
QA tests have finished for PR 2784 at commit
Test FAILed.
QA tests have finished for PR 2784 at commit
Test FAILed.
QA tests have started for PR 2784 at commit
QA tests have finished for PR 2784 at commit
This configuration value seems to be in milliseconds.
Sorry, I made a mistake.
Hey Aaron, you can actually increase the pause until Akka provides a property to turn this off completely. (I think we should log an issue?)
QA tests have started for PR 2784 at commit
QA tests have finished for PR 2784 at commit
Test PASSed.
@ScrapCodes I increased the pause by an order of magnitude and reverted the change to the interval.
Thanks, LGTM.
Minor: your PR title looks misleading! :)
Updated, but I think we should always give PRs a name opposite to what they actually do. Keeps things interesting.
"above below"?
I see, if a heartbeat is lost there is no way to recover if the wait time is less than the interval. With these changes the default pause is 6 times the default interval. This LGTM. I'm merging this.
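To make the arithmetic in the comment above concrete, here is a minimal sketch of a toy deadline model. It assumes a peer is declared dead once the silence since its last heartbeat exceeds interval + pause. This is only an illustration, not Akka's actual failure-detector logic, and the function name `max_lost_heartbeats` is hypothetical.

```python
def max_lost_heartbeats(interval_s: int, pause_s: int) -> int:
    """Consecutive heartbeats that may be lost before a toy deadline
    detector fires (assumed model, not Akka's implementation).

    After k lost heartbeats the silent gap is (k + 1) * interval_s.
    The peer is kept alive while that gap is <= interval_s + pause_s,
    which simplifies to k <= pause_s / interval_s.
    """
    return pause_s // interval_s


if __name__ == "__main__":
    interval = 1000  # heartbeat interval in seconds, as discussed in this PR
    for pause in (600, 6000):  # old and new pause values from this PR
        lost = max_lost_heartbeats(interval, pause)
        print(f"pause={pause}s tolerates {lost} lost heartbeat(s)")
```

Under this assumed model, a 600s pause with a 1000s interval tolerates no lost heartbeat at all, while a 6000s pause tolerates six in a row.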
Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). This post suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!
I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.