[SPARK-1685] Cancel retryTimer on restart of Worker or AppClient #602

markhamstra · 2014-04-30T20:12:13Z

See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: if the Worker or AppClient actor restarts before successfully registering with the Master, multiple retryTimers will be running at once, so fewer than the full number of registration retries are attempted before the new actor is forced to give up.
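The failure mode described above can be pictured with a small deterministic toy model. This is only a sketch under stated assumptions, not the code in this PR: it assumes each timer keeps its own retry count and that any timer exhausting its budget makes the whole process give up. `RetryTimerSketch`, `RetryTimer`, `tick`, and `attemptsBeforeGiveUp` are hypothetical names, and manual ticks stand in for the actor system's scheduler.

```scala
object RetryTimerSketch {

  // Hypothetical stand-in for a scheduler-issued Cancellable (not Akka's API).
  // Each timer owns its retry counter, which is the assumption behind this model.
  final class RetryTimer(maxRetries: Int, onGiveUp: () => Unit) {
    private var retries = 0
    private var cancelled = false

    def cancel(): Unit = { cancelled = true }

    // One scheduler tick: count a retry, and give up once the budget is spent.
    def tick(): Unit = {
      if (!cancelled) {
        retries += 1
        if (retries >= maxRetries) onGiveUp()
      }
    }

    def attempts: Int = retries
  }

  // Simulates an actor that starts registering, restarts after `restartAfter`
  // ticks, and then runs until some timer gives up. Returns how many ticks the
  // timer created after the restart received before the give-up happened.
  def attemptsBeforeGiveUp(maxRetries: Int, restartAfter: Int, cancelOnRestart: Boolean): Int = {
    require(restartAfter >= 0 && restartAfter < maxRetries,
      "the restart must happen before the first timer exhausts its retries")

    var gaveUp = false
    val giveUp = () => { gaveUp = true }

    // Timer created by the first incarnation of the actor.
    val stale = new RetryTimer(maxRetries, giveUp)
    (1 to restartAfter).foreach(_ => stale.tick())

    // The restart: either cancel the old timer or leave it running.
    if (cancelOnRestart) stale.cancel()
    val fresh = new RetryTimer(maxRetries, giveUp)

    // Both timers receive every tick; a cancelled timer ignores it.
    while (!gaveUp) {
      stale.tick()
      fresh.tick()
    }
    fresh.attempts
  }
}
```

With a budget of 3 retries and a restart after 2 ticks, `RetryTimerSketch.attemptsBeforeGiveUp(3, 2, cancelOnRestart = false)` returns 1, because the stale timer reaches its limit first. With `cancelOnRestart = true` it returns 3, the full budget.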

AmplabJenkins · 2014-04-30T20:12:57Z

Merged build triggered.

AmplabJenkins · 2014-04-30T20:13:04Z

Merged build started.

aarondav · 2014-04-30T20:26:59Z

core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala

@@ -60,6 +60,7 @@ private[spark] class AppClient(
    var master: ActorSelection = null
    var alreadyDisconnected = false  // To avoid calling listener.disconnected() multiple times
    var alreadyDead = false  // To avoid calling listener.dead() multiple times
+    var retryTimer: Option[Cancellable] = None


Maybe we could rename this to something like "registrationRetryTimer", since the wider scope otherwise makes its purpose unclear.

Good idea. I'll update.

aarondav · 2014-04-30T20:27:46Z

One minor comment, other than that this looks good to me. Nice catch! I never know if our actors are actually restartable (or when they restart, for that matter)...

markhamstra · 2014-04-30T20:34:20Z

+1 on the "Is it really restartable?" comment. I spent a little time today trying to answer that question for all of our actors, and had to quit when I got too scared! I'm pretty confident about the DAGScheduler post-#186, but not so much about the rest.

AmplabJenkins · 2014-04-30T20:42:57Z

Merged build triggered.

AmplabJenkins · 2014-04-30T20:43:04Z

Merged build started.

AmplabJenkins · 2014-04-30T20:51:47Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-30T20:51:47Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14590/

AmplabJenkins · 2014-04-30T21:21:23Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-30T21:21:23Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14591/

mateiz · 2014-05-06T19:54:03Z

Merged this, thanks.

See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient

See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient

(cherry picked from commit fbfe69d)
Signed-off-by: Matei Zaharia <matei@databricks.com>

See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient

Conflicts:
core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala


Cancel retryTimer on restart of Worker or AppClient

69c348c

aarondav reviewed Apr 30, 2014

retryTimer -> registrationRetryTimer

11cc088

asfgit closed this in fbfe69d May 6, 2014

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Add citynetowrk fra region (apache#602)

c14fd9c

udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024

K8S-1087 (apache#602)

7f35c27

* K8S-1087 - mount metrics_ticket implicitly to spark pods from mapr-server-secrets * K8S-1087 - fix tickets mounting conflict - move unnecessary config values to constants
