Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-1685] Cancel retryTimer on restart of Worker or AppClient #602

Closed
wants to merge 2 commits into from

Conversation

markhamstra
Copy link
Contributor

See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@@ -60,6 +60,7 @@ private[spark] class AppClient(
var master: ActorSelection = null
var alreadyDisconnected = false // To avoid calling listener.disconnected() multiple times
var alreadyDead = false // To avoid calling listener.dead() multiple times
var retryTimer: Option[Cancellable] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could rename this to something like "registrationRetryTimer" since wider scope makes this unclear otherwise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I'll update.

@aarondav
Copy link
Contributor

One minor comment, other than that this looks good to me. Nice catch! I never know if our actors are actually restartable (or when they restart, for that matter)...

@markhamstra
Copy link
Contributor Author

+1 On the "Is it really restartable?" comment. I spent a little time today looking and trying to answer that question for all of our actors. I had to quit when I got too scared! I'm pretty confident about the DAGScheduler post-#186, but for the rest, not so much.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14590/

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14591/

@mateiz
Copy link
Contributor

mateiz commented May 6, 2014

Merged this, thanks.

@asfgit asfgit closed this in fbfe69d May 6, 2014
asfgit pushed a commit that referenced this pull request May 6, 2014
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
asfgit pushed a commit that referenced this pull request May 6, 2014
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient

(cherry picked from commit fbfe69d)
Signed-off-by: Matei Zaharia <matei@databricks.com>
markhamstra added a commit to markhamstra/spark that referenced this pull request May 14, 2014
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala
	core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
* K8S-1087

- mount metrics_ticket implicitly to spark pods from mapr-server-secrets

* K8S-1087
- fix tickets mounting conflict
- move unnecessary config values to constants
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024
* K8S-1087

- mount metrics_ticket implicitly to spark pods from mapr-server-secrets

* K8S-1087
- fix tickets mounting conflict
- move unnecessary config values to constants
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants