-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-1685] Cancel retryTimer on restart of Worker or AppClient #602
Conversation
Merged build triggered. |
Merged build started. |
@@ -60,6 +60,7 @@ private[spark] class AppClient( | |||
var master: ActorSelection = null | |||
var alreadyDisconnected = false // To avoid calling listener.disconnected() multiple times | |||
var alreadyDead = false // To avoid calling listener.dead() multiple times | |||
var retryTimer: Option[Cancellable] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could rename this to something like "registrationRetryTimer" since wider scope makes this unclear otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I'll update.
One minor comment, other than that this looks good to me. Nice catch! I never know if our actors are actually restartable (or when they restart, for that matter)... |
+1 On the "Is it really restartable?" comment. I spent a little time today looking and trying to answer that question for all of our actors. I had to quit when I got too scared! I'm pretty confident about the DAGScheduler post-#186, but for the rest, not so much. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Merged this, thanks. |
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up. Author: Mark Hamstra <markhamstra@gmail.com> Closes #602 from markhamstra/SPARK-1685 and squashes the following commits: 11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer 69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up. Author: Mark Hamstra <markhamstra@gmail.com> Closes #602 from markhamstra/SPARK-1685 and squashes the following commits: 11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer 69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient (cherry picked from commit fbfe69d) Signed-off-by: Matei Zaharia <matei@databricks.com>
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up. Author: Mark Hamstra <markhamstra@gmail.com> Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits: 11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer 69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient Conflicts: core/src/main/scala/org/apache/spark/deploy/client/AppClient.scala core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up. Author: Mark Hamstra <markhamstra@gmail.com> Closes apache#602 from markhamstra/SPARK-1685 and squashes the following commits: 11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer 69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
* K8S-1087 - mount metrics_ticket implicitly to spark pods from mapr-server-secrets * K8S-1087 - fix tickets mounting conflict - move unnecessary config values to constants
* K8S-1087 - mount metrics_ticket implicitly to spark pods from mapr-server-secrets * K8S-1087 - fix tickets mounting conflict - move unnecessary config values to constants
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.