-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12062] [CORE] Change Master to asyc rebuild UI when application completes #10284
[SPARK-12062] [CORE] Change Master to asyc rebuild UI when application completes #10284
Conversation
I tested this by making an artificially large event log file, then tried to stop the worker and make sure it could re-register before rebuilding the UI was completed, which worked fine for me on a local cluster at least. |
Test build #47630 has finished for PR 10284 at commit
|
Jenkins, retest this please |
Test build #47631 has finished for PR 10284 at commit
|
@@ -78,7 +88,7 @@ private[deploy] class Master( | |||
private val addressToApp = new HashMap[RpcAddress, ApplicationInfo] | |||
private val completedApps = new ArrayBuffer[ApplicationInfo] | |||
private var nextAppNumber = 0 | |||
private val appIdToUI = new HashMap[String, SparkUI] | |||
private val appIdToUI = new ConcurrentHashMap[String, SparkUI] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't have to change if you move more things under case AttachCompletedRebuildUI
because everything is synchronous there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hesitant to do this because that would mean adding the SparkUI
to a message and sending the object through the RPC layer, and I wasn't sure if it would be copied or serialized there. If the event logs are large then that would be a lot of copying. Is this not a problem since it's only being sent to itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that's a valid point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should at least a comment explaining why this needs to be a ConcurrentHashMap
@BryanCutler thanks for fixing this. Even though the bigger feature will be removed eventually it's only going to happen in future branches, not branch-1.6, which could still use this fix. I think the high level approach is sound. I made some suggestions on simplifying the code a little. |
Thanks for the quick feedback @andrewor14! I simplified things with your suggestions, I'm still looking into removing the |
Test build #47701 has finished for PR 10284 at commit
|
Hi @andrewor14, it looks like the default RPC However, from what I can tell Akka will serialize the message so if someone configured the master RPC to be I'd be happy to take the task of removing all of this in SPARK-12299 since I'm pretty familiar with the code now. That way I can make sure that if the |
LGTM. This is merge-able as is, though it would be best if we document why a |
Test build #47755 has finished for PR 10284 at commit
|
Merging into master and 1.6. Thanks for your work @BryanCutler. Feel free to start work on SPARK-12299 any time to remove all the work done here. :) That one I'll only merge in master. |
… completes This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked and allow new workers to register/remove if the event log history is very large and takes a long time to rebuild. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062. (cherry picked from commit c5b6b39) Signed-off-by: Andrew Or <andrew@databricks.com>
I think the following exception seen in unit test run was related to this PR:
|
@tedyu would you mind pointing me to the jenkins page? |
I've opened #10417 to fix it. |
``` [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in apache#10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but don't actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes apache#10417 from andrewor14/fix-harmless-npe.
``` [info] ReplayListenerSuite: [info] - Simple replay (58 milliseconds) java.lang.NullPointerException at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) ``` https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but don't actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes #10417 from andrewor14/fix-harmless-npe. (cherry picked from commit d655d37) Signed-off-by: Andrew Or <andrew@databricks.com>
This change builds the event history of completed apps asynchronously so the RPC thread will not be blocked and allow new workers to register/remove if the event log history is very large and takes a long time to rebuild.