
[SPARK-2321] Stable pull-based progress / status API #2696

Closed
wants to merge 19 commits

Conversation

JoshRosen (Contributor)

This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see SPARK-2321). For now, I'd like to discuss the basic implementation, API names, and overall interface design. Once we arrive at a good design, I'll go back and add additional methods to expose more information via these APIs.

Design goals:

  • Pull-based API
  • Usable from Java / Scala / Python (eventually, likely with a wrapper)
  • Can be extended to expose more information without introducing binary incompatibilities.
  • Returns immutable objects.
  • Don't leak any implementation details, preserving our freedom to change the implementation.

Implementation:

  • Add public methods (getJobInfo, getStageInfo) to SparkContext to allow status / progress information to be retrieved.
  • Add public interfaces (SparkJobInfo, SparkStageInfo) for our API return values. These interfaces consist entirely of Java-style getter methods. The interfaces are currently implemented in Java. I decided to explicitly separate the interface from its implementation (SparkJobInfoImpl, SparkStageInfoImpl) in order to prevent users from constructing these responses themselves.
  • Allow an existing JobProgressListener to be used when constructing a live SparkUI. This allows us to re-use these listeners in the implementation of this status API. There are a few reasons why this listener re-use makes sense:
    • The status API and web UI are guaranteed to show consistent information.
    • These listeners are already well-tested.
    • The same garbage-collection / information retention configurations can apply to both this API and the web UI.
  • Extend JobProgressListener to maintain jobId -> Job and stageId -> Stage mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext. This helps to keep SparkContext.scala from becoming larger and more difficult to read.
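To illustrate the interface/implementation split described above, here is a rough, hypothetical sketch. The type names mirror the PR (`SparkJobInfo`, `SparkJobInfoImpl`), but the exact fields and method signatures shown here are illustrative, not the real API surface:

```java
// Sketch of the interface/implementation separation: users program
// against the interface; the implementation class is kept non-public
// in the real code so callers cannot construct responses themselves.
interface SparkJobInfo {
  int jobId();
  int[] stageIds();
  String status();
}

final class SparkJobInfoImpl implements SparkJobInfo {
  private final int jobId;
  private final int[] stageIds;
  private final String status;

  SparkJobInfoImpl(int jobId, int[] stageIds, String status) {
    this.jobId = jobId;
    this.stageIds = stageIds.clone();  // defensive copy: the object stays immutable
    this.status = status;
  }

  public int jobId() { return jobId; }
  public int[] stageIds() { return stageIds.clone(); }
  public String status() { return status; }
}

public class StatusApiSketch {
  public static void main(String[] args) {
    SparkJobInfo info = new SparkJobInfoImpl(0, new int[] {0, 1}, "RUNNING");
    System.out.println("job " + info.jobId() + ": " + info.status());
  }
}
```

Because only the interface is public, new getters can be added later without breaking binary compatibility for callers, which matches the extensibility goal above.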

This change means that the listeners will always be registered, even if the web
UI is disabled.  The motivation for this change is to allow these listeners to
be used when implementing a stable pull-based status / progress API for 1.2.
Some design goals here:

- Hide constructors and implementations from users; only expose interfaces.
- Return only immutable objects.
- Ensure API will be usable from Java (e.g. only expose Array collections,
  since they're nicely usable from Scala and Java without having to do any
  implicit conversions on the Scala side or wrapping into Java-friendly
  types on the Java side).
@SparkQA

SparkQA commented Oct 7, 2014

QA tests have started for PR 2696 at commit 08cbec9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 7, 2014

QA tests have finished for PR 2696 at commit 08cbec9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class JobUIData(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21407/

- Java-style getX() method naming doesn't make sense for immutable
  objects that will never have any setters.
- The "consistent snapshot of the entire job -> stage -> task mapping"
  semantics might be very expensive to implement for large jobs, so I've
  decided to remove chaining between SparkJobInfo and SparkStageInfo
  interfaces.  Concretely, this means that you can't write something like

     job.stages()(0).name

  to get the name of the first stage in a job.  Instead, you have to explicitly
  get the stage's ID from the job and then look up that stage using
  sc.getStageInfo().  This isn't to say that we can't implement methods like
  "getNumActiveStages" that reflect consistent state; the goal is mainly to
  avoid spending lots of time / memory to construct huge object graphs.
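The two-step lookup described above can be sketched with a toy stand-in for the tracker. The `getJobStageIds`/`getStageName` helpers here are hypothetical simplifications of the PR's `getJobInfo`/`getStageInfo` (which return info objects, not raw values), and the backing maps stand in for the listener state:

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in illustrating the lookup pattern without
// chaining: fetch stage ids from the job, then look up each stage
// individually. This is not Spark code; the maps mimic listener state.
public class TwoStepLookup {
  static final Map<Integer, int[]> jobToStageIds = new HashMap<>();
  static final Map<Integer, String> stageNames = new HashMap<>();

  static int[] getJobStageIds(int jobId) { return jobToStageIds.get(jobId); }
  static String getStageName(int stageId) { return stageNames.get(stageId); }

  public static void main(String[] args) {
    jobToStageIds.put(0, new int[] {10, 11});
    stageNames.put(10, "map at Demo.java:5");
    stageNames.put(11, "collect at Demo.java:6");

    // No job.stages()(0).name chaining: a stage id returned by the job
    // may have no stage info yet, so each lookup needs a null check.
    for (int stageId : getJobStageIds(0)) {
      String name = getStageName(stageId);
      if (name != null) {
        System.out.println("stage " + stageId + ": " + name);
      }
    }
  }
}
```

The null check is the price of dropping the "consistent snapshot" semantics: the job and stage views may briefly disagree, but no large object graph ever has to be materialized.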
This is a very rough WIP example; things should
be considerably simpler once I port AsyncRDDActions to Java.
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21504/

@@ -132,6 +132,12 @@ class JavaSparkContext(val sc: SparkContext)
/** Default min number of partitions for Hadoop RDDs when not given by user */
def defaultMinPartitions: java.lang.Integer = sc.defaultMinPartitions

def getJobsIdsForGroup(jobGroup: String): Array[Int] = sc.getJobsIdsForGroup(jobGroup)

def getJobInfo(jobId: Int): SparkJobInfo = sc.getJobInfo(jobId).orNull
Contributor
add java doc to these 3 methods and explain that this returns null when jobId doesn't exist.

Contributor Author

I suppose that I could use a Guava Optional here, although we were bitten in the past by exposing this third-party type in our APIs. Returning null would be inconsistent with the use of Optional in other parts of the Java API (such as JavaSparkContext.getCheckpointDir()). On the other hand, we may be stuck with Optional since other code exposes it.

What do you think we should do here?

Contributor

null is fine - and definitely don't depend on the Optional type in Guava. I was merely suggesting adding more documentation about this API.

@vanzin (Contributor)

vanzin commented Oct 14, 2014

Hi @JoshRosen,

I think this is going in the right direction and is definitely an improvement over the status quo. I left some comments about semantics of certain parts of the API and some general suggestions.

One thing is that this PR seems to cover half of SPARK-2321 - the part about monitoring, but not the listener API. Perhaps that bug should be broken into separate sub-tasks?

@JoshRosen (Contributor Author)

@vanzin Yeah, maybe we should split SPARK-2321; this PR is only concerned with progress monitoring; I'm not proposing any stabilization of the listener APIs here.

- Use enumeration for job status.
- Remove StatusAPIImpl class; implement methods directly in SparkContext.
- Fix typo in getJobIdsForGroup method name.
- Add Javadoc and comment on nullability.
- Rename numCompleteTasks -> numCompletedTasks.
- Perform more efficient filtering in jobIdsForGroup
@SparkQA

SparkQA commented Oct 15, 2014

QA tests have started for PR 2696 at commit 249ca16.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 15, 2014

QA tests have finished for PR 2696 at commit 249ca16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21755/


package org.apache.spark;

public enum ExecutionStatus {
Contributor

This is really a JobExecutionStatus

@JoshRosen JoshRosen changed the title [SPARK-2321] [WIP] Stable pull-based progress / status API [SPARK-2321] Stable pull-based progress / status API Oct 21, 2014
@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2696 at commit 2707f98.

  • This patch merges cleanly.

@JoshRosen (Contributor Author)

Alright, I've updated this based on the latest round of feedback. I've removed the WIP, since I think this is enough for a basic first-cut at this API; we can always expose more things via it in later commits.

@SparkQA

SparkQA commented Oct 22, 2014

QA tests have finished for PR 2696 at commit 2707f98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging
    • class JobUIData(
    • public final class JavaStatusAPIDemo
    • public static final class IdentityWithDelay<T> implements Function<T, T>

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22013/


package org.apache.spark;

public enum JobExecutionStatus {
Contributor

Feels like it's missing KILLED, although I'm not sure whether it's possible to differentiate that from FAILED in the backend at the moment.

Contributor Author

I don't think that we have backend support for this, at least not at the JobProgressListener layer.

@vanzin (Contributor)

vanzin commented Oct 22, 2014

@JoshRosen lgtm; I don't really like the listener changes in SparkUI (aside from the job one, which is necessary), but I can't really think of a much better approach (other than the previous one, which was just ugly :-)).

Also, what about unit tests for the new APIs? I don't see any...

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22171 has started for PR 2696 at commit e6aa78d.

  • This patch merges cleanly.

@JoshRosen (Contributor Author)

I added a couple of simple tests. One subtlety here is the fact that this API is sort of "eventually consistent" in the sense that we might know about a job before we have any information on its stages. Offering stronger guarantees, like "once you have a JobInfo for a job that's running, you should always be able to get a StageInfo", may require significant scheduler / listener refactoring.

@SparkQA

SparkQA commented Oct 24, 2014

Test build #22171 has finished for PR 2696 at commit e6aa78d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SparkContext(config: SparkConf) extends SparkStatusAPI with Logging
    • class JobUIData(
    • public final class JavaStatusAPIDemo
    • public static final class IdentityWithDelay<T> implements Function<T, T>

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22171/

@vanzin (Contributor)

vanzin commented Oct 24, 2014

LGTM. Thanks for adding the tests!

@JoshRosen (Contributor Author)

Thanks for all of the review feedback! I'm going to merge this now so that we can begin using it in other PRs and features, such as progress bars in the Spark shells.

@asfgit asfgit closed this in 9530316 Oct 25, 2014
asfgit pushed a commit that referenced this pull request Nov 15, 2014
This PR refactors / extends the status API introduced in #2696.

- Change StatusAPI from a mixin trait to a class.  Before, the new status API methods were directly accessible through SparkContext, whereas now they're accessed through a `sc.statusAPI` field.  As long as we were going to add these methods directly to SparkContext, the mixin trait seemed like a good idea, but this might be simpler to reason about and may avoid pitfalls that I've run into while attempting to refactor other parts of SparkContext to use mixins (see #3071, for example).
- Change the name from SparkStatusAPI to SparkStatusTracker.
- Make `getJobIdsForGroup(null)` return ids for jobs that aren't associated with any job group.
- Add `getActiveStageIds()` and `getActiveJobIds()` methods that return the ids of whatever's currently active in this SparkContext.  This should simplify davies's progress bar code.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3197 from JoshRosen/progress-api-improvements and squashes the following commits:

30b0afa [Josh Rosen] Rename SparkStatusAPI to SparkStatusTracker.
d1b08d8 [Josh Rosen] Add missing newlines
2cc7353 [Josh Rosen] Add missing file.
d5eab1f [Josh Rosen] Add getActive[Stage|Job]Ids() methods.
a227984 [Josh Rosen] getJobIdsForGroup(null) should return jobs for default group
c47e294 [Josh Rosen] Remove StatusAPI mixin trait.

(cherry picked from commit 40eb8b6)
Signed-off-by: Reynold Xin <rxin@databricks.com>
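The polling pattern that `getActiveJobIds()` enables for a progress bar can be sketched as follows. `getActiveJobIds()` follows the method name in the commit message above, but the `StatusTracker` interface and the simulated backing data here are stand-ins, not the real `sc.statusTracker`:

```java
// Minimal sketch of a progress-bar-style polling loop against a
// tracker. The tracker here is a functional-interface stub; a real
// caller would poll sc.statusTracker on a live SparkContext.
public class PollingSketch {
  interface StatusTracker {
    int[] getActiveJobIds();
  }

  // Poll until no jobs remain active; returns how many polls saw
  // activity. A real progress bar would sleep and redraw between polls.
  static int pollUntilIdle(StatusTracker tracker) {
    int polls = 0;
    while (true) {
      int[] active = tracker.getActiveJobIds();
      if (active.length == 0) {
        break;
      }
      System.out.println("active jobs: " + active.length);
      polls++;
    }
    return polls;
  }

  public static void main(String[] args) {
    // Simulate a context whose single job stays active for three polls.
    int[] remaining = {3};
    StatusTracker fake = () -> remaining[0]-- > 0 ? new int[] {1} : new int[0];
    System.out.println("polled " + pollUntilIdle(fake) + " times");  // prints "polled 3 times"
  }
}
```

Because the API is pull-based, the caller controls the polling cadence; nothing in the tracker blocks or pushes callbacks.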
7 participants