[BEAM-313] Enable the use of an existing spark context with the SparkPipelineRunner #401
Conversation
*/
@Description("Set to true if the spark runner will be "
    + "initialized with an existing Spark Context")
boolean isProvidedJavaSparkContext();
I wouldn't create this as a boolean option, but infer it from the presence of the JavaSparkContext when the user passes one: if he passes the context, it is because he wants to use it. And I would put the JavaSparkContext as an option in the SparkPipelineOptions object.
Why not just create an extension of SparkPipelineOptions that injects the SparkContext?
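A rough sketch of what such an extended options interface might look like. The accessor names follow the SparkContextOptions API that appears in the Scala example later in this thread; the @JsonIgnore annotation and the exact package locations are assumptions, not the merged implementation.

import com.fasterxml.jackson.annotation.JsonIgnore;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.options.Description;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical extension of SparkPipelineOptions that carries a user-provided context.
public interface ProvidedContextOptions extends SparkPipelineOptions {

  @Description("A user-provided, externally managed JavaSparkContext")
  @JsonIgnore // assumption: keeps the non-serializable context out of any serialized form of the options
  JavaSparkContext getProvidedSparkContext();
  void setProvidedSparkContext(JavaSparkContext jsc);

  @Description("Set to true to run the pipeline on the provided Spark context "
      + "instead of creating a new one")
  boolean getUsesProvidedSparkContext();
  void setUsesProvidedSparkContext(boolean value);
}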
R: @jbonofre
I thought that there could be issues with serialization, but I crossed my fingers it was not the case, mmmm... this can be trickier, then.
@amarouni good point.
I have to be honest, passing the SparkContext seems a bit hacky to me, since I think the Spark Runner should have the responsibility for the runtime of its SparkContext(s). Another option would be to change the runner so it instantiates n contexts (1 by default) and reuses them if needed. What do you think about this other approach, guys?
The Spark runner is "executed" for each pipeline "execution". If you create three different pipelines (in a main method), then you will have three different contexts. It will then be hard to manage the contexts and pass from one execution to the other.
@iemejia The Spark context should be seen as an additional configuration element for the Spark runner. This opens up the Spark Runner for more real-world use cases (interactive Spark shell with Beam, Spark job server, ...). Spark context pooling is hard to get right since we're limited to one Spark context per JVM.
@amarouni I agree, it makes more sense to me. Now, the main use case is that Beam bootstraps the Spark job. But we can extend this.
@amarouni this seems like a really interesting use case, but I have a few questions: Why not just add this to the existing SparkPipelineOptions instead of introducing more PipelineOptions implementations? Did you check how this behaves with SparkSession in 2.0? Because we're focusing our efforts towards 2.0 support to improve streaming support with it, I'm not sure if this will still "fly". I still didn't get the chance to deep dive in there, but it's worth understanding before we make this kind of a change.
I understand that the runner is stateless, but if I understood well, one of the advantages of passing the SparkContext is to make the initialization shorter (like spark-job-server does). Of course I know that the pooling part is way more complex, but if the goal is to reduce the initialization cost, it is one possible way to achieve it; however, we lose the flexibility of all the magic things you can do with the context before you inject it, but this would resolve the serialization issue. The other approach, the configuration (via the SparkRunner or the SparkOptions), is runner specific; my idea was more along the lines of creating a somewhat generic option to reuse pools of contexts in a more general and 'runner agnostic' way, with a parameter like '--contexts 10'. But well, maybe this does not make sense in other runners.
@amitsela Do you have some 'public' branch for the ongoing work on the Spark 2.0 integration that we can check/test?
@iemejia of course: we are working here: https://github.com/amitsela/incubator-beam/tree/spark-runner-datasets
@jbonofre this is the "cleaner" version, which will become the PR - https://github.com/amitsela/incubator-beam/commits/BEAM-198
It's a bit stuck because of some serialization issues, but it seems like it's going to resolve in a few days, and I plan to get back to it next week. But to the point of this PR - I didn't ask why the context is not part of the options, but rather why have more PipelineOptions implementations? If this is a feature that benefits the runner, it should be (a boolean) option in the SparkPipelineOptions. About optimising on initialisation - I don't see a great benefit; it's a bootstrap, so it'll take another second... This could be really cool, though I think this is a bit more complex...
@amitsela I didn't get the chance to test it with Spark 2.0 sessions, but will do so as soon as I get some time. But is there any Beam Spark 2.0 branch?
No Spark-runner 2.0 branch, but there is no Spark 2.0 release yet; there is work in progress against Spark 1.6, as I mentioned above. Considering the use case, I don't think that optimising on context initialization overhead matters since it's a bootstrap - but please correct me if I'm missing something.
@amitsela Our core use-case is Beam on Spark Job Server (https://github.com/spark-jobserver/spark-jobserver). SJS creates/pools contexts; the contexts can also be set up with all necessary library dependencies and provided to something like Beam. That's how we got the idea of a SparkRunner with a provided context, and from there we saw the Spark shell use case.
R: @amitsela
EvaluationResult res = SparkRunner.create(options, jsc).run(p);
res.close();
jsc.stop();
This should be implicit, probably to be handled by res.close() internally.
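As a hypothetical illustration of that suggestion (the class and the ownsContext flag are made-up names, not Beam's actual internals), close() could stop the context only when the runner created it, leaving a user-provided context to its external owner:

import org.apache.spark.api.java.JavaSparkContext;

// Illustrative only: an evaluation result that stops the context on close()
// without tearing down a context that was provided (and is owned) by the caller.
class EvaluationResultSketch implements AutoCloseable {
  private final JavaSparkContext jsc;
  private final boolean ownsContext; // false when the context was provided by the user

  EvaluationResultSketch(JavaSparkContext jsc, boolean ownsContext) {
    this.jsc = jsc;
    this.ownsContext = ownsContext;
  }

  @Override
  public void close() {
    if (ownsContext) {
      jsc.stop(); // runner-created context: safe to stop here
    }
    // provided context: its external owner (e.g. a job server) decides when to stop it
  }
}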
@amarouni I owe you an apology... This somehow slipped by me until I got a kind reminder from a friend.
We just discussed with Amit about having two different Spark runners (one for Spark 1.x, another one for Spark 2.x). So, this PR could be merged on the Spark 1.x runner.
@amitsela @jbonofre I think that the SparkRunner should just use the context as provided at the moment the Runner is created. The context stays external to the runner along with its state (running, stopped, ...) (we can add a check that the provided context has not already been stopped).
@amarouni sounds good to me, care to rebase and I'll give it a final look?
@amitsela Here's what it's going to look like (commit 11).
@@ -195,6 +195,11 @@ public EvaluationResult run(Pipeline pipeline) {
    JavaSparkContext jsc;
    if (mOptions.isProvidedJavaSparkContext() && this.providedJavaSparkContext != null) {
      LOG.info("Using a provided Spark Java Context.");
      if (this.providedJavaSparkContext.sc().isStopped()){
        LOG.error("The provided Spark context "
          + this.providedJavaSparkContext + " is stopped");
4 space indent.
One thing I'm concerned about is portability - we're trying to avoid ...
@amarouni Thanks, I'm gonna take a look.
@@ -44,4 +47,18 @@
@Default.Long(1000)
Long getBatchIntervalMillis();
void setBatchIntervalMillis(Long batchInterval);

@Override
@Default.String("spark dataflow pipeline job")
This was removed a while ago since it is inherited from ApplicationNameOptions.
@amarouni I added some comments; I think some merges of older versions found their way back into your PR, I've commented on that.
LGTM. Waiting for tests to pass. Thanks!
FYI, there's an issue on the OSX machine for Travis CI. As Jenkins passed, I think we are good to go. LGTM
@amarouni note for future contributions: it's better to keep your commits at the head of the log using git rebase. It's much easier to squash.
@jbonofre will do.
No problem, thanks for your contribution!
Thanks for the rebase and squash. Testing/merging now.
@amitsela does this mean Beam now works with the Spark job server? The Beam documentation even mentions the Spark job server, but it's not clear to me how it would work.
@kohlerm One use of this is to improve the latency of starting Spark jobs by 'pre-heating' a SparkContext with the required environment and reusing these contexts on job executions at SJS (reusing contexts is something you should do with caution).
Thanks for the quick replies!
Here's an old Scala snippet that shows how to use Beam & SJS. It's based on old versions of SJS & Beam, so it probably won't compile with newer SJS/Beam versions, but you'll get the idea:
import com.typesafe.config.Config
import org.apache.beam.runners.spark.{ SparkContextOptions, SparkRunner }
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.StringUtf8Coder
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import spark.jobserver.{ SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation }
import scala.collection.JavaConversions
import scala.util.Try
/**
* Beam wordcount test. Returns the word count of a fixed String seq.
*/
object BeamWordCount extends SparkJob {
override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
Try(config.getStringList("wordList"))
.map(x => SparkJobValid)
.getOrElse(SparkJobInvalid("No wordList in context config"))
}
override def runJob(sc: SparkContext, jobConfig: Config): Any = {
// Input test list
val inputBuffer = scala.collection.JavaConversions.asScalaBuffer(jobConfig.getStringList("wordList"))
val WORDS = inputBuffer.toList
// Pipeline options
val sparkPipelineOptions = PipelineOptionsFactory.as(classOf[SparkContextOptions])
sparkPipelineOptions.setAppName("Beam WordCount test")
sparkPipelineOptions.setRunner(classOf[SparkRunner])
sparkPipelineOptions.setUsesProvidedSparkContext(true)
sparkPipelineOptions.setProvidedSparkContext(new JavaSparkContext(sc))
// Pipeline
val pipeline = Pipeline.create(sparkPipelineOptions)
// Input + processing + Output
val output = pipeline
.apply(Create.of(JavaConversions.seqAsJavaList(WORDS)).withCoder(StringUtf8Coder.of()))
.apply(new CountWords()) // CountWords: assumed to be the composite transform from the Beam WordCount example
// Result
// val result: EvaluationResult = pipeline.run().asInstanceOf[EvaluationResult]
// Run job & wait until finish
pipeline.run().waitUntilFinish()
}
}
It'd be nice to contribute this as new documentation if you manage to get it to work.
Thanks a lot! Of course, if we get it to work I will try to document it. I think this is a pretty common use case.
https://gist.github.com/kohlerm/649f92195e71697a1931f81def176ec9 worked for us.
The general use case is that the SparkPipelineRunner creates its own Spark context and uses it for the pipeline execution.
Another alternative is to provide the SparkPipelineRunner with an existing Spark context. This can be interesting for a lot of use cases where the Spark context is managed outside of Beam (context reuse, advanced context management, Spark job server, ...).
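For reference, a minimal Java sketch of this provided-context flow, assuming the SparkContextOptions accessors shown in the Scala snippet above; the local master URL, app name, and class name are placeholders.

import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.api.java.JavaSparkContext;

public class ProvidedContextPipeline {
  public static void main(String[] args) {
    // A context managed outside of Beam, e.g. created by a job server or a Spark shell.
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "provided-context-example");

    // Point the runner at the existing context instead of letting it create one.
    SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
    options.setRunner(SparkRunner.class);
    options.setUsesProvidedSparkContext(true);
    options.setProvidedSparkContext(jsc);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply transforms to the pipeline here ...
    pipeline.run().waitUntilFinish();

    // The context stays external to the runner, so the caller decides when to stop it.
    jsc.stop();
  }
}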