[BEAM-313] Enable the use of an existing spark context with the SparkPipelineRunner #401

Merged: 1 commit into apache:master on Aug 29, 2016

Conversation

@amarouni (Contributor)

In the general case, the SparkPipelineRunner creates its own Spark context and uses it for pipeline execution.
An alternative is to provide the SparkPipelineRunner with an existing Spark context. This is useful whenever the Spark context is managed outside of Beam (context reuse, advanced context management, a Spark job server, ...).
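To make the provided-context mode concrete, here is a rough usage sketch in Java. It is assembled from code quoted later in this thread: the `SparkRunner.create(options, jsc)` entry point appears in the reviewed diff below, while the surrounding setup is illustrative and not necessarily the final merged API.

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ProvidedContextUsageSketch {
  public static void main(String[] args) {
    // A Spark context created and managed outside of Beam
    // (for example by a Spark job server, a shell, or a pooling layer).
    JavaSparkContext jsc =
        new JavaSparkContext(new SparkConf().setMaster("local[2]").setAppName("beam-provided-context"));

    SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    // ... apply the pipeline's transforms here ...

    // Entry point as it appears in the diff reviewed in this PR: the runner executes
    // the pipeline on the provided context instead of creating its own.
    SparkRunner.create(options, jsc).run(p);

    // The caller keeps ownership of the context's lifecycle.
    jsc.stop();
  }
}
```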

*/
@Description("Set to true if the spark runner will be "
+ "initialized with an existing Spark Context")
boolean isProvidedJavaSparkContext();
@iemejia (Member) commented May 31, 2016

I wouldn't create this as a boolean option, but rather infer it from the presence of a JavaSparkContext if the user passes one; if they pass a context, it's because they want it used. I would also put the JavaSparkContext as an option in the SparkPipelineOptions object (a rough sketch follows below).
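Roughly, the inference being suggested could look like this sketch; the helper and its parameters are hypothetical, only the "provided context present means use it" idea comes from the comment above.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical helper illustrating the "infer from presence" idea:
// no boolean flag, just check whether a user-supplied context was passed.
class ContextResolutionSketch {
  static JavaSparkContext resolveContext(JavaSparkContext provided, String master, String appName) {
    if (provided != null) {
      // The user passed a context, so they want it used; reuse it as-is.
      return provided;
    }
    // Otherwise keep the current behaviour: the runner creates and owns its own context.
    return new JavaSparkContext(new SparkConf().setMaster(master).setAppName(appName));
  }
}
```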

@jbonofre (Member)

Why not just create an extension of SparkPipelineOptions that injects the SparkContext?

@jbonofre (Member)

R: @jbonofre

@amarouni (Contributor, Author)

@iemejia @jbonofre
I tried extending SparkPipelineOptions by adding get/setSparkJavaContext(), but I kept getting PipelineOptions serialization issues; that's why I decided to store the Spark context in the runner rather than in the PipelineOptions.
Maybe you have another workaround.

@iemejia (Member) commented May 31, 2016

I thought there could be serialization issues, but I had my fingers crossed that it wasn't the case... this can be trickier, then.

@jbonofre (Member)

@amarouni good point.

@iemejia (Member) commented May 31, 2016

To be honest, passing the SparkContext seems a bit hacky to me, since I think the Spark runner should be responsible for the lifecycle of its SparkContext(s). Another option would be to change the runner so that it instantiates n contexts (1 by default) and reuses them if needed. What do you think about this other approach?
Of course, I assume we want to do this to reduce context initialization time (for example).

@jbonofre (Member)

The Spark runner is "executed" for each pipeline "execution". If you create three different pipelines (in a main method), you will have three different contexts, and it becomes hard to manage the contexts and pass them from one execution to the other.
The runner is basically stateless.
Even if it feels "weird", context injection is probably easier to implement; "context pooling" is tricky to implement.

@amarouni (Contributor, Author)

@iemejia The Spark context should be seen as an additional configuration element for the Spark runner. This opens the Spark runner up to more real-world use cases (an interactive Spark shell with Beam, a Spark job server, ...). Spark context pooling is hard to get right since we're limited to one Spark context per JVM.

@jbonofre (Member)

@amarouni I agree, that makes more sense to me. For now, the main use case is that Beam bootstraps the Spark job, but we can extend this.

@amitsela (Member)

@amarouni this seems like a really interesting use case, but I have a few questions:

Why not just add isProvidedContext/setProvidedContext to SparkPipelineOptions? If true, you would expect the context to be provided.

Did you check how this behaves with SparkSession in 2.0? We're focusing our efforts on 2.0 support to improve streaming support with it, so I'm not sure this will still "fly". I haven't had the chance to dive deep there yet, but it's worth understanding before we make this kind of change.
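For reference, such a boolean flag would follow the usual PipelineOptions getter/setter convention, roughly as in this sketch. In the suggestion above the two methods would live directly in SparkPipelineOptions; a separate interface is used here only to keep the sketch self-contained, and the names mirror the comment rather than any merged API.

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Sketch of the proposed boolean flag as a plain pipeline option (illustrative names).
public interface ProvidedContextFlagSketch extends PipelineOptions {
  @Description("Set to true if the Spark runner will be initialized with an existing Spark context")
  @Default.Boolean(false)
  boolean isProvidedContext();
  void setProvidedContext(boolean providedContext);
}
```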

@jbonofre (Member)

@amitsela I also proposed to "inject" the context into the pipeline options, but it seems @amarouni got a serialization exception; that's why he went the other way around.

Good point about the 2.0 SparkSession, as it's the target for future releases.

@iemejia (Member) commented May 31, 2016

I understand that the runner is stateless, but if I understood correctly, one of the advantages of passing the SparkContext is to shorten initialization (as spark-jobserver does). I know the pooling part is much more complex, but if the goal is to reduce initialization cost, it is one possible way to achieve it. We would lose the flexibility of all the magic things you can do with the context before injecting it, but it would resolve the serialization issue.

The other approach, configuration (via the SparkRunner or the SparkOptions), is runner specific. My idea was more along the lines of a somewhat generic option to reuse pools of contexts in a more general, 'runner agnostic' way, with a parameter like '--contexts 10'. But maybe this doesn't make sense for other runners.

@iemejia (Member) commented May 31, 2016

@amitsela Do you have a 'public' branch for the ongoing Spark 2.0 integration work that we can check/test?

@jbonofre (Member)

@amitsela (Member)

@jbonofre this is the "cleaner" version, which will become the PR: https://github.com/amitsela/incubator-beam/commits/BEAM-198

@amitsela (Member)

It's a bit stuck because of some serialization issues, but it looks like they will be resolved in a few days, and I plan to get back to it next week.

But to the point of this PR: I didn't ask why the context is not part of the options, but rather why have more PipelineOptions implementations? If this is a feature that benefits the runner, it should be a (boolean) option in SparkPipelineOptions.

About optimizing initialization: I don't see a great benefit; it's a bootstrap, so it'll just take another second.
What caught my eye was the spark-shell 😲

That could be really cool, though I think it's a bit more complex...

@amarouni (Contributor, Author) commented Jun 1, 2016

@amitsela I haven't had the chance to test with Spark 2.0 sessions yet, but I will as soon as I get some time. Is there a Beam Spark 2.0 branch?
Concerning the design, I agree that I should have added the option to SparkPipelineOptions; I'll remove the unnecessary classes/interfaces and stick to one additional option in SparkPipelineOptions.

@amitsela (Member) commented Jun 1, 2016

There is no Spark runner 2.0 branch, and there is no Spark 2.0 release yet, but there is work in progress against Spark 1.6, as I mentioned above.

Considering the use case, I don't think optimizing away the context initialization overhead matters much since it's a bootstrap, but please correct me if I'm missing something.
As for the spark-shell... how were you thinking of doing something like that? It sounds interesting!

@amarouni (Contributor, Author) commented Jun 1, 2016

@amitsela Our core use case is Beam on Spark Job Server (https://github.com/spark-jobserver/spark-jobserver). SJS creates and pools contexts; the contexts can also be set up with all the necessary library dependencies and then provided to something like Beam. That's how we got the idea of a SparkRunner with a provided context, and from there we saw the spark-shell use case.

@dhalperi (Contributor)

R: @amitsela


EvaluationResult res = SparkRunner.create(options, jsc).run(p);
res.close();
jsc.stop();
Review comment (Member):
This should be implicit, probably to be handled by res.close() internally.

@amitsela (Member) commented Aug 2, 2016

@amarouni I owe you an apology: this somehow slipped by me until I got a kind reminder from a friend.
I've added a note about closing the context.
The case for using the spark-shell is very interesting, especially with Scio coming, but when running in a shell the context should not be stopped, right? If so, how do you propose we handle a provided context? We know we're using a provided context, but how do we know whether it comes from the job server, a shell, or something else?

@jbonofre (Member) commented Aug 2, 2016

We just discussed with Amit having two different Spark runners (one for Spark 1.x, another for Spark 2.x). So this PR could be merged into the Spark 1.x runner.

@amarouni (Contributor, Author)

@amitsela @jbonofre
Thanks Amit, and no problem!
I just came back from vacation, so sorry for the late response.

I think the SparkRunner should simply use the context as provided at the moment the runner is created. The context stays external to the runner, along with its state (running, stopped, ...); we can add an isStopped check just before the execution logic to make sure the context is up and running, and fail the execution otherwise (see the sketch below). This way the SparkRunner doesn't have to deal with the context's state, and it doesn't matter whether the context comes from SJS or a shell.
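A rough sketch of that guard, assuming the runner holds the provided JavaSparkContext. Failing by throwing is an assumption here; the diff excerpt quoted just below only logs an error at that point.

```java
import org.apache.spark.api.java.JavaSparkContext;

// Hypothetical guard: reuse the provided context only if it is still alive,
// otherwise fail the execution up front instead of letting the job die later.
class ProvidedContextGuardSketch {
  static JavaSparkContext checkProvidedContext(JavaSparkContext provided) {
    if (provided == null || provided.sc().isStopped()) {
      // Failing fast is an assumption; the reviewed diff only logs an error here.
      throw new IllegalStateException("The provided Spark context is missing or already stopped");
    }
    return provided;
  }
}
```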

@amitsela (Member)

@amarouni sounds good to me, care to rebase so I can give it a final look?
Thanks!

@amarouni (Contributor, Author)

@amitsela Here's what it's going to look like (commit 11).
I messed up my branch; I'll try to push a cleaner version tomorrow.

@@ -195,6 +195,11 @@ public EvaluationResult run(Pipeline pipeline) {
JavaSparkContext jsc;
if (mOptions.isProvidedJavaSparkContext() && this.providedJavaSparkContext != null) {
LOG.info("Using a provided Spark Java Context.");
if (this.providedJavaSparkContext.sc().isStopped()){
LOG.error("The provided Spark context "
+ this.providedJavaSparkContext + " is stopped");
Review comment (Member):
4 space indent.

@amitsela (Member)

One thing I'm concerned about is portability: we're trying to avoid SparkRunner.create().run() in the pipeline and to use Pipeline.run() instead, which creates the PipelineRunner from the provided options.
This could prove tricky.
It's worth a shot to see if SparkPipelineOptions can accommodate the provided context, but please note it should be annotated with @JsonIgnore.
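For reference, a minimal sketch of what such an options extension could look like. It mirrors the shape used in the Scala example further down this thread (setUsesProvidedSparkContext / setProvidedSparkContext); the Jackson @JsonIgnore annotation keeps the live JavaSparkContext out of the JSON serialization of the options, which is what caused the exceptions reported earlier. Treat it as illustrative rather than the exact merged interface.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative options extension carrying a provided context.
public interface SparkContextOptionsSketch extends SparkPipelineOptions {

  @Description("If true, the runner uses an externally provided SparkContext")
  @Default.Boolean(false)
  boolean getUsesProvidedSparkContext();
  void setUsesProvidedSparkContext(boolean usesProvidedSparkContext);

  @Description("The provided JavaSparkContext; excluded from options serialization")
  @JsonIgnore
  JavaSparkContext getProvidedSparkContext();
  void setProvidedSparkContext(JavaSparkContext jsc);
}
```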

@amarouni (Contributor, Author)

@amitsela @jbonofre Done, I've pushed (--force).

@jbonofre (Member)

@amarouni Thanks, I'm going to take a look.

@@ -44,4 +47,18 @@
@Default.Long(1000)
Long getBatchIntervalMillis();
void setBatchIntervalMillis(Long batchInterval);

@Override
@Default.String("spark dataflow pipeline job")
Review comment (Member):
This was removed a while ago since it is inherited from ApplicationNameOptions

@amitsela (Member)

@amarouni I added some comments; I think some merges of older versions found their way back into your PR, and I've commented on that.
Besides my comments, it looks fine, and the tests provide good coverage!
Thanks!

@amitsela (Member)

LGTM. Waiting for tests to pass. Thanks!

@jbonofre (Member)

FYI, there's an issue with the OS X machine on Travis CI. Since Jenkins passed, I think we are good to go.

LGTM

@jbonofre (Member)

@amarouni A note for future contributions: it's better to keep your commits at the head of the log using git rebase; it makes squashing much easier.

@amarouni (Contributor, Author)

@jbonofre Will do.
In this case a merge was easier since I had different versions and the PR had been open for a long time.
In any case, thanks for accepting it; I'm looking forward to making more contributions.

@jbonofre (Member)

No problem, thanks for your contribution!

@jbonofre (Member)

Thanks for the rebase and squash. Testing/merging now.

asfgit merged commit 017da7b into apache:master on Aug 29, 2016
asfgit pushed a commit that referenced this pull request on Aug 29, 2016
amarouni deleted the mycbeam313 branch on August 30, 2016 at 13:25
dhalperi added a commit to dhalperi/beam that referenced this pull request on Sep 27, 2016
tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request on May 15, 2018
@kohlerm commented Aug 23, 2018

@amitsela Does this mean Beam now works with the Spark job server? The Beam documentation even mentions the Spark job server, but it's not clear to me how it would work.

@amitsela (Member)

This PR (at the time) made it possible to pass an existing SparkContext to the Spark runner (instead of the runner creating one exclusively), thus enabling integration with the Spark job server.
cc: @jbonofre @iemejia

@iemejia (Member) commented Aug 23, 2018

@kohlerm One use of this is to improve the start-up latency of Spark jobs by 'pre-heating' a SparkContext with the required environment and reusing these contexts across job executions in SJS (reusing contexts is something you should do with caution).

@kohlerm commented Aug 27, 2018

Thanks for the quick replies!
@iemejia @jbonofre @amitsela
All I want to do is run Beam on top of Spark and use SJS to start it, because the Spark cluster I have access to only allows starting jobs through SJS.
Is there a code (Java) snippet somewhere that shows how to do this (if it's possible at all)?
Sorry for my ignorance; I'm neither a Spark nor a Beam expert yet.

@amarouni (Contributor, Author)

@kohlerm

Here's an old Scala snippet that shows how to use Beam with SJS. It's based on old versions of SJS and Beam, so it probably won't compile against newer SJS/Beam versions, but you'll get the idea:

import com.typesafe.config.Config
import org.apache.beam.runners.spark.{ SparkContextOptions, SparkRunner }
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.coders.StringUtf8Coder
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.spark.SparkContext
import org.apache.spark.api.java.JavaSparkContext
import spark.jobserver.{ SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation }

import scala.collection.JavaConversions
import scala.util.Try

/**
 * Beam wordcount test. Returns the word count of a fixed String seq.
 */
object BeamWordCount extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {

    Try(config.getStringList("wordList"))
      .map(x => SparkJobValid)
      .getOrElse(SparkJobInvalid("No wordList in context config"))
  }

  override def runJob(sc: SparkContext, jobConfig: Config): Any = {

    // Input test list
    val inputBuffer = scala.collection.JavaConversions.asScalaBuffer(jobConfig.getStringList("wordList"))
    val WORDS = inputBuffer.toList

    // Pipeline options
    val sparkPipelineOptions = PipelineOptionsFactory.as(classOf[SparkContextOptions])
    sparkPipelineOptions.setAppName("Beam WordCount test")
    sparkPipelineOptions.setRunner(classOf[SparkRunner])
    sparkPipelineOptions.setUsesProvidedSparkContext(true)
    sparkPipelineOptions.setProvidedSparkContext(new JavaSparkContext(sc))

    // Pipeline
    val pipeline = Pipeline.create(sparkPipelineOptions)

    // Input + processing + Output
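    // CountWords is a composite word-counting transform that is not defined in this snippet
    // (e.g. the CountWords transform from the Beam WordCount example).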
    val output = pipeline
      .apply(Create.of(JavaConversions.seqAsJavaList(WORDS)).withCoder(StringUtf8Coder.of()))
      .apply(new CountWords())

    // Result
    // val result: EvaluationResult = pipeline.run().asInstanceOf[EvaluationResult]

    // Run job & wait until finish
    pipeline.run().waitUntilFinish()
  }
}

It'd be nice to contribute this as new documentation if you manage to get it to work.

@kohlerm commented Aug 27, 2018

Thanks a lot! Of course, if we get it to work I'll try to document it. I think this is a pretty common use case.

@kohlerm commented Aug 31, 2018

https://gist.github.com/kohlerm/649f92195e71697a1931f81def176ec9 worked for us.
Thanks for the hint!

pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023