Connecting a REPL - and why it's not a problem not to be able to run arbitrary code #4
Hey Chris - I would love to hear if you have some thoughts on this, since I'm starting to work a lot with Sparkling and am in the process of figuring out a good workflow. Thanks!
Hi, ok, since you're asking, I'm just writing my answer here; maybe it'll make it into some documentation afterwards. I see three scenarios for using a REPL with Apache Spark:

(1) for REPL-driven development, including (small-scale) interactive data exploration,
(2) for debugging, and
(3) for interactive data exploration on the cluster.

These scenarios differ not only in the 'why', but also in the 'what'.

(1) requires complete freedom and very fast answers. To me, a workflow is broken if I need to wait seconds for a REPL answer. It's the same as in unit testing: it has to be fast. Thus, I usually try to minimize the amount of data processed, and I delay touching Spark as long as I can. This is exactly the reason why I made broadcasts dereferencable with '@': I can use anything dereferencable instead of a real broadcast during tests. So instead of (get (sparkling.broadcast/value my-broadcast) my-id) I can use (@my-broadcast my-id) in the functions of my Spark app and test them using atoms instead of firing up a complete local Spark context. Wherever I need to, I create a local Spark context with minimal data, e.g. by parallelize or parallelize-pair on known data, or by sampling (sparkling.core/sample) given data from my data nodes. Now, most importantly: this local REPL does not need compilation (most of the time), so I can keep AOT to a minimum (Kryo registration has to work, I sometimes use comparators which may not be re-compiled internally, stuff like that). Basically, this is "normal" REPL behaviour during development. Compilation kicks in when I run things on the cluster, i.e. with remote executors.

(2) For debugging, I hook into an nREPL server on my Spark driver. There, I can fire new Spark jobs to collect results from RDDs already cached throughout the calculation. Here, I cannot do anything with complete freedom, because stuff needs to be precompiled if it travels the boundaries of VMs (i.e. from driver to executor).
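The '@'-dereferencing trick above can be sketched in plain Clojure. Note that `lookup-name` and the table contents are hypothetical, made up here only to show how an atom can stand in for a broadcast in a test:

```clojure
;; A function written against "anything dereferencable" rather than
;; a concrete sparkling broadcast.
(defn lookup-name
  "Resolve an id against a dereferencable lookup table."
  [lookup-table id]
  (get @lookup-table id :unknown))

;; In a unit test, an atom stands in for the broadcast, so no
;; Spark context (not even a local one) has to be fired up.
(def test-table (atom {1 "alice", 2 "bob"}))

(lookup-name test-table 1) ;; => "alice"
(lookup-name test-table 3) ;; => :unknown
```

In production, the same `lookup-name` receives a real sparkling broadcast instead of the atom; the function body does not change.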
But in my opinion and from my experience, that's rarely necessary: for debugging, I usually had to look up things from several RDDs and work on those values. As these values reside in the driver after looking them up, I do not need precompiled functions. Very rarely do I need to issue Spark jobs with 'new' functions, and actually, most of the time comp and partial did the trick to compose existing functions into what I needed.

(3) For interactive data exploration, I started a Gorilla REPL connected to an nREPL server inside my driver. This is the only case where I (sometimes) see a necessity to run arbitrary jobs on the Spark cluster. But most of the time I was content with running my existing tooling on (new) data.

Hope this gives you an impression of my workflow and helps you design yours. If yours differs, I'd like to hear about it, because you never stop learning ;)

Sincerely,
Chris
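The comp/partial trick mentioned above can be illustrated with a pure-Clojure sketch; `parse-fields` and `long-enough?` are hypothetical stand-ins for functions that would already be AOT-compiled and present on the executors' classpath:

```clojure
(require '[clojure.string :as str])

;; Stand-ins for functions that already exist in the compiled app.
(defn parse-fields [line] (str/split line #","))
(defn long-enough? [n fields] (>= (count fields) n))

;; At the driver REPL, compose the existing functions instead of
;; defining a brand-new fn whose class would have to be compiled
;; and shipped to the executors.
(def interesting? (comp (partial long-enough? 3) parse-fields))

(interesting? "a,b,c") ;; => true
(interesting? "a,b")   ;; => false
```

Because `comp` and `partial` only produce instances of classes that already exist, the composed function can cross the driver/executor boundary without fresh AOT compilation, which is the point being made here.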
Thanks for the thorough reply. I really appreciate all the time you spend on this awesome project. Currently, our workflow is a little less sophisticated: we start a REPL on a large EC2 instance, load up a local Spark context and work with freedom. The functions we define in the REPL don't need to be sent out to worker nodes, since we are operating on a single instance. This falls apart, however, when we need to operate on a cluster instead of a single machine. I think what you have outlined in number 2 will help us solve the cluster problem, and I now have more homework to look into broadcast variables and the way they are used in Spark and Sparkling. Thanks!
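A minimal local-context session along those lines might look like the following sketch. This assumes the Sparkling API as shown in its README (`sparkling.conf` and `sparkling.core`) and is not runnable without Spark and Sparkling on the classpath:

```clojure
(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

;; A local[*] master keeps everything in one JVM, so functions
;; defined at the REPL never have to travel to remote executors.
(def c (-> (conf/spark-conf)
           (conf/master "local[*]")
           (conf/app-name "repl-exploration")))

(spark/with-context sc c
  (-> (spark/parallelize sc [1 2 3 4 5])
      (spark/map inc)
      spark/collect))
;; collect brings the incremented values back into the driver REPL
```

On a real cluster (a non-local master), the `inc` step would instead have to be an AOT-compiled function, which is exactly where this single-instance workflow falls apart.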
FYI: there is another project with the goal of eliminating the AOT requirement for a cluster setup: https://github.com/HCADatalab/powderkeg
However, it is currently on an older version of Spark and I doubt they support Spark Streaming. I think it would be great to merge these projects to get the benefits of both.
My understanding of powderkeg is that you pretty much don't use any of the Spark/RDD abstractions, etc. They use transducers and as few Spark-specific calls as possible. But I agree, it would be pretty nice to be able to use both!
Hi, I had the chance to ask Christophe Grand about Powderkeg very briefly at ClojureD. It definitely looks promising, and I like the ability to connect the REPL to the cluster for interactive development. The magic happens by using a JVM agent in the REPL to transfer the classes. That's a mechanism I never used but which should also be possible to use with Sparkling. Maybe we can combine that stuff. I will have a deeper look at Powderkeg anyhow, but do not expect anything to come soon. I also think, right at the moment, that this might be a feature for another library like "sparkling-repl". Please +1 if you would love that feature for your work.
+1
That's correct: a JVM agent is Java 1.7+ tooling, and it is implemented in Powderkeg here: https://github.com/HCADatalab/powderkeg/blob/master/src/main/java/powderkeg/Agent.java
Good to know that this needs access to the JVM for the instrumentation. I've just started to learn about this project.