
Connecting a REPL - and why it's not a problem to not be able to run arbitrary code #4

Open · chrisbetz opened this issue Jan 26, 2015 · 8 comments

Comments

@chrisbetz (Contributor)

  • How to configure an nREPL server
  • Security
  • Serialization explained (esp. mention bindings/closures)
  • How to work with a clustered app's nREPL and not run into serialization issues.
@jstokes commented Dec 8, 2015

Hey Chris - I would love to hear if you have some thoughts on this, since I'm starting to work a lot with sparkling and am in the process of figuring out a good workflow. Thanks!

@chrisbetz (Contributor, Author)

Hi,

OK, since you're asking, I'm just writing my answer here; maybe it'll make it into some documentation afterwards:

I see three scenarios for using a REPL with Apache Spark.

(1) for REPL-driven development, including (small scale) interactive data exploration
(2) for debugging a running Spark application
(3) for interactive data exploration (esp. on top of a predefined/pre-calculated/cached set of RDDs).

These scenarios differ not only in the 'why', but also in the 'what'.

(1) requires complete freedom and very fast answers. To me, a workflow is broken if I need to wait seconds for a REPL answer. It's the same as in unit testing: it has to be fast. Thus, I usually try to minimize the amount of data processed, and I even delay touching Spark for as long as I can.

This is exactly the reason why I made broadcasts '@'-dereferenceable: I can use anything dereferenceable instead of a real broadcast during tests. So instead of (get (sparkling.broadcast/value my-broadcast) my-id) I can use (@my-broadcast my-id) in the functions of my Spark app and test them using atoms instead of firing up a complete local Spark context.
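
To make that concrete, here is a minimal sketch of the pattern; `resolve-id` and the data are made up for illustration, and the production side assumes the sparkling.broadcast namespace mentioned above:

```clojure
;; The function only assumes its lookup table is deref-able (@), so a plain
;; atom can stand in for a real Sparkling broadcast in a unit test.
(defn resolve-id
  "Looks up my-id in a deref-able lookup table (broadcast in production, atom in tests)."
  [lookup my-id]
  (@lookup my-id))                         ;; deref'd map called as a function of its key

;; unit test: no Spark context needed at all
(resolve-id (atom {42 "forty-two"}) 42)    ;; => "forty-two"

;; in the Spark job, the very same function receives the broadcast instead,
;; since Sparkling broadcasts support @/deref as well.
```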

Wherever I need one, I create a local Spark context with minimal data, e.g. by parallelizing known data (parallelize or parallelize-pair) or by sampling (sparkling.core/sample) existing data from my data nodes.
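
For example, such a local context can be as small as this; a sketch assuming Sparkling's sparkling.conf and sparkling.core namespaces, with a made-up app name and data:

```clojure
(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(def local-conf
  (-> (conf/spark-conf)
      (conf/master "local[*]")            ;; everything stays in this JVM
      (conf/app-name "repl-exploration")))

;; parallelize a handful of known values so the REPL round-trip stays fast
(spark/with-context sc local-conf
  (->> (spark/parallelize sc [1 2 3 4 5])
       (spark/map inc)
       (spark/collect)))                  ;; small enough to eyeball at the REPL
```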

Now, most importantly: this local REPL does not need compilation (most of the time), so I can keep AOT to a minimum (Kryo registration has to work, I sometimes use comparators which may not be re-compiled internally, stuff like that). Basically, this is "normal" REPL behaviour during development.

Compilation kicks in when I run things on the cluster, i.e. with remote executors.
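
As a rough illustration, the Leiningen setup can confine AOT to the uberjar build that goes to the cluster; the namespace names below are hypothetical:

```clojure
;; project.clj fragment - day-to-day REPL work stays compilation-free, only
;; the cluster build AOT-compiles the namespaces that must exist as classes
;; on the driver and executors.
(defproject my-spark-app "0.1.0-SNAPSHOT"
  :profiles {:uberjar {:aot  [my-spark-app.core          ;; job fns shipped to executors
                              my-spark-app.registrator]  ;; Kryo registration
                       :main my-spark-app.core}})
```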

(2) for debugging, I hook into an nREPL server on my Spark driver. There, I can fire new Spark jobs to collect results from RDDs already cached throughout the calculation. Here, I cannot do everything with complete freedom, because code needs to be precompiled if it travels across VM boundaries (i.e. from driver to executor). But in my opinion, and from my experience, that's rarely necessary: for debugging, I usually had to look up things from several RDDs and work on those values. As these values reside in the driver after looking them up, I do not need pre-compiled functions. Very rarely do I need to issue Spark jobs with 'new' functions, and actually, most of the time comp and partial did the trick to compose existing functions into what I needed.
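
A sketch of that setup, assuming clojure.tools.nrepl is on the driver's classpath; `cached-users`, `active-since?` and `user->score` are hypothetical, pre-compiled parts of the application:

```clojure
(require '[clojure.tools.nrepl.server :as nrepl]
         '[sparkling.core :as spark])

;; started once inside the driver process, e.g. from -main:
(defonce repl-server (nrepl/start-server :port 7888))

;; later, from a REPL connected to the driver on port 7888: compose functions
;; that already exist in the uberjar (comp/partial) instead of defining new
;; ones, so nothing un-compiled has to travel to the executors.
(->> cached-users
     (spark/filter (partial active-since? #inst "2015-01-01"))
     (spark/map user->score)
     (spark/collect))
```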

(3) for interactive data exploration, I started a Gorilla REPL connected to an nREPL server inside my driver. This is the only case where I (sometimes) see a necessity to run arbitrary jobs on the Spark cluster. But most of the time I was content with running my existing tooling on (new) data.

Hope this gives you an impression of my workflow and helps you design yours. If yours differs, I'd like to hear about it, because you never stop learning ;)

Sincerely,

Chris

@jstokes commented Dec 11, 2015

Thanks for the thorough reply. I really appreciate all the time you spend on this awesome project.

Currently, our workflow is a little less sophisticated - we start a REPL on a large EC2 instance, load up a local Spark context, and work with freedom. The functions we define in the REPL don't need to be sent out to worker nodes, since we are operating on a single instance. This falls apart, however, when we need to operate on a cluster instead of a single machine. I think what you have outlined in number 2 will help us solve the cluster problem, and I now have some homework to do looking into broadcast variables and the way they are used in Spark and Sparkling.

Thanks,
Jeff

@plandes commented Feb 28, 2017

FYI - there is another project with the goal of eliminating the AOT requirement for a cluster setup: https://github.com/HCADatalab/powderkeg

However, it is currently on an older version of Spark, and I doubt they support Spark Streaming.

I think it would be great to merge these projects to get the benefits of both.

@retnuh commented Mar 1, 2017 via email

@chrisbetz (Contributor, Author)

Hi, I had the chance to briefly ask Christophe Grand about Powderkeg at ClojureD. It definitely looks promising, and I like the ability to connect the REPL to the cluster for interactive development.

The magic happens by using an agent in the REPL to transfer the classes. That's a Spark mechanism I never used, but which should also be possible to use with Sparkling. Maybe we can combine that stuff. I will have a deeper look at Powderkeg anyhow, but do not expect anything to come soon. Right now I also think this might be a feature for another library, like "sparkling-repl". Please +1 if you would love that feature for your work.

@retnuh commented Mar 2, 2017

+1
My understanding is that it is a JVM agent, not a Spark-specific thing. That makes it trickier to use when you don't necessarily have full control over the way remote JVMs are started (such as with Amazon EMR), but it may still be possible - there is a ton of flexibility in what happens in EMR. It's been a while since I had to dig into any of that stuff, though, so I don't recall off the top of my head.
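
For completeness: if the agent jar is already present on the nodes, Spark's extra-JVM-options settings are one way to attach it without owning the bootstrap scripts. A purely illustrative sketch - the jar path is hypothetical, and whether Powderkeg's agent actually works when attached this way is an open question:

```clojure
(require '[sparkling.conf :as conf])

(def cluster-conf
  (-> (conf/spark-conf)
      ;; executors are fresh JVMs, so this option reaches them at startup;
      ;; the path below is made up and must exist on every worker node.
      (conf/set "spark.executor.extraJavaOptions"
                "-javaagent:/opt/agents/powderkeg-agent.jar")))

;; the driver's own JVM is already running by the time this conf is built,
;; so its agent would have to be passed at launch time instead
;; (e.g. spark-submit --driver-java-options).
```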

@plandes commented Mar 2, 2017

That's correct: a JVM agent is standard Java (1.7+) tooling, and Powderkeg implements its agent here: https://github.com/HCADatalab/powderkeg/blob/master/src/main/java/powderkeg/Agent.java

Good to know that this needs access to the JVM startup for the instrumentation. I've only just started learning about this project.
