
Connecting a REPL - and why it's not a problem to not be able to run arbitrary code #4

Open · chrisbetz opened this issue Jan 26, 2015 · 8 comments

Comments

@chrisbetz (Contributor)

  • How to configure an nREPL server
  • Security
  • Serialization explained (esp. mention bindings/closures)
  • How to work with a clustered app's nREPL and not run into serialization issues.
@jstokes commented Dec 8, 2015

Hey Chris - I would love to hear if you have some thoughts on this, since I'm starting to work a lot with sparkling and am in the process of figuring out a good workflow. Thanks!

@chrisbetz (Contributor, Author)

Hi,

OK, since you're asking, I'm just writing my answer here; maybe it'll make it into some documentation afterwards:

I see three scenarios for using a REPL with Apache Spark.

(1) for REPL-driven development, including (small scale) interactive data exploration
(2) for debugging a running Spark application
(3) for interactive data exploration (esp. on top of a predefined/pre-calculated/cached set of RDDs).

These scenarios differ not only in the 'why', but also in the 'what'.

(1) requires complete freedom and very fast answers. To me, a workflow is broken if I need to wait seconds for a REPL answer. It's the same as in unit testing: it has to be fast. Thus, I usually try to minimize the amount of data processed, and I even delay touching Spark for as long as I can.

This is exactly the reason why I made broadcasts '@'-dereferenceable: I can use anything dereferenceable instead of a real broadcast during tests. So instead of (get (sparkling.broadcast/value my-broadcast) my-id) I can use (@my-broadcast my-id) in the functions of my Spark app and test them using atoms instead of firing up a complete local Spark context.
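
To make that concrete, here is a minimal sketch of the pattern; `resolve-id` and the data are made up for illustration, and the production side assumes the sparkling.broadcast namespace mentioned above:

```clojure
;; The function only assumes its lookup table is deref-able (@), so a plain
;; atom can stand in for a real Sparkling broadcast in a unit test.
(defn resolve-id
  "Looks up my-id in a deref-able lookup table (broadcast in production, atom in tests)."
  [lookup my-id]
  (@lookup my-id))                         ;; deref'd map called as a function of its key

;; unit test: no Spark context needed at all
(resolve-id (atom {42 "forty-two"}) 42)    ;; => "forty-two"

;; in the Spark job, the very same function receives the broadcast instead,
;; since Sparkling broadcasts support @/deref as well.
```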

Wherever I need one, I create a local Spark context with minimal data, e.g. by parallelizing known data (parallelize or parallelize-pair) or by sampling (sparkling.core/sample) existing data from my data nodes.
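
For example, such a local context can be as small as this; a sketch assuming Sparkling's sparkling.conf and sparkling.core namespaces, with a made-up app name and data:

```clojure
(require '[sparkling.conf :as conf]
         '[sparkling.core :as spark])

(def local-conf
  (-> (conf/spark-conf)
      (conf/master "local[*]")            ;; everything stays in this JVM
      (conf/app-name "repl-exploration")))

;; parallelize a handful of known values so the REPL round-trip stays fast
(spark/with-context sc local-conf
  (->> (spark/parallelize sc [1 2 3 4 5])
       (spark/map inc)
       (spark/collect)))                  ;; small enough to eyeball at the REPL
```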

Now, most importantly: this local REPL does not need compilation (most of the time), so I can keep AOT to a minimum (Kryo registration has to work, I sometimes use comparators which may not be re-compiled internally, stuff like that). Basically, this is "normal" REPL behaviour during development.

Compilation kicks in when I run things on the cluster, i.e. with remote executors.
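
As a rough illustration, the Leiningen setup can confine AOT to the uberjar build that goes to the cluster; the namespace names below are hypothetical:

```clojure
;; project.clj fragment - day-to-day REPL work stays compilation-free, only
;; the cluster build AOT-compiles the namespaces that must exist as classes
;; on the driver and executors.
(defproject my-spark-app "0.1.0-SNAPSHOT"
  :profiles {:uberjar {:aot  [my-spark-app.core          ;; job fns shipped to executors
                              my-spark-app.registrator]  ;; Kryo registration
                       :main my-spark-app.core}})
```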

(2) for debugging, I hook into an nREPL server on my Spark driver. There, I can fire new Spark jobs to collect results from RDDs already cached throughout the calculation. Here, I cannot do everything with complete freedom, because code needs to be precompiled if it travels across VM boundaries (i.e. from driver to executor). But in my opinion, and from my experience, that's rarely necessary: for debugging, I usually had to look up things from several RDDs and work on those values. As these values reside in the driver after looking them up, I do not need pre-compiled functions. Very rarely do I need to issue Spark jobs with 'new' functions, and actually, most of the time comp and partial did the trick to compose existing functions into what I needed.
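
A sketch of that setup, assuming clojure.tools.nrepl is on the driver's classpath; `cached-users`, `active-since?` and `user->score` are hypothetical, pre-compiled parts of the application:

```clojure
(require '[clojure.tools.nrepl.server :as nrepl]
         '[sparkling.core :as spark])

;; started once inside the driver process, e.g. from -main:
(defonce repl-server (nrepl/start-server :port 7888))

;; later, from a REPL connected to the driver on port 7888: compose functions
;; that already exist in the uberjar (comp/partial) instead of defining new
;; ones, so nothing un-compiled has to travel to the executors.
(->> cached-users
     (spark/filter (partial active-since? #inst "2015-01-01"))
     (spark/map user->score)
     (spark/collect))
```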

(3) for interactive data exploration, I started a Gorilla REPL connected to an nREPL server inside my driver. This is the only case where I (sometimes) see a necessity to run arbitrary jobs on the Spark cluster. But most of the time I was content with running my existing tooling on (new) data.

Hope this gives you an impression of my workflow and helps you design yours. If yours differs, I'd like to hear about it, because you never stop learning ;)

Sincerely,

Chris

@jstokes commented Dec 11, 2015

Thanks for the thorough reply. I really appreciate all the time you spend on this awesome project.

Currently, our workflow is a little less sophisticated - we start a REPL on a large EC2 instance, load up a local Spark context, and work with freedom. The functions we define in the REPL don't need to be sent out to worker nodes, since we are operating on a single instance. This falls apart, however, when we need to operate on a cluster instead of a single machine. I think what you have outlined in number 2 will help us solve the cluster problem, and I now have some homework to do looking into broadcast variables and the way they are used in Spark and Sparkling.

Thanks,
Jeff

@plandes commented Feb 28, 2017

FYI - there is another project with the goal of eliminating the AOT requirement for a cluster setup: https://github.com/HCADatalab/powderkeg

However, it is currently on an older version of Spark, and I doubt they support Spark Streaming.

I think it would be great to merge these projects to get the benefits of both.

@retnuh commented Mar 1, 2017 via email

@chrisbetz (Contributor, Author)

Hi, I had the chance to briefly ask Christophe Grand about Powderkeg at ClojureD. It definitely looks promising, and I like the ability to connect the REPL to the cluster for interactive development.

The magic happens by using an agent in the REPL to transfer the classes. That's a Spark mechanism I never used, but which should also be possible to use with Sparkling. Maybe we can combine that stuff. I will have a deeper look at Powderkeg anyhow, but do not expect anything to come soon. Right now I also think this might be a feature for another library, like "sparkling-repl". Please +1 if you would love that feature for your work.

@retnuh commented Mar 2, 2017

+1
My understanding is that it is a JVM agent, not a Spark-specific thing. That makes it trickier to use when you don't necessarily have full control over the way remote JVMs are started (such as with Amazon EMR), but it may still be possible - there is a ton of flexibility in what happens in EMR. It's been a while since I had to dig into any of that stuff, though, so I don't recall off the top of my head.
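
For completeness: if the agent jar is already present on the nodes, Spark's extra-JVM-options settings are one way to attach it without owning the bootstrap scripts. A purely illustrative sketch - the jar path is hypothetical, and whether Powderkeg's agent actually works when attached this way is an open question:

```clojure
(require '[sparkling.conf :as conf])

(def cluster-conf
  (-> (conf/spark-conf)
      ;; executors are fresh JVMs, so this option reaches them at startup;
      ;; the path below is made up and must exist on every worker node.
      (conf/set "spark.executor.extraJavaOptions"
                "-javaagent:/opt/agents/powderkeg-agent.jar")))

;; the driver's own JVM is already running by the time this conf is built,
;; so its agent would have to be passed at launch time instead
;; (e.g. spark-submit --driver-java-options).
```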

@plandes commented Mar 2, 2017

That's correct: a JVM agent is standard Java (1.7+) tooling, and Powderkeg implements its agent here: https://github.com/HCADatalab/powderkeg/blob/master/src/main/java/powderkeg/Agent.java

Good to know that this needs access to the JVM startup for the instrumentation. I've only just started learning about this project.
