Add support for Data Frames #8

chrisbetz · 2015-02-19T10:08:40Z

See https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

chetmancini · 2015-04-02T21:12:45Z

@chrisbetz I've been looking into this and have a local branch wrapping the DataFrame API using flambo. One issue is that the Spark SQL API in 1.2 and 1.3 is pretty different (e.g. no more JavaSQLContext or JavaSchemaRDD).

How do you plan to version Sparkling across Spark versions? Would you rather try to support both in a release or keep things separate? I put up a minimal example of what it took to get my flambo tests green in 1.3 here: sorenmacbeth/flambo#48

Might be hard to run things in parallel, but once there's essential 1.3 compat I don't imagine it'd be too hard to build out some functions to work with DataFrames. I personally would love to have a Clojure API for my applications soon; let me know if there's a way I can contribute.

chrisbetz · 2015-04-02T21:19:51Z

Hi,

thanks for offering to contribute! That’s great. I’ve not looked into the DataFrame API any further (just checked out the announcement document). But it looks promising and I really would like to support it.

Concerning the versioning - I’d like to think about that over the Easter holiday. Currently, I see two options:

a) Having different namespaces in the same project
b) branching off sparkling-1.2.0-X.Y.Z and sparkling-1.3.0-X.Y.Z.

If you see any other good options, just tell me.

I’ll come back to you regarding this.

Sincerely,

Chris


chetmancini · 2015-04-02T21:23:20Z

@chrisbetz Great. Both those sound like viable options; once you pick a route I'll see where we could take support for this. Have a great Easter.

erasmas · 2015-04-20T08:18:52Z

Hi guys! Any update on providing support for Data Frames?

chrisbetz · 2015-04-20T08:26:59Z

Hi, sorry, no support for that yet. We need to support at least Spark 1.1 from CDH, Spark 1.2.x, and Spark 1.3, and I need to find a way to support all of them. Currently, I'm on serialization tasks and thus a little busy. Data Frame support will definitely be the next thing to add, so stay tuned. Sorry, but coming up with a way forward requires some research and testing.

chetmancini · 2015-04-27T16:12:41Z

@erasmas I'm currently working on getting DataFrame support into Flambo, since that's what I'm using in prod (I'm looking at switching to Sparkling once I get some time to compare). The code's getting there, but I've been having some issues getting Spark 1.3 to run on the cluster for final testing.

prateekbhatt · 2015-12-21T18:37:18Z

Hi @chrisbetz @chetmancini, any updates on Data Frames support?

retnuh · 2016-01-05T09:52:35Z

I may have some time to wrap some of the code that I've written, but I've only ever used Spark 1.5.x.

@chrisbetz let me know how you'd like to proceed.

alza-bitz · 2016-03-01T00:25:48Z

Out of interest what form would a DataFrames wrapper take? For the reading & queries side of things would it be some declarative DSL similar to Datomic Datalog for example?

retnuh · 2016-03-01T08:56:47Z

I doubt it would look like Datalog. Considering that Sparkling's RDD
wrappers stick really close to the native interface, I'd say DataFrames
would be similar. The DSL would probably be sorta SQL like where you have
select statements with columns & expressions.

Going too far beyond that would probably impose quite an impedance
mismatch...


nabacg · 2016-04-28T08:08:31Z

I'd really like to help. I started putting something together the other day: nabacg@ae935a5
It's very basic, and I'm not sure how far you guys got. Maybe we could join our efforts, @retnuh?

retnuh · 2016-04-28T08:48:09Z

I would like to help but I've not had much time to work on this lately -
nor will I in the near future.

What I have is mostly just code that uses DataFrames; I hadn't really
gotten to the point of abstracting out the useful stuff (like a select
function that examines its args and "does the right thing" with wrapping
the args in an Array, if necessary, etc.)
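
As a rough illustration of that kind of helper, here is a minimal Java sketch. It does not use Spark at all: `select` is a hypothetical stand-in for a Java method shaped like "one name, then a varargs array of more names", and `SelectArgs` and `selectAll` are made-up names for this example only.

```java
import java.util.Arrays;
import java.util.List;

final class SelectArgs {
    // Hypothetical stand-in for a Java-facing method with the shape
    // (String first, String... rest). It only echoes what it received.
    static String select(String first, String... rest) {
        return "select(" + first + ", " + Arrays.toString(rest) + ")";
    }

    // Takes a plain list of column names and "does the right thing":
    // splits it into the first name plus a String[] holding the rest.
    static String selectAll(List<String> cols) {
        if (cols.isEmpty()) {
            throw new IllegalArgumentException("at least one column is required");
        }
        String first = cols.get(0);
        // toArray(new String[0]) yields a correctly typed String[], not Object[].
        String[] rest = cols.subList(1, cols.size()).toArray(new String[0]);
        return select(first, rest);
    }

    public static void main(String[] args) {
        System.out.println(selectAll(Arrays.asList("name", "age", "city")));
        System.out.println(selectAll(Arrays.asList("name")));
    }
}
```

The caller only ever hands over a list; the split into head plus typed array happens in one place. A Clojure wrapper could presumably do the same with `into-array`, but that is an assumption and not something taken from an existing Sparkling function.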


MarchLiu · 2016-07-07T17:59:21Z

Hi,
I created a PR at #49.
Most of the code is in use in my daily jobs. I have written some tests, but not for everything. I will submit more tests for the SQL and DataFrame functions.

NeilMenne · 2016-07-08T15:31:19Z

I have the following functionality that I could add:

- Parquet Support
- RDD <-> DataFrames
- A handful of other SQL related functions that I needed for my project

One of the bigger outstanding problems that I see is how DataFrame joins work. The Java syntax needs a good macro wrapper, but I haven't had time to finish my attempt.

I don't want to step on @MarchLiu's efforts, so I'll wait until his changes are sorted out before I throw any of this into the mix. It looks solid. I like how you made thread-ability a key part of your implementation. There were a couple of spots where I should have done that but didn't.

MafcoCinco · 2017-03-02T04:27:47Z

@NeilMenne would very much be interested in Parquet Support, if possible.

retnuh · 2017-03-02T10:48:36Z

To be clear, you can work with DataFrames and use Parquet files in the existing version; it's just annoying. You have to use the Java API more or less directly, and it suffers from some warts in Java <-> Scala interop, particularly around varargs. I used it successfully, but there was plenty of ugly code for creating and filling type-specific arrays, and weird calls where you pass one string and then an array of strings, etc. It is currently doable, just ugly.

I talk a bit about it in the talk I gave at ClojureConj in 2015: https://youtu.be/ARBiyYyW4Ow?t=689

Slides: https://www.slideshare.net/ZalandoTech/spark-clojure-for-topic-discovery-zalando-tech-clojureconj-talk (starting around slides 20-21)

H

EDIT: I posted this before I saw the 2.0 sparkling release, obviously!


NeilMenne · 2017-03-02T15:14:43Z

I no longer have access to the code I wrote at OpenTable. If there's a need for it, I could probably do a clean room implementation. I still use Spark at my current position, so it's fresh in my mind.

MafcoCinco · 2017-03-02T15:22:38Z

@NeilMenne That would be great, especially in the area of more idiomatic support for Parquet and RDD <-> DataFrames. If it is a ton of work, don't worry about it, but it would definitely be useful if you had the time.

NeilMenne · 2017-03-02T15:24:44Z

I'll have to get back up to speed on sparkling, but I'll see what I can do.

MafcoCinco · 2017-03-02T15:29:58Z

Awesome! Thanks so much.

MafcoCinco · 2017-04-06T19:56:22Z

My team has a hack project coming up, and we are planning to use Sparkling as part of the implementation. I'm going to take a crack at building an API for data frames. If successful, I'll submit it as a PR. For background: it seems like there is some support already, using a combination of the Java API and the new SQL API that was added in 2.x. Are there any examples of using the new SQL API and/or native (to Sparkling) data frame support? I just want to get a good picture of where I'm starting from, in the hope that I can avoid duplicating effort.

chrisbetz · 2017-04-07T05:47:42Z

Hi,

Unfortunately, I do not have working examples for this. Maybe somebody out there does? Please share your question on the Sparkling Google group. If you ask on Twitter, I could retweet from gorillalabs to reach more people.

Happy hacking!

Chris


MafcoCinco · 2017-04-19T19:39:45Z

I submitted a PR adding support for the SparkSession API. I think this will address most of what I personally need w.r.t. DataFrame and Parquet support, but I'm sure the implementation can be improved and made more complete.

xsyn · 2017-08-23T18:34:54Z

I realise that I may be flogging a dead horse, and that another PR was merged in instead of @MafcoCinco's. However, there were some really nice utility functions which @MafcoCinco had written and which I would have loved to have, specifically the dataframe->rdd-of- functions. @chrisbetz, would you be open to negotiation on bringing in some of these functions, or has that ship sailed? Should I rather be building these functions as a utility library for my projects?

Again, forgive me if this is out of line, I just think they're incredibly useful utilities and something I've found myself reaching for recently.

chrisbetz · 2017-08-24T14:00:44Z

Hi,

thanks for your input, and yes, I'm open to these additions. If you like, just create a PR with the things you'd like to see and I will look into it after my vacation.

Cheers,

Chris


chrisbetz added enhancement help wanted labels Feb 19, 2015
