Add support for Data Frames #8

Open
chrisbetz opened this issue Feb 19, 2015 · 25 comments

Comments

@chrisbetz
Contributor

See https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

@chetmancini

@chrisbetz I've been looking into this and have a local branch wrapping the DataFrame API using flambo. One issue is that the Spark SQL API in 1.2 and 1.3 is pretty different (e.g. no more JavaSQLContext or JavaSchemaRDD).

How do you plan to version Sparkling across Spark versions? Would you rather try to support both in a release or keep things separate? I put up a minimal example of what it took to get my flambo tests green in 1.3 here: sorenmacbeth/flambo#48

Might be hard to run things in parallel, but once there's essential 1.3 compat I don't imagine it'd be too hard to build out some functions to work with DataFrames. I personally would love to have a Clojure API for my applications soon; let me know if there's a way I can contribute.

@chrisbetz
Contributor Author

Hi,

Thanks for offering to contribute! That’s great. I haven’t looked into the DataFrame API any further (I just checked out the announcement document), but it looks promising and I would really like to support it.

Concerning the versioning - I’d like to think about that over the Easter holiday. Currently, I see two options:

a) Having different namespaces in the same project
b) branching off sparkling-1.2.0-X.Y.Z and sparkling-1.3.0-X.Y.Z.

If you see any other good options, just tell me.

I’ll come back to you regarding this.

Sincerely,

Chris

@chetmancini

@chrisbetz Great. Both those sound like viable options; once you pick a route I'll see where we could take support for this. Have a great Easter.

@erasmas

erasmas commented Apr 20, 2015

Hi guys! Any update on providing support for Data Frames?

@chrisbetz
Contributor Author

Hi, sorry, no support for that yet. We need to support at least Spark 1.1 (from CDH), Spark 1.2.x and Spark 1.3, and I need to find a way to support all of them. Currently I'm working on serialization tasks and thus a little busy. Data Frame support will definitely be the next thing to add, so stay tuned. Sorry, but coming up with an approach requires some research and testing.

@chetmancini

@erasmas I'm currently working on getting DataFrame support into Flambo, since that's what I'm using in prod (I'm looking at switching to Sparkling once I get some time to compare). The code's getting there, but I've been having some issues getting Spark 1.3 to run on the cluster for final testing.

@prateekbhatt

Hi @chrisbetz @chetmancini, any updates on Data Frames support?

@retnuh

retnuh commented Jan 5, 2016

I may have some time to wrap some of the code that I've written, but I've only ever used Spark 1.5.x.

@chrisbetz let me know how you'd like to proceed.

@alza-bitz

Out of interest, what form would a DataFrames wrapper take? For the reading and querying side of things, would it be some declarative DSL similar to Datomic Datalog, for example?

@retnuh

retnuh commented Mar 1, 2016

I doubt it would look like Datalog. Considering that Sparkling's RDD wrappers stick really close to the native interface, I'd say DataFrames would be similar. The DSL would probably be sorta SQL-like, where you have select statements with columns & expressions.

Going too far beyond that would probably impose quite an impedance mismatch...

@nabacg

nabacg commented Apr 28, 2016

I'd really like to help. I started putting something together the other day: nabacg@ae935a5. It's very basic, and I'm not sure how far you guys got. Maybe we could join our efforts, @retnuh?

@retnuh

retnuh commented Apr 28, 2016

I would like to help, but I've not had much time to work on this lately, nor will I in the near future.

What I have is mostly just code that uses DataFrames; I hadn't really gotten to the point of abstracting out the useful stuff (like a select function that examines its args and "does the right thing" with wrapping the args in an Array, if necessary, etc.).
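
For illustration only, here is a minimal sketch of the kind of argument handling such a select helper might do. It assumes the goal is to call the Java varargs overload `select(String col, String... cols)` on a DataFrame, which from Clojure needs the trailing columns packed into a `String` array by hand. The names `column-names` and `select-args` are hypothetical and are not part of Sparkling or of any code posted in this thread.

```clojure
(ns example.select-args)

(defn column-names
  "Flattens a mix of strings, keywords, and sequential collections of
  either into a flat vector of column-name strings."
  [& args]
  (->> args
       flatten        ; unnest vectors/lists of columns
       (mapv name)))  ; :id -> \"id\", \"id\" stays \"id\"

(defn select-args
  "Returns [first-column rest-array], the shape a Java varargs method
  such as select(String col, String... cols) expects.
  Throws if no column is given."
  [& args]
  (let [[c & more] (apply column-names args)]
    (when-not c
      (throw (IllegalArgumentException. "select needs at least one column")))
    [c (into-array String more)]))

;; Hypothetical usage, assuming `df` is a Spark DataFrame:
;; (let [[c more] (select-args :id ["name" :age])]
;;   (.select df c more))
```

Keeping the normalization separate from the interop call means it can be exercised without a Spark context; whether a real wrapper should accept keywords at all is a design choice this sketch does not settle.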

@MarchLiu
Contributor

MarchLiu commented Jul 7, 2016

Hi,
I created a PR at #49.
Most of the code is already in use in my daily jobs. I have written some tests, but not for everything yet. I will submit more tests for the SQL and DataFrame functions.

@NeilMenne

I have the following functionality that I could add:

  • Parquet Support
  • RDD <-> DataFrames
  • A handful of other SQL related functions that I needed for my project

One of the bigger outstanding problems that I see is how DataFrame joins work. The Java syntax needs a good macro wrapper, but I haven't had time to finish my attempt.

I don't want to step on @MarchLiu's efforts, so I'll wait until his changes are sorted out before I throw any of this into the mix. It looks solid. I like how you made thread-ability a key part of your implementation. There were a couple of spots where I should have done that but didn't.

@MafcoCinco

@NeilMenne would very much be interested in Parquet Support, if possible.

@retnuh

retnuh commented Mar 2, 2017 via email

@NeilMenne

I no longer have access to the code I wrote at OpenTable. If there's a need for it, I could probably do a clean room implementation. I still use Spark at my current position, so it's fresh in my mind.

@MafcoCinco

@NeilMenne That would be great, especially in the area of more idiomatic support for Parquet and RDD <-> DataFrames. If it is a ton of work, don't worry about it but would definitely be useful if you had the time.

@NeilMenne

I'll have to get back up to speed on sparkling, but I'll see what I can do.

@MafcoCinco

Awesome! Thanks so much.

@MafcoCinco

My team has a hack project coming up and we were planning on using Sparkling as part of the implementation. I'm going to take a crack at building an API for data frames. If successful, I'll submit it as a PR. For background, it seems like there is some support already, using a combination of the Java API and the new SQL API that was added in 2.x. Are there any examples of using the new SQL API and/or native (to Sparkling) data frame support? I just want to get a good picture of where I'm starting from, in the hope that I can avoid duplicating effort.

@chrisbetz
Contributor Author

chrisbetz commented Apr 7, 2017 via email

@MafcoCinco

I submitted a PR adding support for the SparkSession API. I think this will address most of what I personally need w.r.t. DataFrame and Parquet support, but I'm sure the implementation can be improved and made more complete.

@xsyn
Contributor

xsyn commented Aug 23, 2017

I realise that I may be flogging a dead horse, and that another PR was merged in instead of @MafcoCinco's. However, there were some really nice utility functions that @MafcoCinco had written which I would have loved, specifically the dataframe->rdd-of- functions. @chrisbetz, would you be open to negotiation on bringing in some of these functions, or has that ship sailed? Should I rather be building these functions as a utility library for my projects?

Again, forgive me if this is out of line, I just think they're incredibly useful utilities and something I've found myself reaching for recently.

@chrisbetz
Contributor Author

chrisbetz commented Aug 24, 2017 via email
