
Spark sql support #941

Merged: 1 commit merged into master on Oct 25, 2017

Conversation

@fwbrasil (Collaborator) commented Oct 21, 2017

Fixes #93

Problem

Quill could serve as a more user-friendly way to use Spark's SQL engine: the Dataset API is unintuitive and has many untyped methods.

Solution

Create a new module that allows users to define queries using Quill's DSL and run them on top of Datasets and RDDs.
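
For illustration, a rough sketch of what a query over a Dataset looks like with the new module (based on the quill-spark documentation; the imports, QuillSparkContext, and the exact run/liftQuery usage are assumptions and may differ from the code in this PR):

import org.apache.spark.sql.{ Dataset, SparkSession }
import io.getquill.QuillSparkContext._

case class Person(name: String, age: Int)

object AdultsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    implicit val sqlContext = spark.sqlContext
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("John", 30), Person("Jane", 17)).toDS

    // A typed, collection-like query instead of untyped Dataset operators:
    val adults = run {
      quote {
        liftQuery(people).filter(p => p.age > 18).map(_.name)
      }
    }
    adults.show()
  }
}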

Notes

  • Spark doesn't have a Scala 2.12 version yet. I upgraded to SBT 1.0 as a tentative fix for the problems with sbt-doge, but it didn't really solve the issue, so I reverted the SBT 1.0 upgrade and worked around it with a system-property hack.
  • Spark doesn't support bind variables, so Quill has to encode values directly into the SQL string (see the sketch below). Also, not all types supported by SqlContext are supported by Spark, which is why the context extends Context directly.
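
For instance (a hypothetical sketch in the style of the query above; the generated SQL is illustrative):

val byName = quote {
  liftQuery(people).filter(p => p.name == lift("John"))
}
// Since Spark has no bind variables, the lifted value is inlined as a
// literal, generating roughly: ... WHERE p.name = 'John' (not: p.name = ?)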

Checklist

  • Unit test all changes
  • Update README.md if applicable
  • Add [WIP] to the pull request title if it's work in progress
  • Squash commits that aren't meaningful changes
  • Run sbt scalariformFormat test:scalariformFormat to make sure that the source files are formatted

@getquill/maintainers

@fwbrasil fwbrasil changed the title Spark sql support [WIP] Spark sql support Oct 22, 2017
@fwbrasil (Collaborator, Author)

@getquill/maintainers this is ready for review

@fwbrasil fwbrasil changed the title [WIP] Spark sql support Spark sql support Oct 25, 2017
@fwbrasil (Collaborator, Author) commented Oct 25, 2017

If someone is interested in testing this new module, just change the Quill version to 2.1.0. Note that Spark doesn't have a Scala 2.12 release yet, so the project needs to be on Scala 2.11.

The documentation is here, and here is an example of a Spark job.
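
For reference, a minimal build setup for trying it out might look like this (the quill-spark artifact name and the Spark version are assumptions):

// build.sbt
scalaVersion := "2.11.11"

libraryDependencies ++= Seq(
  "io.getquill"      %% "quill-spark" % "2.1.0",
  "org.apache.spark" %% "spark-sql"   % "2.2.0"
)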

@mosyp (Collaborator) left a comment:

great work!

def probe(statement: String): Try[_] = Success(Unit)

val idiom = SparkDialect
val naming = Literal

Collaborator:

Does this mean that using different naming strategies is not possible?

@fwbrasil (Collaborator, Author):

Yes. Given that the new module uses Spark encoders, and encoders don't support naming strategies, only Literal can be used.
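
To illustrate (a hypothetical example, not code from this PR): Spark's Encoders derive column names directly from case class field names, so any renaming strategy would break the mapping:

case class Person(firstName: String, age: Int)
// The Encoder binds columns named exactly `firstName` and `age`.
// With e.g. SnakeCase, Quill would emit `first_name`, which the encoder
// could not resolve; hence only Literal (identity) naming is supported.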

- val ((an, iAn, on), ont) = dealias(a, iA, o)((_, _, _))
- val ((bn, iBn, onn), _) = ont.dealias(b, iB, on)((_, _, _))
+ val ((an, iAn, on), _) = dealias(a, iA, o)((_, _, _))
+ val ((bn, iBn, onn), _) = dealias(b, iB, on)((_, _, _))

@mosyp (Collaborator) commented Oct 25, 2017:

Could you give more context on these changes? I was debugging #939 and found that the root issue comes from this class. (This comment doesn't relate to the Spark module, so we could continue the discussion somewhere else.)

@fwbrasil (Collaborator, Author):

This was actually another bug: the same alias was being used when the same entity appears on both sides of a join, as in qr1.join(qr1). Each branch of the join should be treated as independent, so it doesn't make sense to reuse the transformer of the first branch.
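
A sketch of the problematic shape (qr1 here stands for a simple query over a single entity, as in Quill's test suite; the SQL is illustrative):

val q = quote {
  qr1.join(qr1).on((a, b) => a.i == b.i)
}
// Each branch must get its own alias, e.g.:
// SELECT ... FROM TestEntity a INNER JOIN TestEntity b ON a.i = b.i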

org: User
)

object GithubExample extends App {

Collaborator:

I think we should create a SPARK.md showing all the advantages of Quill over raw Spark, and link to it from the README, since IMO the existing README example doesn't live up to "Tired of the #scala #spark untyped madness?" :)

@fwbrasil (Collaborator, Author):

Good point. I'm actually planning to write a blog post based on this example, though I think I'll wait for a release that includes the new module first.

q match {
  case q: Entity => Some(q)
  case q: Infix  => Some(q)
  case _         => None
}

Collaborator:

How is this related to the other Quill modules? I suppose these changes are required to generate queries like SELECT x1.age _1 FROM (?) x1 WHERE x1.name = ?, but does it impact the rest of the codebase somehow, e.g. open up new possibilities?

@fwbrasil (Collaborator, Author):

I'd say this is a bug fix. Users should be able to use an infix in an entity position.
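
For example (a sketch using Quill's infix interpolator; the table name is made up):

case class Person(name: String, age: Int)

// An infix used where an entity would normally appear:
val people = quote {
  infix"some_external_table".as[Query[Person]]
}

val q = quote {
  people.filter(p => p.age > 18)
}
// The infix is spliced into the FROM clause in place of a plain table name.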

@mosyp (Collaborator) left a comment:

LGTM

@adelbertc:

Hey man, this is really cool; glad to see more people taking type safety in Spark seriously! I was wondering if you've seen the Frameless project? We started by providing a type-safe query layer for Spark SQL and have since started supporting other functionality as well.

Great work!

@fwbrasil fwbrasil merged commit a4c8432 into master Oct 25, 2017
@fwbrasil fwbrasil deleted the spark branch October 25, 2017 18:54
@fwbrasil (Collaborator, Author):

@adelbertc Thanks for the feedback! :) It's indeed great to see type-safe solutions like Quill and Frameless for Spark SQL. I imagine Dataset is one of the most used Scala DSLs, which is terrible news given how unintuitive it is.

I know Frameless, but I'm not a big fan of the Dataset API in general (select, where, etc.). I think it's much more natural if the user can express the computation using a collection-like API.

Successfully merging this pull request may close these issues: Spark SQL integration.