
Discussion of current dplyr and LINQ like APIs #1025

Closed
davidanthoff opened this issue Aug 2, 2016 · 58 comments

Comments

@davidanthoff
Contributor

davidanthoff commented Aug 2, 2016

There are a bunch of efforts under way in this area, and this issue can serve as a discussion place for any cross-cutting issues. Currently, there seem to be these efforts (let me know if there are more and I'll add them):

There are various writeups that people should be familiar with, e.g.

@davidanthoff
Contributor Author

@davidagold had a couple of questions about LINQ.jl in a discussion over there; I'll respond to them here:

One of the goals I'm trying to achieve is to disentangle the manipulation interface from an implementation vis-à-vis actually manipulating an in-memory Julia table (be it a DataFrame or otherwise). My strategy for achieving this is to lower user manipulation commands to a graph and then in turn lower the graph based on the type of the underlying data source. It's not clear to me the extent to which you envision a similar objective for LINQ.jl. Could you elaborate on this?

Yes, I'm starting to work on this. The original LINQ architecture has two parts: an IEnumerable part and the IQueryable part. The latter builds up a query graph that can then be translated into SQL, or anything else (they had really interesting things like parallelism and cluster stuff implemented like that). I just started working on Queryable in LINQ.jl, so there is little to see at this point, but it is coming.

Insofar as both of our approaches involve or will involve "lowering" user commands to a graph layer (or some other internal representation of the structure of the query)

I think this is one of the differences: LINQ has one part that lowers things to graphs, but another one that is just based on enumerables/iterators. The nice thing about that is that it works over data sources that don't even know about LINQ: as long as a data source implements IEnumerable, it "just works". From a user's point of view the IEnumerable and IQueryable parts look exactly the same, which is really nice.
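
To make the enumerable side concrete, here is a minimal Julia sketch (not the actual LINQ.jl code; the source and the generator are purely illustrative): any iterable that yields named-tuple-like rows can be queried with plain iterator transforms, without the source knowing anything about the query framework.

source = [(name = "Alice", age = 50), (name = "Bob", age = 30)]

q = (r.name for r in source if r.age > 45)   # "where age > 45, select name"
collect(q)                                   # ["Alice"]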

As for comparisons of implementations for actual table objects, we can let benchmarks decide.

Yes, and I'm not sure my approach will be able to get the performance we need. I've made much progress in the last day, but it is still slower than a bitbroadcast. My sense on performance right now is this: so far I have just implemented an iterator that produces a stream of NamedTuples from a DataFrame. But alternatively DataFrame could actually hook into the Queryable interface and then use something like bitbroadcast if the query is simple enough...

@davidanthoff
Contributor Author

Here are my thoughts on another topic: scoping of names. I'll first list a couple of approaches:

  • jplyr.jl right now asks you to use column names "plain", and if you want to access a variable from outside the scope of the query, you prefix it with $.
  • jplyr_notes.jl has the same story, but the prefix for variables that are out-of-scope of the query is :.
  • In LINQ.jl you currently need to define an anonymous function for most query operators. The argument to the anonymous function will most likely be a NamedTuple if you iterate a DataFrame. So then you access the columns by something like i.age if i is the name of the argument of the anonymous function.
  • In LINQ proper, you can "reuse" these new variables that represent the current row across query operators, and I hope to add something similar to LINQ.jl once I add a proper query macro. In C# this would look like this:
var q = from i in datasource
    where i.age > 45
    select i.name;

Between the different escape characters, I prefer $ over :.

I like the LINQ approach because in many ways there is actually no escaping going on, it seems pretty clear. It also gets really useful once you start to think about joins and flattening of nested sequences, and you can use things like the let keyword in a query to add additional "local" variables. The downside of course is that a) each reference to a column gets slightly more verbose, and b) you have to define the name of the iterator variable somewhere. In my mind a) is not so much of a problem, it certainly seems worth the clarity one gains. b) seems bad if one is only applying say one filter operation, but once a query gets complicated, that cost seems to get less bad. But still, if all you want is just one filter, things get a lot more verbose...

There are probably intermediate solutions that combine some of these ideas...

@johnmyleswhite
Contributor

johnmyleswhite commented Aug 2, 2016

I personally think either the LINQ approach or the jplyr approach will be viable. I'm not strongly committed to either. I would, though, like to make sure that an iterator construct is not part of the official semantics since distributed databases (like Presto) can't support it. I'd like to allow iterators to be data sources, but I'd like to make sure the semantics don't assume access to iterators. Other than that concern, my impression is that jplyr and LINQ are very similar and I'd be happy to see either grow into a full solution of our problems.

To me, the real struggle is that we need to flesh out a completely abstract definition of the semantics we want and ensure that the semantics correspond closely enough to SQL that we have some hope of emitting efficient SQL in the future. As an example, I think it's useful to say that a tabular data library behaves as if it were operating on a multiset of named tuples, but I think it's essential that we not say that this is actually how things behave or commit to the use of named tuples. In particular, I very strongly want us to avoid making any commitments about the specific memory representations used, which is why I ultimately don't want to build against DataFrames, which needs to maintain a specific memory representation (in terms of columns being AbstractArray objects) to provide its full API involving row and column indexing. I'm on board with us implementing backends that happen to work that way, but I think that should be a coincidence rather than a given.

On the topic of precise semantics, the scoping issue you get at is really complicated from what I've seen. Abstractly I'd be alright with $ rather than :, but my understanding is that $ introduces Julia's semantics for splicing, which I would like to avoid for now. We might eventually agree that Julia's splicing rules are acceptable, but I would prefer not to pre-commit to them until we're sure that they admit the semantics we want.

Another reason why I consider scoping hard: in my notes, there are already examples in which the scoping issue allows you to write complicated queries that depend on local variables -- but not all of them can be translated into SQL effectively. I think the core issue about how we handle the capture of values is that we can only really support it universally when the value is a run-time identity-less value that has a representation as a SQL literal. So something like col1 = col1 + :x works fine if x is the number 1, but it does not work fine if we do something like col1 = col1 + :x[col1] since :x[col1] isn't invariant with respect to the rows being operated on. This is a place where we might want to tolerate extra functionality that we can get for free for in-memory tables, but also note that this extra functionality depends on behavior that is actually undefined and not portable.

@davidagold and I are slowly writing a completely abstract specification in a Google Doc that I can share with anyone who gives me their e-mail. But attempting to implement the specification reveals just how complicated it is going to be to ensure that we have an easily specified semantics. In particular, we're doing automatic lifting, which makes things really ergonomic since you don't even have to talk about handling nulls -- we just do it for you. But there are lots of functions that have specialized semantics for nulls and we need to come up with an approach to handling them.

@bramtayl

bramtayl commented Aug 2, 2016

It might just make sense to have a code translator that can translate back and forth between LINQ, SQL, and native code. Then some sort of query function that can take either LINQ or SQL code and use the desired input type and output type to come up with the right thing to do. +1 for automatic lifting.

@johnmyleswhite
Contributor

That probably requires more work to do since we'd have to write all the same semantics support as we would with the other proposals while also adding an additional syntax layer. Not a bad idea in the long run, but probably makes the task about twice as hard.

@bramtayl

bramtayl commented Aug 2, 2016

I was under the impression that was what other proposals were basically doing already? For example, DataFramesMeta takes a vaguely SQL/LINQ like syntax and then uses metaprogramming to translate it into native Julia code. It's also very similar to dplyr, which can translate dplyr syntax either into C++ code or SQL code. Riffing off of pandoc, maybe PanQuery?

P.S. the translator could probably support other syntaxes as well, like dplyr or data.table

@davidanthoff
Contributor Author

davidanthoff commented Aug 2, 2016

I have just finished a very prototypish implementation of Queryable support for SQLite in LINQ.jl. Take a look at example 04 in the examples folder to get a sense. The code is unbelievably unstable, almost any modification of either the where or select clause should break it at this point.

So at this point I still need to add a macro that actually provides language integrated query, and that is the next step. Once that is done, I should have examples of all the major architectural parts needed. All the implementations are unbelievably buggy and non-robust, but I first want to get an implementation that shows off how the major parts interact.

And just to recap, what is happening in the package: if you query a source that has a QueryProvider, that QueryProvider can rewrite the query in whatever way it wants, in this example the query gets rewritten as a SQL query and executed in SQLite. If the source doesn't have a QueryProvider, everything falls back to an iteration framework.
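
A rough Julia sketch of that dispatch, with hypothetical names (the package's actual types differ): a source that carries a query provider hands the query tree to the provider for rewriting, and everything else falls back to plain iteration.

abstract type QueryProvider end

struct SQLiteProvider <: QueryProvider
    dbfile::String
end

provider(source) = nothing   # default: treat the source as a plain enumerable
# provider(source::SQLiteTable) = SQLiteProvider(source.file)   # a provider-backed source would override this

execute(p::SQLiteProvider, querytree) =
    "rewrite $(querytree) into SQL and run it against $(p.dbfile)"
execute(::Nothing, querytree) =
    "iterate the source and evaluate $(querytree) row by row in Julia"

run_query(source, querytree) = execute(provider(source), querytree)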

@davidanthoff
Contributor Author

I would, though, like to make sure that an iterator construct is not part of the official semantics since distributed databases (like Presto) can't support it.

The LINQ design uses an iterator design as a conceptual basis, but any QueryProvider can do whatever it likes and never iterate anything. I'm just following the .Net LINQ design here.

As an example, I think it's useful to say that a tabular data library behaves as if it were operating on a multiset of named tuples, but I think it's essential that we not say that this is actually how things behave or commit to the use of named tuples. In particular, I very strongly want us to avoid making any commitments about the specific memory representations used, which is why I ultimately don't want to build against DataFrames, which needs to maintain a specific memory representation (in terms of columns being AbstractArray objects) to provide its full API involving row and column indexing.

Yes, that is the idea of LINQ and how I implemented this in LINQ.jl: when you use a data source like a SQL database you write your queries as if you were using NamedTuples, but in fact no NamedTuple ever needs to be instantiated. Again, this is just LINQ design: there you use anonymous types e.g. in your SELECT clause even though you might never actually instantiate it if everything is run in the DB.

Another reason why I consider scoping hard: in my notes, there are already examples in which the scoping issue allows you to write complicated queries that depend on local variables -- but not all of them can be translated into SQL effectively.

The way .Net LINQ handles this is pretty simple: you can write any valid C# code in your lambdas, but only a subset of that can be used when you have a Queryable data source. If you end up using something in your lambdas that can't be translated into SQL against a SQL data source, you get an error. If you execute this against an iterator/enumerable source, everything works and you can go crazy.

In particular, we're doing automatic lifting, which makes things really ergonomic since you don't even have to talk about handling nulls -- we just do it for you.

I have no good solutions for this, other than hope that julia might add things like the ?? operator from C# down the road. The design in my LINQ.jl implementation doesn't add anything over and above basic julia syntax for dealing with Nullables on that front. Not great, I know...

@davidanthoff
Contributor Author

Oh, and @johnmyleswhite please add me to the Google doc, my email is on my github profile page.

@bramtayl

bramtayl commented Aug 3, 2016

Ok, well I wrote up a basic framework for a translation based API. See gist here:
https://gist.github.com/bramtayl/c879e2671c0d68b203f961426bc57e49

@johnmyleswhite
Contributor

The way .Net LINQ handles this is pretty simple: you can write any valid C# code in your lambdas, but only a subset of that can be used when you have a Queryable data source. If you end up using something in your lambdas that can't be translated into SQL against a SQL data source, you get an error. If you execute this against an iterator/enumerable source, everything works and you can go crazy.

Does this error out after generating SQL and seeing that it fails? (Which would be really easy to implement.) Or does it error out earlier?

@johnmyleswhite
Contributor

I have no good solutions for this, other than hope that julia might add things like the ?? operator from C# down the road. The design in my LINQ.jl implementation doesn't add anything over and above basic julia syntax for dealing with Nullables on that front. Not great, I know...

I'm very torn on this issue. On the one hand, your solution is clearly much more general. On the other hand, I increasingly suspect that >99% of use cases can be solved using the default lifting strategy that assumes an expression evaluates to NULL if any input is null-valued. Given that people seem to really hate writing a + b as isnull(a) || isnull(b) ? NULL : get(a) + get(b), I think it's worth seeing how often people find that default lifting fails for them. It clearly will fail, but I'm not yet sure if it will fail in so many ways that we can't just patch things up with a few custom operations that depend only upon syntactic properties of quoted expressions.
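
For concreteness, here is that manual pattern written out for two values, using the 0.5-era Nullable type (which now lives in the Nullables.jl package); purely illustrative of the "null in, null out" default:

using Nullables

a = Nullable(1.5)
b = Nullable{Float64}()   # null

c = (isnull(a) || isnull(b)) ? Nullable{Float64}() : Nullable(get(a) + get(b))
# c is a null Nullable{Float64}, following the default lifting rule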

@davidanthoff
Contributor Author

davidanthoff commented Aug 3, 2016

The way .Net LINQ handles this is pretty simple: you can write any valid C# code in your lambdas, but only a subset of that can be used when you have a Queryable data source. If you end up using something in your lambdas that can't be translated into SQL against a SQL data source, you get an error. If you execute this against an iterator/enumerable source, everything works and you can go crazy.
Does this error out after generating SQL and seeing that it fails? (Which would be really easy to implement.) Or does it error out earlier?

It errors out when the QueryProvider tries to translate the query tree into a SQL statement. The provider realizes that the AST for the lambda has some construct that has no equivalent in SQL, and throws an error. So nothing gets sent to the DB because things fail earlier, in the translation phase.
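
A toy sketch of the kind of check involved (this is not the LINQ or LINQ.jl code, just an illustration): walk the AST of a predicate like :(i.age > 45), emit a SQL fragment for the constructs you recognize, and throw for everything else.

function to_sql_predicate(ex)
    if ex isa Expr && ex.head == :call && ex.args[1] in (:>, :<, :(==))
        string(to_sql_predicate(ex.args[2]), " ", ex.args[1], " ", to_sql_predicate(ex.args[3]))
    elseif ex isa Expr && ex.head == Symbol(".")   # i.age -> bare column name
        string(ex.args[2].value)
    elseif ex isa Number
        string(ex)
    else
        error("construct has no SQL equivalent: ", ex)   # e.g. an arbitrary Julia call
    end
end

to_sql_predicate(:(i.age > 45))       # "age > 45"
to_sql_predicate(:(i.age > rand()))   # throws: no SQL equivalent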

@ararslan
Member

ararslan commented Aug 3, 2016

has no equivalent in SQL

So if we were to implement similar functionality in Julia, we would have to know the exact backend we're on, since not all databases support the same SQL constructs (apart from the obvious core clauses). Right?

@davidanthoff
Contributor Author

I think one of the core design differences between jplyr and LINQ is that jplyr is more prescriptive about the conceptual data model: as @johnmyleswhite writes, the things that the queries operate on are bags of named tuples (or something like that). LINQ is less prescriptive here: the core IEnumerable interface just says that query operators work on bags of things (of one type), but these don't have to be named tuples. So you can query an array of anything, as long as all the elements in the bag have the same type, or anything else, really.

If you start to query a DB or a DataFrame, you will actually start out with a bag of named tuples, and in the case of a DB, if you stick to named tuples in all your query operators, it can be translated into SQL and executed in the DB. But that just happens to be a special-case restriction that e.g. a specific QueryProvider imposes; it is not core to the general idea of LINQ.

That does give you a lot of flexibility: you can in theory use LINQ to query XML, JSON, any custom data structure, deeply nested hierarchical stuff etc.

@davidanthoff
Contributor Author

has no equivalent in SQL
So if we were to implement similar functionality in Julia, we would have to know the exact backend we're on, since not all databases support the same SQL constructs (apart from the obvious core clauses). Right?

Well, that is actually an interesting question... I think in .Net they have one QueryProvider that can actually talk to multiple databases, so they have another abstraction layer between the LINQ stuff and concrete SQL databases (the whole entity framework). But this could just be implemented in different ways: either you could have a QueryProvider per DB backend, or you could have some generic SQL QueryProvider that can talk to multiple different DB backends. The latter probably makes more sense because a lot of the functionality will be similar across DB backends.

@davidanthoff
Contributor Author

davidanthoff commented Aug 3, 2016

@ararslan Essentially, if you look at the code here, this is where I translate the query DAG into SQL. That function is the most hacky, unrobust thing I've ever written, and a proper implementation would detect whether the lambdas that are passed actually contain stuff that can't be supported in the DB and throw errors.

@davidanthoff
Contributor Author

@johnmyleswhite I'm not fully up to speed with the whole lifting discussion. But in C#, the compiler provides lifting, somehow? Wouldn't that be the best solution, i.e. have proper lifting semantics in julia itself, and then have something like LINQ.jl just pick up the default language semantics?

@johnmyleswhite
Contributor

I don't know if LINQ adds more to the C# spec, but C# only lifts the core arithmetic functions, which means that something like log(a) + b * sqrt(c) actually needs a bunch of manual lifting to work. In contrast, the default approach I'm taking is all contained in these lines: https://github.com/johnmyleswhite/jplyr_notes.jl/blob/123e170c2fdecb2176852a1544bc7b4dca803608/src/mutate/apply_mutate_func.jl#L24
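
To illustrate what that manual lifting looks like under the null-propagating rule (again with the 0.5-era Nullable, via Nullables.jl on current Julia; the function name is made up for the example):

using Nullables

lifted(a, b, c) = (isnull(a) || isnull(b) || isnull(c)) ?
    Nullable{Float64}() :
    Nullable(log(get(a)) + get(b) * sqrt(get(c)))

lifted(Nullable(2.0), Nullable(3.0), Nullable(4.0))         # Nullable(6.69...)
lifted(Nullable(2.0), Nullable{Float64}(), Nullable(4.0))   # null result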

I'm increasingly confident that the number of exceptions to those semantics that are needed in practice is very small. I'd like to break that functionality out since automatic lifting and application of a function to a source of tuples are mostly orthogonal. But my feeling, considering how much unhappiness I've seen with the state of Nullable, is that we really need to make it easy to operate on nullable data without having to do manual lifting -- assuming we can mostly make it work. Maybe that should happen in the compiler one day, but it can't hurt to demo out the ideas elsewhere first.

@davidanthoff
Contributor Author

Yes, I agree, the whole "how do we deal with nullable" debate seems orthogonal to the "what is the core data model for a query framework" question.

I have now watched @johnmyleswhite's JuliaCon talk from last year and thought a bit more about the nullable situation. Here are some random comments; I still haven't made up my mind:

  • I'm not a fan of having different lifting semantics in queries and in normal julia code. Maybe fine to play around with ideas in the query stuff, but I think any long term solution really needs to use the same semantics in both places.
  • I like the C# lifting thing, I think minimally it would be great if those were just present in Base. I think one would need more, but I don't see any reason to not have those in Base.
  • I'm really worried about lifting everything. Somehow that just strikes me as wrong. An example where that might actually be undesirable is strings: when I concatenate two strings, and one is null, I might actually think of that as an empty string in some cases, but not in other cases. Seems to be really context specific.
  • I really liked the idea that @johnmyleswhite brought up in last year's talk about call-site lifting, and I have a new syntax idea for that: why not have a syntax akin to the dot vectorization syntax that made it into Base for lifting? For example, log?(x) would call the lifted version of log. I think for infix operators that would be awkward (e.g. x +? b looks odd to me), but that could be solved by just default lifting those, like in C#.

@bramtayl

bramtayl commented Aug 5, 2016

I'm not so sure about longer term solutions. But I think in the short term, separate packages for lifting and querying might be good. The querying package could automatically sprinkle in @lift macros from the lifting package? That way other packages could make use of lifting.

@bramtayl

bramtayl commented Aug 5, 2016

P.S. I'd like to advocate for making sure standard evaluation versions exist for all macros. This is close to the strategy hadley uses, and DataFramesMeta makes use of it as well. Simple example:

# function version (standard evaluation): builds the call expression from its arguments
chain(a, b) = :($a($b))

# macro version just delegates to the function; esc so the names resolve at the call site
macro chain(args...)
    esc(chain(args...))
end
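
With those definitions, @chain mean data expands to mean(data) at the call site, while another package can call chain(:mean, :data) directly to obtain the expression :(mean(data)) for its own code generation -- which is the point of exposing the standard-evaluation function alongside the macro.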

@davidagold

davidagold commented Aug 5, 2016

Whatever happens, JuliaLang/julia#16961 should be finalized and merged. Then, for simple call-site lifting, map would be sufficient, and a package dedicated to lifting would be most useful operating at the block expression level -- especially for if statements.

EDIT: and, in most cases, a lifting package would then be unnecessary.

@davidanthoff
Contributor Author

@davidagold I don't understand the idea about map for call-site lifting. How would I have to write log(x) to work with a nullable?

@davidagold

davidagold commented Aug 6, 2016

@davidanthoff The idea is that Nullable is a collection, albeit a specialized one, and hence one ought to be able to map a function over it. See also JuliaLang/julia#9446. So, you'd just write map(log, x), where x is the Nullable argument.

Some context: A big question we'd run into earlier with call-site lifting is how to select the parameter T in the empty Nullable{T}() that would be returned in case one of the arguments passed to a lifted function (with default lifting semantics) is null. This is essentially the same question as what to do with the eltype of the result of mapping over an empty collection (see JuliaLang/julia#11034 and some others that I don't have on hand right now.) JuliaLang/julia#16622 suggests that it may be okay to use type-inference to choose such T, which would let us codify the default lifting semantics in map, if folks decide this is the thing to do.
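
A sketch of what such a method could look like (hypothetical name; at the time of this thread the map-over-Nullable methods were only proposed, and Nullable now lives in Nullables.jl), with the result parameter chosen by inference as JuliaLang/julia#16622 suggests:

using Nullables

function nullable_map(f, x::Nullable{T}) where {T}
    S = Base.promote_op(f, T)
    isnull(x) ? Nullable{S}() : Nullable{S}(f(get(x)))
end

nullable_map(log, Nullable(2.0))        # Nullable{Float64}(0.693...)
nullable_map(log, Nullable{Float64}())  # empty Nullable{Float64}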

EDIT: Though hopefully the cases in which one will need to manually lift via map or some other such facility will be minimal -- the querying facilities could deal with lifting in most cases, if it's sufficiently expressive.

@davidanthoff
Contributor Author

davidanthoff commented Aug 9, 2016

I've added a query syntax macro to LINQ.jl, take a look here in the example folder.

I think with that I have an example of all the major pieces of a LINQ implementation there. Not one of them is robust, but I think this should be enough for people to get an idea about the whole design.
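
For orientation, a query in this style looks roughly like the following (syntax approximated from the examples folder, shown with the Query.jl name the package later adopted; column names and data are made up):

using DataFrames, Query

df = DataFrame(name = ["John", "Sally", "Kirk"], age = [23.0, 42.0, 59.0])

x = @from i in df begin
    @where i.age > 30
    @select {i.name, i.age}
    @collect DataFrame
end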

Some highlights of the existing implementation:

  • there is no escape character used to differentiate between the function variable space and the DataFrame variable space. I think this will be super helpful once joins and nested selects are supported. But even for the simple cases that work right now, it seems cleaner than symbols or escape characters.
  • it works with pretty much anything that can be iterated, i.e. the source doesn't have to be something with a conceptual table format at all, and the destination also doesn't have to be a table-like thing. It is actually quite handy to e.g. convert an array of some type into a DataFrame etc.
  • when you start with a data source that has a QueryProvider and you stick to a subset of what you could normally do, your query will be translated into whatever the QueryProvider supports and the QueryProvider can execute it. I have one example of that for SQLite, where the query gets translated into SQL and is then executed in the database (that part is really fragile).
  • When you stick to DataFrames, the conceptual data model is that of a NamedTuple iterator (although in the case of SQLite, you might actually never ever see a NamedTuple).
  • DataFrames right now are an enumerable source, i.e. they get iterated. I've spent a lot of time making that part type-stable, i.e. when you iterate your DataFrame and do things like @where and @select, the whole iteration protocol (the calls to start, next and done) is type-stable. Performance is worse than the fastest bitbroadcast-like implementation, but not terrible at all.

@johnmyleswhite, @davidagold I'd be very interested to hear what you think.

@johnmyleswhite
Contributor

@bramtayl: Please stop commenting on this issue until you have done enough serious engineering work for me to believe that your opinion reflects a measured and thoughtful consideration of all of the technical details involved. As is, you are wasting my time.

@davidanthoff
Contributor Author

@bramtayl I welcome your comments here. @johnmyleswhite I would appreciate it if we could keep the debate here civil and refrain from personal attacks.

@yeesian

yeesian commented Aug 9, 2016

I have been, and will continue, to meet up with @davidagold to discuss the generation of SQL for backends. What I think we can agree on is that we don't yet know enough to commit to a single decision for now, and we don't have that much time left to wrap this up as a summer project for david before he has to leave cambridge/MIT.

To describe it as a "waste of time" and "doing the same thing" is a passive-aggressive way of insisting on a level-of-commitment/involvement from us that doesn't exist. And John's comment was meant to help shield david from leaving the impression that he doesn't care about feedback. But to make further progress, he needs additional help and support, rather than advice/debate/discussion on what he should do.

@bramtayl

bramtayl commented Aug 9, 2016

I didn't mean that as an attack, just a suggestion, admittedly out of a place of ignorance. I would like to contribute to this project in the future, and I'll try to be more respectful.

@davidagold

@johnmyleswhite

Abstractly I'd be alright with $ rather than :, but my understanding is that $ introduces Julia's semantics for splicing, which I would like to avoid for now.

I assume you mean that it commits us to Julia's semantics for splicing because of convention, i.e. people will expect the semantics of $ in the querying context to line up with those of $ in string interpolation. Is that correct?

@davidanthoff
Contributor Author

I'm having a bit of trouble getting the dev branch to work (where do I find FunctionWrappers?).

The README has the link. I also cleaned it up a bit so that this info is easier to find. Plus, everything is on master now.

@davidanthoff
Contributor Author

I suppose you could extend whatever call-site lifting mechanism to cases of mixed (i.e. containing Nullable and non-Nullable types) argument signatures.

Yes, that will definitely be required for my LINQ.jl case. I guess if the dot syntax with broadcast for Nullables doesn't cover this, that would be a good reason to think about some other call-site syntax that deals with this. How are you dealing with mixed arguments in your automatic lifting for jplyr?

@davidanthoff
Contributor Author

I added support for simple joins to LINQ.jl, see example 8.

Two (maybe) general interest points from that:

  1. this demonstrates the benefit of not using escape characters quite nicely. By using range variables to deal with the name-scoping issue, I get a pretty elegant way to construct the on part of the join clause. Once I implement the SelectMany and let stuff this will be even more clear.
  2. this also shows quite nicely how one can combine different data sources with LINQ.jl, i.e. in this case I'm joining a DataFrame with a TypedTable and producing a DataFrame at the end. Of course the whole thing would equally work if one wanted to join a table-like data structure with something that is not a table, like a Dict or anything else that can be iterated.

@davidanthoff
Contributor Author

Oh, and finally: this blog has a 17 part series on how to build a LINQ IQueryable provider. It is a really, really fascinating read. The problem that is described there is essentially how you go from a C# query expression tree (like a julia expression tree) to a SQL statement. Slightly scary how complicated the whole thing is...

@quinnj
Member

quinnj commented Aug 11, 2016

17-part series.......... 😱

@davidagold

davidagold commented Aug 11, 2016

I don't know if there's any reason to suppose that broadcast will cover these cases. This would require a proliferation of methods. An alternative is to hack the promote system, but I suspect this will run into extensibility issues.

As for our handling of mixed lifting in jplyr, it is similar to the strategy of handling normal lifting. For instance, suppose a user invokes

@query filter(tbl, a > .5)

where :a refers to a NullableArray column. Essentially, we generate a function f = (x,) -> x > .5 and do something like

for row in row_itr_over_relevant_columns # in this case (tbl[:a],)
    if !hasnulls(row) && f(map(get, row)...)   # unwrap the Nullables, then apply the predicate
        push!(result, row)                     # keep rows that pass the filter
    end
end
result

(Note: this isn't quite what actually happens atm: see here.) We can do any sort of lifting -- mixed or otherwise -- this way, so long as the function to be (mixed) lifted (in this case x -> x > .5) is expressed within the context of @query. The trick then is to provide enough functionality in @query so that users never have to interact with missing data outside of the macro. Whether or not that's possible remains to be seen.

this demonstrates the benefit of not using escape characters quite nicely. By using range variables to deal with the name-scoping issue, I get a pretty elegant way to construct the on part of the join clause.

I don't understand. There are no names "external" to either df1 or df2 in play in your join example. What scoping issues are you referring to? Without an escape character, how will you do something like

for c in (1, 2, 3)
    @from i in df1 begin
        @where i.age > esc(c)
        @select ...
    end
end

?

@davidanthoff
Contributor Author

Good point, I should add an example that actually shows how that works, but that query would simply be:

for c in (1, 2, 3)
    @from i in df1 begin
        @where i.age > c
        @select i
    end
end

Essentially you are always in the julia scope, and the @from i in df1 clause introduces a new range variable that you can use to access the namespace of the DataFrame. This gets handy when you have two separate sources involved, like in a join. They might, e.g., have columns with the same name. By using a different range variable for each source, you can easily avoid namespace clashes between the columns of your sources (and of course the julia namespace).
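
A hypothetical sketch of how that plays out for a join (syntax approximate -- compare example 8 mentioned above -- and the data made up): both sources have an id column, the two range variables p and o keep them apart, and the Julia-scope variable cutoff needs no escaping.

people = [(id = 1, name = "Alice"), (id = 2, name = "Bob")]
orders = [(id = 1, total = 250.0), (id = 2, total = 40.0)]
cutoff = 100.0

q = @from p in people begin
    @join o in orders on p.id equals o.id
    @where o.total > cutoff
    @select {p.name, o.total}
end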

@andyferris
Member

Very interesting read here - sorry I've been absent from these discussions since JuliaCon.

And I just added support for TypedTables.jl (CC @andyferris) as a query source to LINQ.jl, so one can now query DataFrames, TypedTables, SQLite (brittle), arrays and any iterable. It should be relatively easy to add a collect method that creates a TypedTable; I just haven't gotten around to that.

Cool! I'd like to check that out, but that might be another few days at this rate :(

Anyway, I think the discussion here has been pretty constructive. I particularly agree that we should think of tables as bags/sets of named tuples, but that we have no specification for how either the named tuple or the collection/bag/set is implemented. In the case where we don't care about the implementation, the way things work efficiently in AbstractArray is to use map(), reduce(), broadcast(), mapreduce(), etc. That way you can use in-memory arrays, ArrayFire, distributed arrays, memory-mapped arrays, etc. I'm wondering if there is a minimal set of functions (like map()) that is sufficient to implement, e.g., an SQL query? I guess this is what Queryable is meant to provide?
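
The rough correspondence being gestured at, in plain Base functions (illustrative only, nothing package-specific):

rows = [(region = "EU", amount = 10.0), (region = "US", amount = 25.0), (region = "EU", amount = 5.0)]

# SELECT amount FROM rows WHERE region = 'EU'   ->   map + filter
eu_amounts = map(r -> r.amount, filter(r -> r.region == "EU", rows))

# ...and SUM(amount) on top of that             ->   reduce over the same stream
total = reduce(+, eu_amounts)   # 15.0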

About nullables, I have previously been strongly in the camp of manual lifting like Swift (or I assume C# from what I'm reading here). A better API in Base would go a long way. f?() is getting a bit much when we will also need f?.() and so-forth. But say ?? with sufficient logic would be useful - I've been meaning to post about my ideas for this at some point. On the other hand, automatically lifting throughout all of Julia would also be OK. I just want consistency :)

@davidagold

@davidanthoff Ah, of course. That makes a lot of sense, and is very handy. Though I don't think it'll be too hard to avoid a clash in something like

@query inner_join(tbl1, tbl2, by = (a = f(b)))

by following a convention that the name on the LHS (i.e. a) resolves to a column in the first table argument (tbl1) and the name on the RHS (b) resolves to a column in the second table argument (tbl2).

@johnmyleswhite
Contributor

I just want consistency :)

I think the crucial difficulty is that there are three sets of semantics to be considered: the semantics of non-nullable Julia, the semantics of nullable Julia and the semantics of SQL. You're never going to achieve full consistency, so you have to pick the cases where you're willing to make sacrifices.

@davidanthoff
Contributor Author

So what actually would happen with this in jplyr:

@query filter(tbl, isnull(a))

Or for that matter any filter expression that calls a function that knows how to deal with Nullable?

@davidagold

An excellent question =p

Right now, it'd break. In the future, we will support a (hopefully limited) set of special functions whose semantics depend on access to the Nullable container. isnull will be amongst them. Or, perhaps we provide a special syntax that a user can use to denote that they wish their function to interact with the Nullable container. (I think I'd prefer to avoid this latter option if possible.)

We are certainly making a wager, namely that the space of situations in which one will need to deploy special semantics that cannot be handled with a small set of special-cased functions (such as isnull) is very small, and hence that the aggregate inconvenience of finding workarounds in those cases is less than the aggregate inconvenience of manually lifting/unwrapping/casting in general.

Of course, perhaps some day small Union types will be performant and we will be able to go back to DataArrays and NAs. =p


@johnmyleswhite
Contributor

David and I have spent a lot of time thinking about that kind of issue. It's very troubling, but should be soluble if you believe that lifting almost always works the same way, except for a small blacklist of functions. The other approach strikes me as essentially a whitelist, in which you list all functions that need to be lifted. It's not clear to me which is superior.

But what really keeps me up at night is that you can produce frightening examples in the other direction when you allow Nullable semantics to diverge from the non-nullable semantics.

For example, suppose I do the following:

Base.+(x::Nullable{Int}, y::Nullable{Int}) = x.value * y.value

@select(tbl, x + y)

What does this do when run on a pure Julia data structure? What SQL does it translate into?

@davidagold

davidagold commented Aug 11, 2016

But what really keeps me up at night is that you can produce frightening examples in the other direction when you allow Nullable semantics to diverge from the non-nullable semantics.

Regarding this comment and the example John gives, I had a funny moment in which I really wanted to say something like,

"Well, one solution is to regard Nullable as an implementation detail of a tabular data structure, say Table, that allows for missing values, and actually exclude it from the semantics of Tables. Thus, any user-originated redefinitions of common operators over Nullable arguments are irrelevant. We intend for all of user interactions with a Table to take place through the querying commands we provide. If you ever need to materialize a row as a tuple of literals, (read: take the data out of the Table) you can do so via collect(qry, as=Tuple), which will return a tuple containing the data from the selected row translated from Nullable form into either the unwrapped values (if the data are present) or a special token NA object if the data are missing..."

Then I thought, hey, wait a minute... =p

But honestly, (and I suppose ironically given my work on NullableArrays) I don't think the above solution is all that far-fetched. If we do a good job with @query, there's no reason a tuple materialized by the means described above should ever enter into performance critical code, and so the impetuses (impeti? No, that's not right...) for avoiding NA aren't present. It's a weird situation to be in.

The use of Nullable should be an implementation detail. I'd almost go so far as to say that the fact that user definitions like Base.+(x::Nullable{Int}, y::Nullable{Int}) = x.value * y.value could possibly cause semantic upheaval means we're doing something wrong...

EDIT: "data" is plural.

@davidagold

Also, I know probably most people won't care about this, but I just realized that what I said earlier,

Second, it's also a performance issue. The lifting that we do in jplyr involves operating directly on the values field of a NullableArray column. It's nice to avoid indexing into a NullableArray and then checking whether or not the resultant Nullable is empty, and this pretty much involves "automatic" lifting.

is not correct. We do index into the NullableArray columns directly. I think I made that comment because that is how map over NullableArrays works, and it is an optimization I would like to put in place one day, at the very least when it is known that an entire column is not null. For whatever that's worth.

@davidanthoff
Contributor Author

Some updates:

  • I renamed my package to Query.jl. There really is no language integrated query here, it is all based on macros, so LINQ seemed the wrong name.
  • I will go with the whitelist approach to lifting. I created a new package called Lifts.jl where I hope we can assemble lifted versions of basic functions for now. It is a bit bare right now :) My hope is to cover at least the set that C# defines there, and then go from there. Having a separate package for this allows me to keep any consideration of lifting out of the basic design of Query.jl.
  • I keep adding more of the LINQ statements, so the list of examples is also expanding. I think at this point the whole thing is pretty usable already. In any case, I think this is past the "let's see whether this makes any sense" phase; I'm pretty convinced that the approach will work all the way through and be useful, so I plan to make this a proper package, not just a proof of concept.
  • Installation is a lot easier, and should soon not require any steps.
  • Apart from finishing out the LINQ query operators, I plan to add support for more sources, in particular for any DataStreams.jl source. Once @quinnj has cleaned up the whole package scene there it should be really easy to just support all of these sources in one go.

Generally, any feedback or help (PRs!) would be really welcome. The starting point for anyone interested should be the example folder.

@bramtayl

bramtayl commented Aug 16, 2016

I've put up a very small package, PanQuery.jl (not another one!). It has an absolutely bare-bones implementation of an Abstract Query Language (AQL). It's basically a simplified version of the graphs in jplyr. The idea is that packages would define their own methods to convert back and forth between Queries in various languages, like Julia, SQL, LINQ, AQL, dplyr, or data.table. Currently, only one such method is implemented, Julia -> AQL.
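
A toy illustration of that idea (these are not PanQuery.jl's actual types, just a sketch of the shape): represent the query as a small tree of nodes, then write one converter per target language.

struct FilterNode; source; predicate::Expr; end
struct SelectNode; source; columns::Vector{Symbol}; end

to_sql(t::Symbol) = String(t)
to_sql(q::FilterNode) = "SELECT * FROM ($(to_sql(q.source))) WHERE $(q.predicate)"
to_sql(q::SelectNode) = "SELECT $(join(q.columns, ", ")) FROM ($(to_sql(q.source)))"

to_sql(SelectNode(FilterNode(:tbl, :(age > 45)), [:name]))
# "SELECT name FROM (SELECT * FROM (tbl) WHERE age > 45)"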

@davidanthoff
Contributor Author

I've added support for DataStreams.jl (@quinnj) and NDSparseData.jl (@JeffBezanson) sources to Query.jl. I've also mapped out the issues that still need to be resolved for an initial version, the whole thing at this point seems pretty manageable, so I'm optimistic that we can have a fully implemented version of LINQ for julia soonish :) There is lots of performance work left to be done, but things look reasonable right now and I'll start to look at these optimizations later.

@davidanthoff
Contributor Author

Query.jl now has a complete and functional implementation of the in-memory data source part of LINQ as in the C# spec, with a couple extra things here and there. There is one prototype for a data source that queries SQLite via query translation, but that is at best a prototype.

My goal is to release and announce the package soon with the current functionality for in-memory sources (but this includes e.g. any DataStream source, e.g. CSV etc.). If some folks from this thread want to take the package for a test drive and report any feedback before I announce it more widely, I would greatly appreciate it.

I've started with documentation, but for now the best way to learn about the package is to look at the example folder, and then pretty much any online article about LINQ should more or less apply.

In terms of remaining work, there is lots:

  • Performance... I have done some, but not much around that.
  • I need to come up with a better way to select columns if one is using a data source that has a concept of columns. I have a whole bunch of ideas for that, this is mostly syntactic sugar for the functionality that is already there.
  • A working query provider for SQL. That is almost certainly the biggest and most complicated thing on this list.
  • Some more ways to deal with results. For example collecting into TypedTables, DataStream sinks etc. I also want to add something so that one can append to an existing data structure etc. Finally, I might add some convenient syntax so that one can have a continuation-like code block that gets run with the result of a query, but gets that in a type-stable way. Should be fairly straightforward, but one would end up with something resembling Node-style callbacks...

@davidanthoff
Contributor Author

I'm just waiting for my v0.1.0 tag to be merged in METADATA and then I'll announce Query.jl on the mailing lists. Any last minute feedback if you have used the package would be most welcome! In particular if you think there is something pressing that needs to be resolved before I announce it widely.

@quinnj
Member

quinnj commented Sep 7, 2017

Closing as Query.jl is now published and stable.

@quinnj quinnj closed this as completed Sep 7, 2017