Discussion of current dplyr and LINQ like APIs #1025
Comments
@davidagold had a couple of questions about LINQ.jl in a discussion over there, I'll respond to them here:
Yes, I'm starting to work on this. The original LINQ architecture has two parts: an
I think this is one of the differences: LINQ has one part that lowers things to graphs, but another one that is just based on enumerables/iterators. The nice thing about that is that it works over data sources that actually don't even know about LINQ, as long as a data source implements
Yes, and I'm not sure my approach will be able to get the performance we need. I've made much progress in the last day, but it is still slower than a |
Here are my thoughts on another topic: scoping of names. I'll first list a couple of approaches:
q = from i in datasource
where i.age > 45
select i.name;
Between the different escape characters, I prefer I like the LINQ approach because in many ways there is actually no escaping going on, it seems pretty clear. It also gets really useful once you start to think about joins, flattening of nested sequences and you can use things like the There are probably intermediate solutions that combine some of these ideas... |
I personally think either the LINQ approach or the jplyr approach will be viable. I'm not strongly committed to either. I would, though, like to make sure that an iterator construct is not part of the official semantics since distributed databases (like Presto) can't support it. I'd like to allow iterators to be data sources, but I'd like to make sure the semantics don't assume access to iterators. Other than that concern, my impression is that jplyr and LINQ are very similar and I'd be happy to see either grow into a full solution of our problems. To me, the real struggle is that we need to flesh out a completely abstract definition of the semantics we want and ensure that the semantics correspond closely enough to SQL that we have some hope of emitting efficient SQL in the future. As an example, I think it's useful to say that a tabular data library behaves as if it were operating on a multiset of named tuples, but I think it's essential that we not say that this is actually how things behave or commit to the use of named tuples. In particular, I very strongly want us to avoid making any commitments about the specific memory representations used, which is why I ultimately don't want to build against DataFrames, which needs to maintain a specific memory representation (in terms of columns being On the topic of precise semantics, the scoping issue you get at is really complicated from what I've seen. Abstractly I'd be alright with Another reason why I consider scoping hard: in my notes, there are already examples in which the scoping issue allows you to write complicated queries that depend on local variables -- but not all of them can be translated into SQL effectively. I think the core issue about how we handle the capture of values is that we can only really support it universally when the value is a run-time identity-less value that has a representation as a SQL literal. 
So something like @davidagold and I are slowly writing a completely abstract specification in a Google Doc that I can share with anyone who gives me their e-mail. But attempting to implement the specification reveals just how complicated it is going to be to ensure that we have an easily specified semantics. In particular, we're doing automatic lifting, which makes things really ergonomic since you don't even have to talk about handling nulls -- we just do it for you. But there are lots of functions that have specialized semantics for nulls and we need to come up with an approach to handling them. |
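The point about captured values can be illustrated with a small sketch. The function name `sql_literal` and the set of supported types are assumptions made for illustration, not part of jplyr or LINQ.jl, and `missing` from current Julia stands in for a null value. The idea is that a local variable can be spliced into generated SQL only if it is a plain value with a literal representation; anything else is rejected before a query is built.

```julia
# Hypothetical helper: render a captured Julia value as a SQL literal.
# Only plain, identity-less values are accepted; everything else is rejected.
sql_literal(x::Bool) = x ? "TRUE" : "FALSE"
sql_literal(x::Integer) = string(x)
sql_literal(x::AbstractFloat) =
    isfinite(x) ? string(x) : error("no SQL literal for $x")
# Single quotes inside a string are doubled, as standard SQL requires.
sql_literal(x::AbstractString) = "'" * replace(x, "'" => "''") * "'"
sql_literal(::Missing) = "NULL"
# Fallback: arrays, closures, mutable objects and so on cannot be captured.
sql_literal(x) = error("cannot capture a value of type $(typeof(x)) as a SQL literal")

cutoff = 45
query = "SELECT name FROM people WHERE age > " * sql_literal(cutoff)
```

A captured array or user-defined object hits the fallback method and raises an error, which mirrors the restriction described above.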
It might just make sense to have a code translator that can translate back and forth between LINQ, SQL, and native code. Then some sort of query function that can take either LINQ or SQL code and use the desired input type and output type to come up with the right thing to do. +1 for automatic lifting. |
That probably requires more work to do since we'd have to write all the same semantics support as we would with the other proposals while also adding an additional syntax layer. Not a bad idea in the long run, but probably makes the task about twice as hard. |
I was under the impression that was what other proposals were basically doing already? For example, DataFramesMeta takes a vaguely SQL/LINQ like syntax and then uses metaprogramming to translate it into native Julia code. It's also very similar to dplyr, which can translate dplyr syntax either into C++ code or SQL code. Riffing off of pandoc, maybe PanQuery? P.S. the translator could probably support other syntaxes as well, like dplyr or data.table |
I have just finished a very prototypish implementation of Queryable support for SQLite in LINQ.jl. Take a look at example 04 in the examples folder to get a sense. The code is unbelievably unstable, almost any modification of either the So at this point I still need to add a macro that actually provides language integrated query, and that is the next step. Once that is done, I should have examples of all the major architectural parts needed. All the implementations are unbelievably buggy and non-robust, but I first want to get an implementation that shows off how the major parts interact. And just to recap, what is happening in the package: if you query a source that has a |
The LINQ design uses an iterator design as a conceptual basis, but any QueryProvider can do whatever it likes and never iterate anything. I'm just following the .Net LINQ design here.
Yes, that is the idea of LINQ and how I implemented this in LINQ.jl: when you use a data source like a SQL database you write your queries as if you were using
The way .Net LINQ handles this is pretty simple: you can write any valid C# code in your lambdas, but only a subset of that can be used when you have a Queryable data source. If you end up using something in your lambdas that can't be translated into SQL against a SQL data source, you get an error. If you execute this against an iterator/enumerable source, everything works and you can go crazy.
I have no good solutions for this, other than hope that julia might add things like the |
Oh, and @johnmyleswhite please add me to the Google doc, my email is on my github profile page. |
Ok, well I wrote up a basic framework for a translation based API. See gist here: |
Does this error out after generating SQL and seeing that it fails? (Which would be really easy to implement.) Or does it error out earlier? |
I'm very torn on this issue. On the one hand, your solution is clearly much more general. On the other hand, I increasingly suspect that >99% of use cases can be solved using the default lifting strategy that assumes an expression evaluates to |
It errors out when the QueryProvider tries to translate the query tree into a SQL statement. The provider realizes that the AST for the lambda has some construct that has no equivalent in SQL, and throws an error. So nothing gets sent to the DB because things fail earlier, in the translation phase. |
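A minimal sketch of that translation-time failure, written against Julia expression trees rather than C# ones. The names `SQL_OPS` and `to_sql` and the tiny operator table are assumptions for illustration; this is not the code in LINQ.jl or in .Net. The translator walks the expression, and the first construct without a known SQL equivalent raises an error, so nothing is ever sent to the database.

```julia
# Hypothetical table of Julia calls that this sketch can express in SQL.
const SQL_OPS = Dict(:(>) => ">", :(<) => "<", :(==) => "=",
                     :(&) => "AND", :(|) => "OR")

to_sql(x::Number) = string(x)
to_sql(s::Symbol) = string(s)   # bare names are treated as column names

function to_sql(ex::Expr)
    # Only binary calls to a known operator are translated.
    if ex.head == :call && length(ex.args) == 3 && haskey(SQL_OPS, ex.args[1])
        lhs = to_sql(ex.args[2])
        rhs = to_sql(ex.args[3])
        return "(" * lhs * " " * SQL_OPS[ex.args[1]] * " " * rhs * ")"
    end
    # Anything else fails here, during translation.
    error("cannot translate `$ex` to SQL")
end

where_clause = to_sql(:((age > 45) & (height < 2)))
```

Calling `to_sql(:(sqrt(age) > 3))` throws, because `sqrt` is not in the table. A real provider would need a much larger table and, as the thread notes, would have to know which backend it targets.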
So if we were to implement similar functionality in Julia, we would have to know the exact backend we're on, since not all databases support the same SQL constructs (apart from the obvious core clauses). Right? |
I think one of the core design differences between jplyr and LINQ is that jplyr is more prescriptive about the conceptual data model: as @johnmyleswhite writes, the things that the queries operate on are bags of named tuples (or something like that). LINQ is less prescriptive here: the core If you start to query a DB or a DataFrame, you will actually start out with a bag of named tuples, and in the case of a DB, if you stick to named tuples in all your query operators, it can be translated into SQL and executed in the DB. But that just happens to be a special-case restriction that a specific QueryProvider imposes, for example; it is not core to the general idea of LINQ. That does give you a lot of flexibility: you can in theory use LINQ to query XML, JSON, any custom data structure, deeply nested hierarchical stuff etc. |
Well, that is actually an interesting question... I think in .Net they have one QueryProvider that can actually talk to multiple databases, so they have another abstraction layer between the LINQ stuff and concrete SQL databases (the whole entity framework). But this could just be implemented in different ways: either you could have a QueryProvider per DB backend, or you could have some generic SQL QueryProvider that can talk to multiple different DB backends. The latter probably makes more sense because there is probably a lot of similar functionality for different DB backends. |
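One way the "generic SQL provider with several backends" option could look in Julia is multiple dispatch on a dialect type. Everything named here (`SQLDialect`, `SQLiteDialect`, `MySQLDialect`, `quote_ident`, `select_sql`) is hypothetical and only sketches the shape of the design; the one backend difference shown is identifier quoting, where MySQL uses backticks by default and standard SQL uses double quotes.

```julia
abstract type SQLDialect end
struct SQLiteDialect <: SQLDialect end
struct MySQLDialect <: SQLDialect end

# Shared default: standard SQL double-quoted identifiers.
quote_ident(::SQLDialect, name) = string('"', name, '"')
# Backend-specific override: MySQL quotes identifiers with backticks by default.
quote_ident(::MySQLDialect, name) = string('`', name, '`')

# The generic part of the provider is written once against the abstract type.
function select_sql(d::SQLDialect, table, cols)
    collist = join((quote_ident(d, c) for c in cols), ", ")
    return "SELECT " * collist * " FROM " * quote_ident(d, table)
end

sqlite_query = select_sql(SQLiteDialect(), :people, [:name, :age])
mysql_query = select_sql(MySQLDialect(), :people, [:name, :age])
```

Under this layout a new backend only adds methods for the pieces that differ, which is the "lot of similar functionality" argument made above.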
@ararslan Essentially, if you look at the code here, this is where I translate the query DAG into SQL. That function is the most hacky, unrobust thing I've ever written, and a proper implementation would detect whether the lambdas that are passed actually contain stuff that can't be supported in the DB and throw errors. |
@johnmyleswhite I'm not fully up to speed with the whole lifting discussion. But in C#, the compiler provides lifting, somehow? Wouldn't that be the best solution, i.e. have proper lifting semantics in julia itself, and then have something like LINQ.jl just pick up the default language semantics? |
I don't know if LINQ adds more to the C# spec, but C# only lifts the core arithmetic functions, which means that something like I'm increasingly confident that the number of exceptions to those semantics that are needed in practice is very small. I'd like to break that functionality out since automatic lifting and application of a function to a source of tuples are mostly orthogonal. But my feeling considering how much unhappiness I've seen with the state of |
Yes, I agree, the whole "how do we deal with nullable" debate seems orthogonal to the "what is the core data model for a query framework" question. I watched @johnmyleswhite juliacon talk from last year now and thought a bit more about the nullable situation. Here are some random comments, I still haven't made up my mind:
|
I'm not so sure about longer term solutions. But I think in the short term, separate packages for lifting and querying might be good. The querying package could automatically sprinkle in |
P.S. I'd like to advocate for making sure standard-evaluation versions exist for all macros. This is close to the strategy Hadley uses, and DataFramesMeta makes use of it as well. Simple example:
chain(a, b) = :($a($b))
macro chain(args...)
chain(args...)
end |
Whatever happens, JuliaLang/julia#16961 should be finalized and merged. Then, for simple call-site lifting, EDIT: and, in most cases, a lifting package would then be unnecessary. |
@davidagold I don't understand the idea about |
@davidanthoff The idea is that Some context: A big question we'd run into earlier with call-site lifting is how to select the parameter EDIT: Though hopefully the cases in which one will need to manually lift via |
I've added a query syntax macro to LINQ.jl, take a look here in the example folder. I think with that I have an example of all the major pieces of a LINQ implementation there. Not one of them is robust, but I think this should be enough for people to get an idea about the whole design. Some highlights of the existing implementation:
@johnmyleswhite, @davidagold I'd be very interested to hear what you think. |
@bramtayl: Please stop commenting on this issue until you have done enough serious engineering work for me to believe that your opinion reflects a measured and thoughtful consideration of all of the technical details involved. As is, you are wasting my time. |
@bramtayl I welcome your comments here. @johnmyleswhite I would appreciate it if we could keep the debate here civil and refrain from personal attacks. |
I have been meeting, and will continue to meet, with @davidagold to discuss the generation of SQL for backends. What I think we can agree on is that we don't yet know enough to commit to a single decision for now, and we don't have that much time left to wrap this up as a summer project for David, before he has to leave Cambridge/MIT. To describe it as a "waste of time" and "doing the same thing" is a passive-aggressive way of insisting on a level-of-commitment/involvement from us that doesn't exist. And John's comment was meant to help shield David from leaving the impression that he doesn't care about feedback. But to make further progress, he needs additional help and support, rather than advice/debate/discussion on what he should do. |
I didn't mean that as an attack, just a suggestion, admittedly out of a place of ignorance. I would like to contribute to this project in the future, and I'll try to be more respectful in the future. |
I assume you mean that it commits us to Julia's semantics for splicing because of convention, i.e. people will expect the semantics of |
The README has the link. I also cleaned it up a bit so that this info is easier to find. Plus, everything is on |
Yes, that will definitely be required for my LINQ.jl case. I guess if the dot syntax with |
I added support for simple joins to LINQ.jl, see example 8. Two (maybe) general interest points from that:
|
Oh, and finally: this blog has a 17 part series on how to build a LINQ IQueryable provider. It is a really, really fascinating read. The problem that is described there is essentially how you go from a C# query expression tree (like a julia expression tree) to a SQL statement. Slightly scary how complicated the whole thing is... |
17-part series.......... 😱 |
I don't know if there's any reason to suppose that As for our handling of mixed lifting in @query filter(tbl, a > .5) where for row in row_itr_over_relevant_columns # in this case (tbl[:a],)
if !hasnulls(row) && f(map(get, row))
push!(result, row)
end
result
end (Note: this isn't quite what actually happens atm: see here.) We can do any sort of lifting -- mixed or otherwise -- this way, so long as the function to be (mixed) lifted (in this case
I don't understand. There are no names "external" to either for c in (1, 2, 3)
@from i in df1 begin
@where i.age > esc(c)
@select ...
end ... ? |
Good point, I should add an example that actually shows how that works, but that query would simply be: for c in (1, 2, 3)
@from i in df1 begin
@where i.age > c
@select i
end
end Essentially you are always in the julia scope, and the |
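The same scoping behaviour can be seen without any query macro. The sketch below is a plain-Julia analogy, not Query.jl code: `people` and `older_than` are made-up names, and a comprehension stands in for the `@where`/`@select` clauses. The range variable `i` is bound by the query, while `c` is an ordinary variable from the enclosing scope and needs no escaping.

```julia
# Made-up in-memory source: a vector of named tuples.
people = [(name = "Ann", age = 50), (name = "Bob", age = 30), (name = "Cy", age = 70)]

# `i` is the range variable; `c` is captured from the caller like any other local.
older_than(source, c) = [i.name for i in source if i.age > c]

for c in (25, 45, 65)
    println(c, " => ", older_than(people, c))
end
```

Because `c` is just a Julia value here, the question of how to pass it to a database only arises once a provider translates the query into SQL.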
Very interesting read here - sorry I've been absent from these discussions since JuliaCon.
Cool! I'd like to check that out, but that might be another few days at this rate :( Anyway, I think the discussion here has been pretty constructive. I particularly agree that we should think of tables as bags/sets of named tuples, but that we have no specification for how either the named tuple or the collection/bag/set is implemented. In the case where we don't care about the implementation, the way things work efficiently in About nullables, I have previously been strongly in the camp of manual lifting like Swift (or I assume C# from what I'm reading here). A better API in Base would go a long way. |
@davidanthoff Ah, of course. That makes a lot of sense, and is very handy. Though I don't think it'll be too hard to avoid a clash in something like @query inner_join(tbl1, tbl2, by = (a = f(b))) by following a convention that the name on the LHS (i.e. |
I think the crucial difficulty is that there are three sets of semantics to be considered: the semantics of non-nullable Julia, the semantics of nullable Julia and the semantics of SQL. You're never going to achieve full consistency, so you have to pick the cases where you're willing to make sacrifices. |
So what actually would happen with this in jplyr: @query filter(tbl, isnull(a)) Or for that matter any filter expression that calls a function that knows how to deal with |
An excellent question =p Right now, it'd break. In the future, we will support a (hopefully limited) We are certainly making a wager, namely that the space of situations in Of course, perhaps some day small Union types will be performant and we On Thursday, August 11, 2016, David Anthoff notifications@github.com
|
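The `isnull` problem raised in this exchange can be reproduced in a few lines. This is a sketch under assumptions: it uses `missing` from current Julia instead of the `Nullable` type discussed in this thread, and `lifted_filter` is a made-up stand-in for the default lifting strategy (skip any row that contains a null, otherwise call the predicate on the unwrapped values), not jplyr's implementation.

```julia
# Default lifting as described in the thread: a row containing any null is
# dropped, and the predicate only ever sees non-null values.
lifted_filter(pred, rows) = [r for r in rows if !any(ismissing, r) && pred(r...)]

rows = [(0.7,), (missing,), (0.2,)]

# An ordinary predicate behaves as expected under lifting.
kept = lifted_filter(a -> a > 0.5, rows)

# A predicate that is *about* nulls never sees one, so the result is empty.
nulls = lifted_filter(ismissing, rows)
```

Here `kept` contains only the row `(0.7,)`, while `nulls` is empty even though the data has a null row. That is the kind of function a blacklist of exceptions to default lifting would have to cover.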
David and I have spent a lot of time thinking about that kind of issue. It's very troubling, but should be soluble if you believe lifting almost always works in the same way, except for a small blacklist of functions. The other approach strikes me as essentially a whitelist, in which you list all functions that need to be lifted. It's not clear to me which is superior. But what really keeps me up at night is you can produce frightening examples in the other direction when you allow For example, suppose I do the following: Base.+(x::Nullable{Int}, y::Nullable{Int}) = x.value * y.value
@select(tbl, x + y) What does this do when run on a pure Julia data structure? What SQL does it translate into? |
Regarding this comment and the example John gives, I had a funny moment in which I really wanted to say something like, "Well, one solution is to regard Then I thought, hey, wait a minute... =p But honestly, (and I suppose ironically given my work on The use of EDIT: "data" is plural. |
Also, I know probably most people won't care about this, but I just realized that what I said earlier,
is not correct. We do index into the |
Some updates:
Generally, any feedback or help (PRs!) would be really welcome. The starting point for anyone interested should be the |
I've put up a very small package, PanQuery.jl (not another one!). It has an absolutely bare-bones implementation of an Abstract Query Language (AQL). It's basically a simplified version of the graphs in jplyr. The idea is that packages would define their own methods to convert back and forth between Queries in various languages, like Julia, SQL, LINQ, AQL, dplyr, or data.table. Currently, only one such method is implemented, Julia -> AQL. |
I've added support for DataStreams.jl (@quinnj) and NDSparseData.jl (@JeffBezanson) sources to Query.jl. I've also mapped out the issues that still need to be resolved for an initial version, the whole thing at this point seems pretty manageable, so I'm optimistic that we can have a fully implemented version of LINQ for julia soonish :) There is lots of performance work left to be done, but things look reasonable right now and I'll start to look at these optimizations later. |
Query.jl now has a complete and functional implementation of the in-memory data source part of LINQ as in the C# spec, with a couple extra things here and there. There is one prototype for a data source that queries SQLite via query translation, but that is at best a prototype. My goal is to release and announce the package soon with the current functionality for in-memory sources (but this includes e.g. any DataStream source, e.g. CSV etc.). If some folks from this thread want to take the package for a test drive and report any feedback before I announce it more widely, I would greatly appreciate it. I've started with documentation, but for now the best way to learn about the package is to look at the example folder, and then pretty much any online article about LINQ should more or less apply. In terms of remaining work, there is lots:
|
I'm just waiting for my v0.1.0 tag to be merged in METADATA and then I'll announce Query.jl on the mailing lists. Any last minute feedback if you have used the package would be most welcome! In particular if you think there is something pressing that needs to be resolved before I announce it widely. |
Closing as Query.jl is now published and stable. |
Nice tutorial based on dplyr Flights Tutorials https://www.juliabloggers.com/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/ Re-posted from: https://cbrownley.wordpress.com/2017/11/29/data-wrangling-in-julia-based-on-dplyr-flights-tutorials/ |
There are a bunch of efforts under way in this area, and this issue can serve as a discussion place for any cross-cutting issues. Currently, there seem to be these efforts (let me know if there are more and I'll add them):
There are various writeups that people should be familiar with, e.g.