-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add crossjoin #531
Add crossjoin #531
Conversation
Why not use |
I didn't like the ambiguity about what to do with
That said, one less export would be nice -- I just didn't think of anything that struck me as clean enough. Separately, I thought of defining |
Rebased against master. |
Sorry, I don't understand you second point. Why wouldn't |
I was just tying to say that either join type would be vary by argument type ( As far as implementing the change, I'm not sure if it's possible to define a method with a keyword argument that only takes a specific symbol ( |
Actually I had the redundant syntax in mind in my first comment. I'd say throw an error if tuples are passed and |
I believe that would just involve an extra level of indirection for all crossjoin So, a little unfortunate clutter under the hood, but happy to do it if there's consensus that the API would be improved. P.S. I left out a major consideration - if we extended join((:a, 1:2), (:b, 2:3))
"a(:b,2:3)1\n2\n" ;) |
Do we already define |
Other than my point about tuples, the code here looks good. For tuples, why not just this something closer to In general, I've noticed a few times recently that we'd benefit from an abstraction that takes an iterator that produces rows of data and collects them together. |
SQL does cross joins only for tables because that's all they have and on the other end of the spectrum, I'd lean towards moving this functionality to a |
Let's think about this a bit more. My hesitation mostly reflects the fact that Iterators.jl already provides a tool for providing Cartesian products as an iterable, so it seems like we could provide a tool that turns those iterables into a DataFrame. The value of that approach is that we'd get all the other iterables at the same time. Partly I bring this up because I recently wrote a function that works a lot like your I feel like there's a set of powerful abstractions underlying all of this that we could combine into something really interesting. |
All I can say is that it seemed nice to have |
That makes a lot of sense, @johnmyleswhite. I'll think about it, too. @nalimilan Half of me agrees with you about the name, but I'm pulled the other way because 'cross join' was a strange name for the operation of the first place, it's since it's no more similar to the 'main' joins than than is |
My three cents:
If we stopped supporting natural joins, I think it would be easy to support keywords that switch between |
Fully agreed. Natural join is dangerous and people can very easily take the intersection of column identifiers if that's really what they want. |
Just to be clear, we're still supporting natural joins in the case where there is one matching column name (throwing if there's more), which still could be a little risky, I just saw it as a reasonable compromise given it was already supported -- so the decision would be to get rid of that. I've been on the line about putting Are you thinking we should get rid of |
Yeah, I am proposing getting rid of the existing column matching. Sorry for forgetting |
At the risk of being pedantically repetitive, the tuples case of crossjoin already does exist as |
I'm going to open another issue to discuss generalizations of |
The changes sound good to me. I'll test my thoughts on the shortcomings of an unextended |
Sorry about that, wrong button. @johnmyleswhite, my worries about limiting ourselves with respect to cartesian products seems to fit better here than the new issue, so (sorry, it's long):
Allowing only two DataFrames in a cross join means inefficiency: cj() = crossjoin(df1, df2, df3)
cjcj() = crossjoin(crossjoin(df1, df2), df3)
compares([cj, cjcj], 100_000)
Bytes:
3880
6400
2x7 DataFrame
|-------|-------|----------|------|----------|----------|--------|---------|
| Row # | Split | Function | Rank | Relative | Mean | Reps | Elapsed |
| 1 | 1 | "cj" | 1 | 1.0 | 1.525e-5 | 100000 | 1.52514 |
| 2 | 1 | "cjcj" | 2 | 1.482 | 2.261e-5 | 100000 | 2.26056 |
|
You're right: there's some definite efficiency issues with One big concern I have is that I don't understand why Sorry to be such an impediment in this issue. I'm just trying to clamp down on functionality that I'm not 100% confident we'll want to support a year from now. I feel like there's a lot of good ideas in this PR, but that, aside from pure DataFrame joining, they're not quite in their final form yet. |
Yeah, that makes sense, my viewpoint is that Supporting only DataFrames would cover use cases as long as it took more than two arguments, but it would be verbose to convert each vector to a DataFrame, and currently much slower: Bytes:
26432
54008
2x7 DataFrame
|-------|-------|----------|------|----------|-----------|-------|---------|
| Row # | Split | Function | Rank | Relative | Mean | Reps | Elapsed |
| 1 | 1 | "cjt" | 1 | 1.0 | 2.22e-5 | 10000 | 0.22199 |
| 2 | 1 | "cjdf" | 2 | 8.455 | 0.0001877 | 10000 | 1.87696 |
For what it's worth, I mentioned above that I wanted keywords. I went with tuples because I didn't know how to determine the order of all args in the call (only the order of non-keyword args separately from the order of kwargs), so if I used kwargs, something like I don't see as many cases here as in other frameworks I've used where relying on sorting, etc, is important to efficient code, apart from perhaps creating grouped data frames differently if data is already sorted by the keys. But I imagine when indexing comes back in to play there may be more cases where having things sorted is helpful for more than just looks and display. Also, I thought tuples would be easier and cleaner for writing UDFs that cross join on an unknown amount of arguments with unknown names, but I wondered if that was just me not knowing my way around Julia well enough yet. |
I find the inconsistency of using tuples only here unappealing, too. Here are my thoughts so far on related operations elsewhere in the codebase: If (Side note -- right now I feel like there might be some parallel with having fast spit-apply-combine operations without ending up with random col names, too, but I haven't thought through the mechanics. |
Something like And you're totally right: I do see cross joins as a kind of join. One question: is it ever possible in SQL to do a join between more than two tables with a single call to join? I've only ever used join as a binary operator, but that may just be my own idiosyncratic experiences. If it's true that join is always a binary operator in SQL, I don't think we need |
Regarding tuples: why not use dictionaries instead? I'm a lot more comfortable with that. |
Regarding your points about
While doing this, we should push for fewer new abstractions. Right now there's some weird overlap between |
Regarding Regarding cross joins only being binary: Regarding tuples vs dicts: I hadn't considered dictionaries at all, but I don't know why not. |
I guess my question is: when will doing two separate |
So I do agree with you: we should implement something like |
I think it's safe to say no, but you could say that about a lot of operations and though you'd be right for most workflows, it only seems a valid factor for consideration when efficiency is at odds with factors like consistency of API and simplification of user code. crossjoin(df1, df2, df3)
# or
join(df1, df2, df3, kind = :cross)
# seem much nicer to read and write than and fit easier into an expression than
join(join(df1, df2, kind = :cross), df3, kind = :cross)
# the same goes for
crossjoin(df1, df2, [:var => 1:9])
# or
crossjoin(df1, df2, (:var, 1:9))
# or
join(df1, df2, (:var, 1:9), kind = :cross)
# vs
join(join(df1, df2, kind = :cross), DataFrame(var = 1:9), kind = :cross) Whether it's better or worse for consistency of API depends on how we deal elsewhere with avoiding generating variables with random names and separately tracking down and renaming them. |
Okay, that makes complete sense to me. I'll add it to |
I just put #536 to make the suggested updates to join. I'm going to close this one because it's a little muddled with the general question of how to specify variable names when passing something other than a DataFrame. @johnmyleswhite Thanks for fleshing things out with all the back and forth! I regret being such a time suck, but there were some design questions I was struggling with, and I have a clear idea what they actually are now. |
Adds
crossjoin
(e.g.crossjoin(df1, df2)
orcrossjoin((:A, [1,2]), (:B, [2, 3]), (:C, 'A':'z'))
).On one hand it's pretty easy for a user to implement. On the other, it's in ANSI SQL, and I use it pretty often in the form of
data.table
'sCJ
-- depending on the kind of data you work with, you may never use it.