
Add crossjoin #531

Closed
wants to merge 1 commit into from

Conversation

@garborg (Contributor) commented Feb 7, 2014

Adds crossjoin (e.g. crossjoin(df1, df2) or crossjoin((:A, [1,2]), (:B, [2, 3]), (:C, 'A':'z'))).

On one hand it's pretty easy for a user to implement. On the other, it's in ANSI SQL, and I use it pretty often in the form of data.table's CJ -- depending on the kind of data you work with, you may never use it.
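(The core of a two-vector cross join is easy to sketch. The helper below is hypothetical — it is not the PR's implementation — and uses `Base.repeat` with its `inner`/`outer` keywords: every element of the left input is paired with every element of the right.)

```julia
# Sketch of what a two-argument cross join computes (hypothetical helper,
# not the PR's code): each element of `a` repeated once per element of `b`,
# with `b` tiled alongside it.
function crossjoin_pair(a::AbstractVector, b::AbstractVector)
    na, nb = length(a), length(b)
    left  = repeat(a, inner = nb)   # each element of `a` repeated length(b) times
    right = repeat(b, outer = na)   # `b` cycled length(a) times
    return left, right
end

l, r = crossjoin_pair([1, 2], [2, 3])
# l == [1, 1, 2, 2]; r == [2, 3, 2, 3]
```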

@nalimilan (Member)

Why not use join(..., kind = :cross) for that?

@garborg (Contributor, Author) commented Feb 7, 2014

I didn't like the ambiguity about what to do with join(df, df):

  • :cross if no common names, :inner if one common name, throw otherwise? -- seemed too opaque and error prone.
  • Always :inner? -- seems sensible but then syntax that works for join(df, (:name, array)) or join((:a, array), (:b, range)) wouldn't work for cross joining two DataFrames.

That said, one less export would be nice -- I just didn't think of anything that struck me as clean enough.

Separately, I thought of defining crossjoin(xs...; xks...) so crossjoin(df, (:c, 1:2)[, ...]) could be written crossjoin(df, c = 1:2[, ...]), but I was hesitant to give up faithful col ordering and row sorting in cases where any DataFrame was passed in after any AbstractArray (e.g. crossjoin(a = 1:2, df, d = 2:3))

@garborg (Contributor, Author) commented Feb 7, 2014

Rebased against master.

@nalimilan (Member)

Sorry, I don't understand your second point. Why wouldn't join(df1, df2, kind = :cross) work?

@garborg (Contributor, Author) commented Feb 7, 2014

I was just trying to say that either the join type would vary by argument type (join(df1, df2) performing an inner join while join(tuple1, tuple2) performs a cross join, which is also a valid operation for (df1, df2)), or cross joins would require redundant syntax for tuples, join(tuple1, tuple2, kind = :cross), which may still be cleaner than adding a new export -- I suppose I also leaned away from combining them because Cartesian products seem so different from joining on a key.

As far as implementing the change goes, I'm not sure it's possible to define a method with a keyword argument that only takes a specific symbol (:cross), and if not, disambiguation would be a little clunky / ugly. I'd still be open to it, depending on people's preferences.

@nalimilan (Member)

Actually I had the redundant syntax in mind in my first comment. I'd say throw an error if tuples are passed and kind = :cross is missing. You can just implement the method with tuples and the keyword argument in the signature, and check that kind = :cross AFAICT.
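(nalimilan's suggestion can be sketched as a runtime check, since Julia dispatches on positional argument types only, never on a keyword argument's value. The function name and signature below are hypothetical:)

```julia
# Sketch: accept tuples only together with kind = :cross, verified at
# runtime (hypothetical names; not the actual DataFrames method).
function join_tuples(t1::Tuple, t2::Tuple; kind::Symbol = :inner)
    kind == :cross || throw(ArgumentError("tuple arguments require kind = :cross"))
    # ... build the Cartesian product of t1 and t2 here ...
    return kind
end

join_tuples((:a, 1:2), (:b, 2:3), kind = :cross)  # runs
# join_tuples((:a, 1:2), (:b, 2:3))               # throws ArgumentError
```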

@garborg (Contributor, Author) commented Feb 7, 2014

I believe that would just involve an extra level of indirection for all cross-join methods, an extra check upstream of (:inner, :outer, ...) in the existing join method, and throwing pseudo-"no method" errors in a couple of spots.

So, a little unfortunate clutter under the hood, but happy to do it if there's consensus that the API would be improved.

P.S. I left out a major consideration - if we extended join for a subset of tuples, using DataFrames would ruin this behavior:

join((:a, 1:2), (:b, 2:3))
"a(:b,2:3)1\n2\n"

;)

@johnmyleswhite (Contributor)

Do we already define join over things that are just tuples? That seems kind of featurey to me. I'd prefer that join only apply to DataFrame and potentially to Dict.

@johnmyleswhite (Contributor)

Other than my point about tuples, the code here looks good.

For tuples, why not use something closer to product2df?

In general, I've noticed a few times recently that we'd benefit from an abstraction that takes an iterator that produces rows of data and collects them together.

@garborg (Contributor, Author) commented Feb 8, 2014

SQL does cross joins only for tables because that's all they have and on the other end of the spectrum, data.table defines CJ only for vectors (like product2df). Using data.table, most of my uses have been on vectors and combinations of 'tables' and vectors, where I use my own function from a little library of hacks. If product2df takes 'tables' as well as vectors, then no need to define anything else -- if not, it would mean two different functions doing cartesian products and still leaving a hole where the user needs to convert vectors to DataFrames or roll their own function (in the case of mixed inputs).

I'd lean towards moving this functionality to a product2df or keeping tuples here, just because I think having two functions and still leaving a gap in functionality would be a little unsatisfying. But I do recognize it feels featurey.

@johnmyleswhite (Contributor)

Let's think about this a bit more.

My hesitation mostly reflects the fact that Iterators.jl already provides a tool for providing Cartesian products as an iterable, so it seems like we could provide a tool that turns those iterables into a DataFrame. The value of that approach is that we'd get all the other iterables at the same time.

Partly I bring this up because I recently wrote a function that works a lot like your crossjoin function, but takes in keyword arguments and a function. It generates the Cartesian product and applies the input function to each element of the product, appending the result columnwise to the DataFrame constructed from the Cartesian product. To make it work, I built much of the same functionality you've built.

I feel like there's a set of powerful abstractions underlying all of this that we could combine into something really interesting.
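(The iterator johnmyleswhite refers to lived in the Iterators.jl package at the time; in current Julia the equivalent is `Base.Iterators.product`. A quick look at what it yields:)

```julia
# Base.Iterators.product lazily yields tuples from the Cartesian product;
# collecting and flattening it shows them in column-major order.
using Base.Iterators: product

tuples = vec(collect(product(1:2, 'a':'b')))
# 4-element vector: (1,'a'), (2,'a'), (1,'b'), (2,'b')
```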

@nalimilan (Member)

All I can say is that it seemed nice to have join() support all kinds of operations SQL allows using $SOMETHING JOIN. I've no opinion on the extension of cross joins to objects other than DataFrames.

@garborg (Contributor, Author) commented Feb 8, 2014

That makes a lot of sense, @johnmyleswhite. I'll think about it, too.

@nalimilan Half of me agrees with you about the name, but I'm pulled the other way because 'cross join' was a strange name for the operation in the first place; it's no more similar to the 'main' joins than hcat is; it doesn't take an on kwarg; and it's a case where it's common, straightforward, and better-performing to accept more than two inputs.

@johnmyleswhite (Contributor)

My three cents:

  • I completely agree with supporting the most important joins. From this list http://en.wikipedia.org/wiki/Join_(SQL), I'd like to support cross join, inner join, left outer join, right outer join and full outer join. Technically, our implementations are equi-joins, but that's good enough for now. So I'd like to merge a form of cross join for DataFrames soon.
  • Reading that article made me appreciate that our attempt to automatically find a matching column name is what they call a natural join in SQL, which is deprecated because of safety issues. I find the arguments for deprecating natural joins quite compelling since it's really uncool for a data pipeline to change its behavior because someone added columns to a data source. As I often feel, using natural joins seems to value convenience too much over reliability.
  • The issue about iterators is a broad one we can discuss in another issue without holding up merging a cross-join for DataFrames. I bring it up because I feel like the tuple argument version of cross-join reflects a broader pattern of functionality that we can find a more general way to express.

If we stopped supporting natural joins, I think it would be easy to support keywords that switch between :cross, :inner, :left, :right and :outer. What do you think of that, @garborg?

@nalimilan (Member)

Fully agreed. Natural join is dangerous and people can very easily take the intersection of column identifiers if that's really what they want.

@garborg (Contributor, Author) commented Feb 8, 2014

Just to be clear, we're still supporting natural joins in the case where there is one matching column name (throwing if there's more than one), which could still be a little risky; I just saw it as a reasonable compromise given it was already supported -- so the decision would be to get rid of that.

I've been on the fence about putting :cross under join, probably partly because covering the many useful kinds of crosses would mean a function elsewhere that returns the same thing when given two DataFrames -- I suppose having two ways to do the same thing may be worth it here, just for familiarity with SQL.

Are you thinking we should get rid of :semi and :anti? Even though they're less common, I like that dplyr has them, and here, with join already supporting :left, supporting the other two is virtually free, and saves the user from choosing between inefficient workarounds and digging into internals.

@johnmyleswhite (Contributor)

Yeah, I am proposing getting rid of the existing column matching.

Sorry for forgetting :semi and :anti. I think we should keep those.

@johnmyleswhite (Contributor)

At the risk of being pedantically repetitive, the tuples case of crossjoin already does exist as product. What's missing is a mechanism for collecting the results into a DataFrame and for giving the results names.
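(The missing piece johnmyleswhite describes — collecting a product into named columns — can be sketched in a few lines. The helper name is hypothetical, and it stops at plain vectors; wrapping them in a DataFrame would be one more step:)

```julia
# Sketch: collect a Cartesian product into named columns (hypothetical
# helper; produces a Dict of column vectors rather than a DataFrame).
using Base.Iterators: product

function product_columns(colnames::Vector{Symbol}, vs::AbstractVector...)
    tuples = vec(collect(product(vs...)))
    return Dict(n => [t[i] for t in tuples] for (i, n) in enumerate(colnames))
end

cols = product_columns([:A, :B], [1, 2], [2, 3])
# cols[:A] == [1, 2, 1, 2]; cols[:B] == [2, 2, 3, 3]
```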

@johnmyleswhite (Contributor)

I'm going to open another issue to discuss generalizations of product.

@garborg (Contributor, Author) commented Feb 8, 2014

The changes sound good to me.

I'll test my thoughts on the shortcomings of an unextended product and clarify in that issue.

@garborg (Contributor, Author) commented Feb 8, 2014

Sorry about that, wrong button.

@johnmyleswhite, my worries about limiting ourselves with respect to Cartesian products seem to fit better here than in the new issue, so (sorry, it's long):

product doesn't like DataFrames:


crossjoin(df1, df2, t1, t2)
24x6 DataFrame
| Row # | a | b   | c | d   | e  | f  |
|-------|---|-----|---|-----|----|----|
| 1     | 1 | 'a' | 3 | 'd' | 7  | 16 |
| 2     | 1 | 'a' | 3 | 'd' | 7  | 8  |
| 3     | 1 | 'a' | 3 | 'd' | 17 | 16 |
| 4     | 1 | 'a' | 3 | 'd' | 17 | 8  |
| 5     | 1 | 'a' | 4 | 'e' | 7  | 16 |
| 6     | 1 | 'a' | 4 | 'e' | 7  | 8  |
| 7     | 1 | 'a' | 4 | 'e' | 17 | 16 |
| 8     | 1 | 'a' | 4 | 'e' | 17 | 8  |
| 9     | 1 | 'a' | 5 | 'f' | 7  | 16 |
| 10    | 1 | 'a' | 5 | 'f' | 7  | 8  |
| 11    | 1 | 'a' | 5 | 'f' | 17 | 16 |
| 12    | 1 | 'a' | 5 | 'f' | 17 | 8  |
| 13    | 2 | 'b' | 3 | 'd' | 7  | 16 |
| 14    | 2 | 'b' | 3 | 'd' | 7  | 8  |
| 15    | 2 | 'b' | 3 | 'd' | 17 | 16 |
| 16    | 2 | 'b' | 3 | 'd' | 17 | 8  |
| 17    | 2 | 'b' | 4 | 'e' | 7  | 16 |
| 18    | 2 | 'b' | 4 | 'e' | 7  | 8  |
| 19    | 2 | 'b' | 4 | 'e' | 17 | 16 |
| 20    | 2 | 'b' | 4 | 'e' | 17 | 8  |
| 21    | 2 | 'b' | 5 | 'f' | 7  | 16 |
| 22    | 2 | 'b' | 5 | 'f' | 7  | 8  |
| 23    | 2 | 'b' | 5 | 'f' | 17 | 16 |
| 24    | 2 | 'b' | 5 | 'f' | 17 | 8  |

julia> collect(product(df1, df2, [7, 17], [16, 8]))
16-element Array{Any,1}:
 ((:a,[1,2]),(:c,[3,4,5]),7,16)           
 ((:b,['a','b']),(:c,[3,4,5]),7,16)       
 ((:a,[1,2]),(:d,['d','e','f']),7,16)     
 ((:b,['a','b']),(:d,['d','e','f']),7,16) 
 ((:a,[1,2]),(:c,[3,4,5]),17,16)          
 ((:b,['a','b']),(:c,[3,4,5]),17,16)      
 ((:a,[1,2]),(:d,['d','e','f']),17,16)    
 ((:b,['a','b']),(:d,['d','e','f']),17,16)
 ((:a,[1,2]),(:c,[3,4,5]),7,8)            
 ((:b,['a','b']),(:c,[3,4,5]),7,8)        
 ((:a,[1,2]),(:d,['d','e','f']),7,8)      
 ((:b,['a','b']),(:d,['d','e','f']),7,8)  
 ((:a,[1,2]),(:c,[3,4,5]),17,8)           
 ((:b,['a','b']),(:c,[3,4,5]),17,8)       
 ((:a,[1,2]),(:d,['d','e','f']),17,8)     
 ((:b,['a','b']),(:d,['d','e','f']),17,8) 

Allowing only two DataFrames in a cross join means inefficiency:

cj() = crossjoin(df1, df2, df3)
cjcj() = crossjoin(crossjoin(df1, df2), df3)

compares([cj, cjcj], 100_000)

Bytes:
3880
6400

2x7 DataFrame
| Row # | Split | Function | Rank | Relative | Mean     | Reps   | Elapsed |
|-------|-------|----------|------|----------|----------|--------|---------|
| 1     | 1     | "cj"     | 1    | 1.0      | 1.525e-5 | 100000 | 1.52514 |
| 2     | 1     | "cjcj"   | 2    | 1.482    | 2.261e-5 | 100000 | 2.26056 |

collect-ing arrays (to tuples) is 50x slower than crossjoin.

@johnmyleswhite (Contributor)

You're right: there are some definite efficiency issues with product returning tuples. We may not be able to make this work as I described. But I still want to try to find a smaller set of features that cover more use cases.

One big concern I have is that I don't understand why crossjoin should take in tuples, but none of the other joins would. This is a big part of my sense that we're missing the right level of abstraction. Another one is that I don't understand why we'd use tuples instead of keyword arguments. I'm guessing this is done for efficiency, but keyword arguments seem more Julian.

Sorry to be such an impediment in this issue. I'm just trying to clamp down on functionality that I'm not 100% confident we'll want to support a year from now. I feel like there's a lot of good ideas in this PR, but that, aside from pure DataFrame joining, they're not quite in their final form yet.

@garborg (Contributor, Author) commented Feb 8, 2014

Yeah, that makes sense; my viewpoint is that crossjoin isn't a join any more than hcat is, but I may be in the minority there.

Supporting only DataFrames would cover use cases as long as it took more than two arguments, but it would be verbose to convert each vector to a DataFrame, and currently much slower:

Bytes:
26432
54008

2x7 DataFrame
| Row # | Split | Function | Rank | Relative | Mean      | Reps  | Elapsed |
|-------|-------|----------|------|----------|-----------|-------|---------|
| 1     | 1     | "cjt"    | 1    | 1.0      | 2.22e-5   | 10000 | 0.22199 |
| 2     | 1     | "cjdf"   | 2    | 8.455    | 0.0001877 | 10000 | 1.87696 |

For what it's worth, I mentioned above that I wanted keywords. I went with tuples because I didn't know how to determine the order of all args in the call (only the order of non-keyword args separately from the order of kwargs), so if I used kwargs, something like crossjoin(a = 1:2, df, d = 2:3) wouldn't have column ordering or row sorting faithful to the intent.

I don't see as many cases here as in other frameworks I've used where relying on sorting, etc., is important for efficient code, apart from perhaps creating grouped data frames differently if data is already sorted by the keys. But I imagine when indexing comes back into play there may be more cases where having things sorted is helpful for more than just looks and display.

Also, I thought tuples would be easier and cleaner for writing UDFs that cross join an unknown number of arguments with unknown names, but I wondered if that was just me not knowing my way around Julia well enough yet.
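(The ordering problem described above can be demonstrated directly: Julia separates keyword arguments from positional ones at the call site, so whether `df` appeared before, between, or after the keywords is unrecoverable inside the function. A minimal sketch, in modern Julia:)

```julia
# Positional args and keyword args arrive in separate collections, so
# their interleaving in the original call is lost.
f(args...; kwargs...) = (args, collect(keys(kwargs)))

f("df", a = 1:2, d = 2:3)
# (("df",), [:a, :d]) — the keywords' positions relative to "df" are gone
```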

@garborg (Contributor, Author) commented Feb 8, 2014

I find the inconsistency of using tuples only here unappealing, too. Here are my thoughts so far on related operations elsewhere in the codebase:

If hcat keeps accepting more than DataFrames (I'd say it should), it would be nice to make the non-DataFrame args kwargs or tuples (the same issue that I don't know how to get the ordering right between kwargs and non-kwargs applies there).

(Side note -- right now hcat accepting other args cuts clutter from user code but still creates intermediate DataFrames, it's probably good to skip the intermediate "make it a DataFrame" step -- that makes a 2x difference for 20 rows, but only 6% difference for 200K rows).

I feel like there might be some parallel with having fast split-apply-combine operations without ending up with random column names, too, but I haven't thought through the mechanics.

@johnmyleswhite (Contributor)

Something like crossjoin(a = 1:2, df, d = 2:3) seems really strange to me since the order in which keyword arguments occur shouldn't matter and keyword arguments should, for purely stylistic reasons, happen at the end of a sequence of arguments. I guess that completely rules out using keyword arguments for the case you have in mind.

And you're totally right: I do see cross joins as a kind of join.

One question: is it ever possible in SQL to do a join between more than two tables with a single call to join? I've only ever used join as a binary operator, but that may just be my own idiosyncratic experiences. If it's true that join is always a binary operator in SQL, I don't think we need join to operate on more than two DataFrames at once, even if it could be more efficient. If you want SQL expressiveness, why not use SQL?

@johnmyleswhite (Contributor)

Regarding tuples: why not use dictionaries instead? I'm a lot more comfortable with that.

@johnmyleswhite (Contributor)

Regarding your points about hcat, here's my current sense:

  • We need functions that let us tack stuff onto existing DataFrames.
  • We might want to do this row-wise (i.e. vcat) or col-wise (i.e. hcat). We also want to do it left and right.
  • When tacking things on, we want to be able to mutate an argument and we also want non-mutating variants.
  • Since it's really convenient if you can tack scalars, arrays and dictionaries onto DataFrames, we should have some kind of automatic promotion to do that. Skipping intermediate DataFrames while doing this would be a good thing, even if it complicates implementation code.

While doing this, we should push for fewer new abstractions. Right now there's some weird overlap between vcat and append! as currently defined for DataFrames. How would a vcat! function differ from append!?

@garborg (Contributor, Author) commented Feb 8, 2014

Regarding hcat and abstraction: Agreed on all your points.

Regarding cross joins only being binary:
I really like your thoughts about not reinventing what's been solved by databases. But I think it's possible to stick too tightly to the spec of declarative SQL when evaluating imperative functions that work on DataFrames, Vectors, etc.
My viewpoint is that having one abstraction that creates DataFrames from Cartesian products of various things is worth the cleaner, more efficient user code that comes with it, even if that means naming it prod2df(...) or crossjoin(...) instead of join(..., kind = :cross) (not that we shouldn't strive for tighter integration with SQL, but I don't think R, Stata, or Python have cross joins defined under join/merge).

Regarding tuples vs dicts: I hadn't considered dictionaries at all, but I don't know why not.

@johnmyleswhite (Contributor)

I guess my question is: when will doing two separate join operations over two operands each instead of one join over three operands be the bottleneck in someone's data pipeline?

@johnmyleswhite (Contributor)

So I do agree with you: we should implement something like crossjoin(...). But I'd prefer that we also include the simplest case, which is the Cartesian product of tables, in join.

@garborg (Contributor, Author) commented Feb 8, 2014

I think it's safe to say no -- but you could say that about a lot of operations, and though you'd be right for most workflows, efficiency only seems a valid consideration when it's at odds with factors like consistency of API and simplification of user code.

crossjoin(df1, df2, df3)
# or
join(df1, df2, df3, kind = :cross)
# seem much nicer to read and write, and fit more easily into an expression, than
join(join(df1, df2, kind = :cross), df3, kind = :cross)

# the same goes for
crossjoin(df1, df2, [:var => 1:9])
# or
crossjoin(df1, df2, (:var, 1:9))
# or
join(df1, df2, (:var, 1:9), kind = :cross)
# vs
join(join(df1, df2, kind = :cross), DataFrame(var = 1:9), kind = :cross)

Whether it's better or worse for consistency of API depends on how we deal elsewhere with avoiding generating variables with random names and separately tracking down and renaming them.

@garborg (Contributor, Author) commented Feb 8, 2014

Okay, that makes complete sense to me. I'll add it to join now and make crossjoin, or whatever the name, a separate PR so it has time to improve.

@garborg mentioned this pull request Feb 8, 2014
@garborg (Contributor, Author) commented Feb 8, 2014

I just put up #536 to make the suggested updates to join.

I'm going to close this one because it's a little muddled with the general question of how to specify variable names when passing something other than a DataFrame.

@johnmyleswhite Thanks for fleshing things out with all the back and forth! I regret being such a time suck, but there were some design questions I was struggling with, and I now have a clear idea of what they actually are.

@garborg closed this Feb 9, 2014