add a leftjoin! (or match! or merge! or whatever it should be called) #2259
Comments
First we should finalize the API for #2243 as this is related. When we have a decision on "update" stuff, then this can be implemented at the same time.
Sure! I think the 1.0 release will be at the end of 2020, so there is plenty of time for it. Thank you! Note, though, that for performance reasons this should be a special implementation anyway, as we will not copy the left data frame and will only need to compute the indices for the right data frame.
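To make the point about "only computing the indices for the right data frame" concrete, here is a minimal sketch in plain Julia. It is not the DataFrames.jl implementation: the names `right_indices` and `leftjoin_sketch!` are hypothetical, tables are modelled as a `Dict{Symbol,Vector}`, and the keys of the right table are assumed to be unique so that the number of rows on the left never changes.

```julia
# Hypothetical helper: for each key of the left table, find the row index of the
# matching key in the right table, or `nothing` when there is no match.
# Assumption: keys in the right table are unique.
function right_indices(left_keys::AbstractVector, right_keys::AbstractVector)
    lookup = Dict{eltype(right_keys), Int}()
    for (i, k) in enumerate(right_keys)
        haskey(lookup, k) && throw(ArgumentError("duplicate key $k in right table"))
        lookup[k] = i
    end
    return [get(lookup, k, nothing) for k in left_keys]
end

# Hypothetical in-place left join: existing left columns are neither copied nor
# reordered; only the new columns are materialized from the right table.
function leftjoin_sketch!(left::Dict{Symbol,Vector}, right::Dict{Symbol,Vector}, on::Symbol)
    idx = right_indices(left[on], right[on])
    for (name, col) in right
        name == on && continue
        haskey(left, name) && throw(ArgumentError("column $name already exists"))
        # Unmatched left rows get `missing` in the new column.
        left[name] = [i === nothing ? missing : col[i] for i in idx]
    end
    return left
end

left = Dict{Symbol,Vector}(:id => [1, 2, 3], :x => ["a", "b", "c"])
right = Dict{Symbol,Vector}(:id => [3, 1], :y => [30.0, 10.0])
leftjoin_sketch!(left, right, :id)
```

Under these assumptions the left table keeps its row order, which is the property a mutating join would presumably need to guarantee.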
Yeah, if and when I do it I might look into breaking up
Well - we will probably redesign
I was really tempted to start messing around with I'd really like to rewrite the joins to use
Actually I was considering defining However, if we can make it fast, the additional benefit of such an approach would be that doing several So in general: this direction is very promising and I would like to jointly explore it more soon!
Ok, interesting to hear we had the same thought about this. I can't think of any reason why it couldn't be efficient. The only potential issue with this approach that concerned me is that it might be less efficient to generate the group keys and then hash them than to call some dead-simple broadcasted hash function on the rows before even getting into the groupby stuff. I would think that the most expensive part of all of this is the actual concatenation (we clearly don't want views for the joins), and I don't see any reason why that couldn't be just as fast using groupby as by going a more direct route. I think there may also be some extra "fluff" in the user-facing It seems to me that a
As I think of it I see the following performance optimization:
Hm, interesting, but wouldn't that require large modifications to the
The benefit is that by calling two
and after this you still need to do the mapping between group numbers in On the other hand, if you are doing it in a single processing step, you do not need to create
telling you which rows of the second data frame map to which groups of the first data frame. Out of such a data structure it is really easy to properly resize the first data frame (you just look at the lengths of
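The snippets in the comment above were lost, so the following is only a guess at the kind of structure being described: a vector of row-index vectors, one per group of the first data frame, built in a single pass over the second. The function names `rows_per_group` and `leftjoin_length` are hypothetical, and the code is plain Julia, not DataFrames.jl internals.

```julia
# Hypothetical: assign a group number to each distinct key of the first table,
# then record which rows of the second table fall into each of those groups.
function rows_per_group(first_keys, second_keys)
    group_of = Dict{eltype(first_keys), Int}()
    for k in first_keys
        # New keys get the next free group number; known keys are left as-is.
        get!(group_of, k, length(group_of) + 1)
    end
    matches = [Int[] for _ in 1:length(group_of)]
    for (j, k) in enumerate(second_keys)
        g = get(group_of, k, 0)
        # Rows of the second table whose key is absent from the first are skipped.
        g == 0 || push!(matches[g], j)
    end
    return group_of, matches
end

# Row count of a left join: every row of the first table appears once per
# matching row of the second table, or once when it has no match.
function leftjoin_length(first_keys, group_of, matches)
    return sum(max(1, length(matches[group_of[k]])) for k in first_keys)
end

first_keys = ["a", "b", "a", "c"]
second_keys = ["a", "c", "a", "d"]
group_of, matches = rows_per_group(first_keys, second_keys)
n = leftjoin_length(first_keys, group_of, matches)
```

With such a mapping, the size the first data frame must be resized to follows directly from the lengths of the per-group vectors, without a separate grouping pass over the second data frame.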
I very frequently find myself needing to do
It's not that big a deal to have a new dataframe (after all, the underlying columns weren't copied) but I find myself having to do this so often (this is probably one of the most commonly used operations for me) that I would really love a mutating function for this. Maybe something like
I started to look into what would be involved in doing this, but _join is pretty monolithic so it wasn't immediately obvious. I'd think that the biggest subtlety would be whether the ordering of rows in df can change (but I don't think it can even as things are?).
Thoughts? Am I the only one who does this all the time?