PX: Avoid `groupby` when possible and access groups more efficiently #3765
Conversation
I like your implementation! However, the reason for my (more complex) implementation is to also optimize for the cases when you pass a 1D array instead of a dataframe (e.g., …). This groupby stems from the … To also support the case mentioned above, I see 2 options.

I think the first option is actually better, as (1) it avoids the (minimal) compute overhead of checking whether the groupers are all the same group, and (2) it saves memory, as no unnecessary "variable" column will be added to the …
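For context, a minimal sketch (not PX's actual code) of what a constant "one group" grouper does in pandas: grouping by a constant key puts every row into a single group, so the `groupby` machinery runs without actually partitioning anything.

```python
import pandas as pd

df = pd.DataFrame({"y": [1, 2, 3, 4]})
one_group = lambda _: ""  # constant key: every row maps to the same group

# The groupby still runs its full machinery, but yields one trivial group
groups = dict(iter(df.groupby(one_group)))
print(list(groups.keys()))     # a single "" key
# the lone group is just the original frame, so the groupby was avoidable
print(groups[""].equals(df))   # True
```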
Indeed, it handles it now perfectly when you pass the keyword argument(s) (i.e., …)
Ah I see, I didn't think about the distinction between …
This single-vector case is not a case I had considered very carefully when implementing the "wide form data input" which does the automatic …
OK, so if you want to pass in a single vector to the … I have a better sense now of what you're trying to do in #3761, and if we can find a way to do the same thing without having two places where we do the (already-messy-on-…
Right, but we already do the …
```diff
-if col != one_group:
+if col == one_group:
+    single_group_name.append("")
+else:
+    uniques = list(args["data_frame"][col].unique())
```
For the following snippet this line is executed twice for the exact same col ("variable"):

```python
import plotly.express as px

px.line([1, 2, 3, 4])
```

I guess that the second computation could be avoided? (This is why I used `set(grouper)` to construct the order dictionary.)
We could memoize the results of this loop as a function of `col`, yeah.
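A minimal sketch of that idea (a hypothetical helper, not the actual patch): cache the `.unique()` result per column name so a grouper that appears twice, like "variable", is only scanned once.

```python
import pandas as pd

def make_unique_cache(df):
    """Cache .unique() per column name so repeated groupers are scanned once."""
    cache = {}

    def uniques(col):
        if col not in cache:
            cache[col] = list(df[col].unique())
        return cache[col]

    return uniques

df = pd.DataFrame({"variable": ["y", "y", "z"], "value": [1, 2, 3]})
uniques = make_unique_cache(df)
first = uniques("variable")   # computed: ["y", "z"]
second = uniques("variable")  # served from the cache, no second scan
print(first is second)        # True
```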
(done)
(OK, but it fails tests. I'll keep plugging away :) )
Looks great @nicolaskruchten! The only remark I have is that we still perform the very expensive & unnecessary …
I implemented a fix for the unnecessary …
🧹 avoid unnecessary one_group groupby operations
Wow, that makes a huge difference in performance, indeed, all across every PX function 🎉 🐎 🏁
It makes a huge difference indeed!! 🚀 I'm really liking this fast constructor time of …
Amazing :) I must say, yesterday's back-and-forth was quite fun, and I'm very pleased that you were willing to dive so deep into the PX code to help out, so thank you. Fun fact: the …
I liked our iterative development process as well, I had a lot of fun diving into the plotly code (after using its interface for such a long time)! It was an honor to further improve the PX code together with its godfather ;) This is exactly why I love open source! ❤️
so I'll do a bit more QA on this branch before merging and releasing, but I'll try to do a release early next week with these improvements. I have to do a 5.8.2 release right now, so I'll target 5.8.3 for this one!
Alternative implementation of #3761 to speed up no-`groupby`-needed cases by about 10x, plus extra work to make accessing the output of `groupby` about 3x-4x as fast by enumerating group keys via `groupby().indices` rather than `groupby().groups`, and by skipping the `one_group` lambda groupings altogether.

Note: these speedups apply only to the data-manipulation portion of PX, so they will not really be felt much for smaller datasets, and will not speed up transfer from server to client or client-side rendering :)