Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PX: Avoid groupby when possible and access groups more efficiently #3765

Merged
merged 15 commits into from
Jun 23, 2022

Conversation

nicolaskruchten
Copy link
Contributor

@nicolaskruchten nicolaskruchten commented Jun 8, 2022

Alternative implementation of #3761 to speed up no-groupby-needed cases by about 10x, plus extra work done to access the output of groupby about 3x-4x as fast by enumerating group keys via groupby().indices rather than .groupby().groups and by skipping the one_group lambda groupings altogether.

Note: These speedups apply only to the data-manipulation portion of PX and so will not really be felt much for smaller datasets, and will not speed up transfer from server to client or client-side rendering :)

@jvdd
Copy link
Contributor

jvdd commented Jun 9, 2022

I like your implementation! However, the reason for my (more complex) implementation is to also optimize for the cases when you pass a 1D array instead of a dataframe (e.g., px.scatter(np.arange(2_000_000))).
-> in this particular case 94% of the runtime is used for a (unnecessary) groupby!

This groupby stems from the variable grouper in the grouped_mappings -> 1 (or even multiple) grouper != one_group -> unnecessary groupby operation.
In the previous PR #3761, I observed thst checking whether a groupby is necessary is multitudes more efficient (i.e., adds very little overhead) than peforming an unnecessary groupby.

To also support the case mentioned above, I see 2 options;

  • change the dataframe creation in build_dataframe or change the creation of grouped_mappings in infer_config, as such that passing a 1D array does not result in adding the unnecessary "variable" column in the grouped_mappings
  • check for having all the same group instead of all one group (I did this in this new PR ♻️ check for all same groups #3767)

I think the first option is actually better as (1) it avoids the (minimal) compute overhead for checking whether groupers are all the same group, and (2) it saves memory as no unnecessary "variable" column will be added to the args["data_frame"].
Yet, I implemented the 2nd option as I am not really sure if the required changes for the first option make more sense / could be more duplication.. (+ I'm not really confident in my understanding of how these 2 functions work 🙃 )

@nicolaskruchten
Copy link
Contributor Author

So this PR does in fact handle the "single vector" case just as well as yours I believe:

image

@jvdd
Copy link
Contributor

jvdd commented Jun 9, 2022

So this PR does in fact handle the "single vector" case just as well as yours I believe:

image

Indeed, it handles it now perfectly when you pass they keyword argument(s) (i.e., x=... and/or y=...). But calling px.scatter(x) is not handled in this PR (and is handled in #3767)

@nicolaskruchten
Copy link
Contributor Author

Ah I see, I didn't think about the distinction between x= and data_frame=

@nicolaskruchten
Copy link
Contributor Author

This single-vector case is not a case I had considered very carefully when implementing the "wide form data input" which does the automatic variable insertion.

@nicolaskruchten
Copy link
Contributor Author

OK, so if you want to pass in a single vector to the data_frame argument and you want to skip the automatic color="variable" assignment, there is already a built-in way to do that with the awkward and verbose px.NO_COLOR:

image

I have a better sense now of what you're trying to do in #3761 and if we can find a way to do the same thing without having two places where we do the (already-messy-on-master!) computation of sorted_group_names in different ways, I'd be happy to accept the bigger change.

@nicolaskruchten
Copy link
Contributor Author

@jvdd in this commit I think I've been able to reuse the existing unique() call we make to compute, in effect, all_same_group without doing any additional work compared to what we do on master anyway: 3ae0645

@jvdd
Copy link
Contributor

jvdd commented Jun 9, 2022

Looks good!

Just benchmarked the same value check that I used against .unique and for this test case (2M data points) it seems not to be really worth it (it results in very minimal gains as .unique is significantly faster when all values are the same).
image

@nicolaskruchten
Copy link
Contributor Author

Right, but we already do the .unique() call right now, so my reasoning is that when the vector is *not *unique, we're not incurring any additional performance hit from this optimization.

if col != one_group:
if col == one_group:
single_group_name.append("")
else:
uniques = list(args["data_frame"][col].unique())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the following snippet this line is executed twice for the exact same col ("variable");

import plotly.express as px
px.line([1, 2, 3, 4])

I guess that the second computation could be avoided? (this is why I used set(grouper) to construct the order dictionary)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could memoize the results of this loop as a function of col, yeah

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done)

@nicolaskruchten
Copy link
Contributor Author

So your comments about .groups and .get_group() in #3761 made me take a look at how I'm actually accessing the groups and I think that 067d4b0 is about 2x faster.

@nicolaskruchten nicolaskruchten changed the title short circuit more machinery when one_group short-circuit more machinery when one_group Jun 9, 2022
@nicolaskruchten
Copy link
Contributor Author

(OK, but it fails tests. I'll keep plugging away :) )

@nicolaskruchten nicolaskruchten changed the title short-circuit more machinery when one_group PX: Avoid groupby when possible and access groups more efficiently Jun 9, 2022
@jvdd
Copy link
Contributor

jvdd commented Jun 9, 2022

Looks great @nicolaskruchten! Only remark that I have is that we still perform the very expensive & unnecessary one_group grouby operations when at least one grouper results in multiple groups
The benchmark below illustrates how expensive the one_group groupby is - I believe that due constant function evaluation it gets really slow for large dataframes

image

@jvdd
Copy link
Contributor

jvdd commented Jun 9, 2022

I implemented a fix for the unnecessary one_group group by operations in b3a4583

🧹 avoid unnecessary one_group groupby operations
@nicolaskruchten
Copy link
Contributor Author

Wow, that makes a huge difference in performance, indeed, all across every PX function 🎉 🐎 🏁

@jvdd
Copy link
Contributor

jvdd commented Jun 10, 2022

It makes a huge difference indeed!! 🚀 I'm really liking this fast constructor time of plotly.express! I can now visualize 1M datapoints by using the plotly.express interface together with plotly-resampler its data aggregation (under the hood) in less than 1 second (includes data manipulation & rendering time) 🤯 This is a 10x improvement from the time that it takes with plotly master version (v5.8.1) :)

image

@nicolaskruchten
Copy link
Contributor Author

Amazing :)

I must say, yesterday's back and forth was quite fun, and I'm very pleased that you were willing to dive so deep into the PX code to help out, so thank you. Fun fact: the one_group hack that we unwound for good yesterday was the very first code I wrote when I began the Plotly Express project in December 2018, so it was very satisfying to finally clean it up!

@jvdd
Copy link
Contributor

jvdd commented Jun 10, 2022

I liked our iterative development process as well, I had a lot of fun diving into the plotly code (after using its interface for such a long time)! It was an honor to further improve the PX code together with its godfather ;)
I want to thank you as well for the quick & relevant responses + (obviously) for all the great work and your time that you put into this toolkit!

This is exactly why I love open source! ❤️

@nicolaskruchten
Copy link
Contributor Author

so I'll do a bit more QA on this branch before merging and releasing, but I'll try to do a release early next week with these improvements. I have to do a 5.8.2 release right now, so I'll target 5.8.3 for this one!

@nicolaskruchten nicolaskruchten merged commit f83921f into master Jun 23, 2022
@archmoj archmoj deleted the one_group_short_circuit branch March 22, 2024 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants