PX: Avoid `groupby` when possible and access groups more efficiently #3765

nicolaskruchten · 2022-06-08T20:22:40Z

Alternative implementation of #3761 to speed up no-groupby-needed cases by about 10x, plus extra work done to access the output of groupby about 3x-4x as fast by enumerating group keys via groupby().indices rather than .groupby().groups and by skipping the one_group lambda groupings altogether.

Note: These speedups apply only to the data-manipulation portion of PX and so will not really be felt much for smaller datasets, and will not speed up transfer from server to client or client-side rendering :)

jvdd · 2022-06-09T09:03:28Z

I like your implementation! However, the reason for my (more complex) implementation is to also optimize for the cases when you pass a 1D array instead of a dataframe (e.g., px.scatter(np.arange(2_000_000))).
-> in this particular case 94% of the runtime is used for a (unnecessary) groupby!

This groupby stems from the variable grouper in the grouped_mappings -> 1 (or even multiple) grouper != one_group -> unnecessary groupby operation.
In the previous PR #3761, I observed thst checking whether a groupby is necessary is multitudes more efficient (i.e., adds very little overhead) than peforming an unnecessary groupby.

To also support the case mentioned above, I see 2 options;

change the dataframe creation in build_dataframe or change the creation of grouped_mappings in infer_config, as such that passing a 1D array does not result in adding the unnecessary "variable" column in the grouped_mappings
check for having all the same group instead of all one group (I did this in this new PR ♻️ check for all same groups #3767)

I think the first option is actually better as (1) it avoids the (minimal) compute overhead for checking whether groupers are all the same group, and (2) it saves memory as no unnecessary "variable" column will be added to the args["data_frame"].
Yet, I implemented the 2nd option as I am not really sure if the required changes for the first option make more sense / could be more duplication.. (+ I'm not really confident in my understanding of how these 2 functions work 🙃 )

nicolaskruchten · 2022-06-09T13:55:12Z

So this PR does in fact handle the "single vector" case just as well as yours I believe:

jvdd · 2022-06-09T14:03:33Z

So this PR does in fact handle the "single vector" case just as well as yours I believe:

Indeed, it handles it now perfectly when you pass they keyword argument(s) (i.e., x=... and/or y=...). But calling px.scatter(x) is not handled in this PR (and is handled in #3767)

nicolaskruchten · 2022-06-09T14:15:47Z

Ah I see, I didn't think about the distinction between x= and data_frame=

nicolaskruchten · 2022-06-09T14:18:38Z

This single-vector case is not a case I had considered very carefully when implementing the "wide form data input" which does the automatic variable insertion.

nicolaskruchten · 2022-06-09T14:37:59Z

OK, so if you want to pass in a single vector to the data_frame argument and you want to skip the automatic color="variable" assignment, there is already a built-in way to do that with the awkward and verbose px.NO_COLOR:

I have a better sense now of what you're trying to do in #3761 and if we can find a way to do the same thing without having two places where we do the (already-messy-on-master!) computation of sorted_group_names in different ways, I'd be happy to accept the bigger change.

♻️ check for all same groups

nicolaskruchten · 2022-06-09T15:07:52Z

@jvdd in this commit I think I've been able to reuse the existing unique() call we make to compute, in effect, all_same_group without doing any additional work compared to what we do on master anyway: 3ae0645

jvdd · 2022-06-09T15:51:24Z

Looks good!

Just benchmarked the same value check that I used against .unique and for this test case (2M data points) it seems not to be really worth it (it results in very minimal gains as .unique is significantly faster when all values are the same).

nicolaskruchten · 2022-06-09T15:53:39Z

Right, but we already do the .unique() call right now, so my reasoning is that when the vector is *not *unique, we're not incurring any additional performance hit from this optimization.

jvdd · 2022-06-09T15:55:16Z

packages/python/plotly/plotly/express/_core.py

-        if col != one_group:
+        if col == one_group:
+            single_group_name.append("")
+        else:
            uniques = list(args["data_frame"][col].unique())


For the following snippet this line is executed twice for the exact same col ("variable");

import plotly.express as px px.line([1, 2, 3, 4])

I guess that the second computation could be avoided? (this is why I used set(grouper) to construct the order dictionary)

We could memoize the results of this loop as a function of col, yeah

nicolaskruchten · 2022-06-09T16:53:51Z

So your comments about .groups and .get_group() in #3761 made me take a look at how I'm actually accessing the groups and I think that 067d4b0 is about 2x faster.

nicolaskruchten · 2022-06-09T17:37:21Z

(OK, but it fails tests. I'll keep plugging away :) )

jvdd · 2022-06-09T22:00:53Z

Looks great @nicolaskruchten! Only remark that I have is that we still perform the very expensive & unnecessary one_group grouby operations when at least one grouper results in multiple groups
The benchmark below illustrates how expensive the one_group groupby is - I believe that due constant function evaluation it gets really slow for large dataframes

jvdd · 2022-06-09T22:04:29Z

I implemented a fix for the unnecessary one_group group by operations in b3a4583

🧹 avoid unnecessary one_group groupby operations

nicolaskruchten · 2022-06-10T01:33:06Z

Wow, that makes a huge difference in performance, indeed, all across every PX function 🎉 🐎 🏁

jvdd · 2022-06-10T12:19:28Z

It makes a huge difference indeed!! 🚀 I'm really liking this fast constructor time of plotly.express! I can now visualize 1M datapoints by using the plotly.express interface together with plotly-resampler its data aggregation (under the hood) in less than 1 second (includes data manipulation & rendering time) 🤯 This is a 10x improvement from the time that it takes with plotly master version (v5.8.1) :)

nicolaskruchten · 2022-06-10T14:19:43Z

Amazing :)

I must say, yesterday's back and forth was quite fun, and I'm very pleased that you were willing to dive so deep into the PX code to help out, so thank you. Fun fact: the one_group hack that we unwound for good yesterday was the very first code I wrote when I began the Plotly Express project in December 2018, so it was very satisfying to finally clean it up!

jvdd · 2022-06-10T14:55:49Z

I liked our iterative development process as well, I had a lot of fun diving into the plotly code (after using its interface for such a long time)! It was an honor to further improve the PX code together with its godfather ;)
I want to thank you as well for the quick & relevant responses + (obviously) for all the great work and your time that you put into this toolkit!

This is exactly why I love open source! ❤️

nicolaskruchten · 2022-06-10T15:01:52Z

so I'll do a bit more QA on this branch before merging and releasing, but I'll try to do a release early next week with these improvements. I have to do a 5.8.2 release right now, so I'll target 5.8.3 for this one!

short circuit more machinery when one_group

9941749

nicolaskruchten mentioned this pull request Jun 8, 2022

⚡ avoid expensive & unnecessary groupby in px #3761

Closed

5 tasks

nicolaskruchten and others added 2 commits June 8, 2022 21:36

more correct optimization

b97d197

♻️ check for all same groups

6640d8f

jvdd mentioned this pull request Jun 9, 2022

♻️ check for all same groups #3767

Merged

jvdd and others added 3 commits June 9, 2022 16:40

🐛 make orders and sorted_group_names backwards compatible

73b3c7c

Merge pull request #3767 from jvdd/one_group_short_circuit

dd2137b

♻️ check for all same groups

reorder computation to avoid an extra scan through the data

3ae0645

jvdd reviewed Jun 9, 2022

View reviewed changes

optimize group access

067d4b0

nicolaskruchten changed the title ~~short circuit more machinery when one_group~~ short-circuit more machinery when one_group Jun 9, 2022

nicolaskruchten added 2 commits June 9, 2022 13:18

slicing seems ambitious

9498396

trying again with groups

35dbbe8

possible fix

cdae77c

nicolaskruchten changed the title ~~short-circuit more machinery when one_group~~ PX: Avoid groupby when possible and access groups more efficiently Jun 9, 2022

🧹 avoid unnecessary one_group groupby operations

b3a4583

Merge pull request #3769 from jvdd/one_group_short_circuit

3043324

🧹 avoid unnecessary one_group groupby operations

nicolaskruchten added 3 commits June 23, 2022 09:38

Merge branch 'master' into one_group_short_circuit

f97f2d2

changelog

f46f809

Merge branch 'master' into one_group_short_circuit

4b22199

nicolaskruchten merged commit f83921f into master Jun 23, 2022

archmoj deleted the one_group_short_circuit branch March 22, 2024 15:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PX: Avoid `groupby` when possible and access groups more efficiently #3765

PX: Avoid `groupby` when possible and access groups more efficiently #3765

nicolaskruchten commented Jun 8, 2022 •

edited

Loading

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd Jun 9, 2022

nicolaskruchten Jun 9, 2022

nicolaskruchten Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022 •

edited

Loading

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 10, 2022

jvdd commented Jun 10, 2022 •

edited

Loading

nicolaskruchten commented Jun 10, 2022

jvdd commented Jun 10, 2022

nicolaskruchten commented Jun 10, 2022

PX: Avoid groupby when possible and access groups more efficiently #3765

PX: Avoid groupby when possible and access groups more efficiently #3765

Conversation

nicolaskruchten commented Jun 8, 2022 • edited Loading

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd Jun 9, 2022

Choose a reason for hiding this comment

nicolaskruchten Jun 9, 2022

Choose a reason for hiding this comment

nicolaskruchten Jun 9, 2022

Choose a reason for hiding this comment

nicolaskruchten commented Jun 9, 2022

nicolaskruchten commented Jun 9, 2022

jvdd commented Jun 9, 2022 • edited Loading

jvdd commented Jun 9, 2022

nicolaskruchten commented Jun 10, 2022

jvdd commented Jun 10, 2022 • edited Loading

nicolaskruchten commented Jun 10, 2022

jvdd commented Jun 10, 2022

nicolaskruchten commented Jun 10, 2022

PX: Avoid `groupby` when possible and access groups more efficiently #3765

PX: Avoid `groupby` when possible and access groups more efficiently #3765

nicolaskruchten commented Jun 8, 2022 •

edited

Loading

jvdd commented Jun 9, 2022 •

edited

Loading

jvdd commented Jun 10, 2022 •

edited

Loading