-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
⚡ avoid expensive & unnecessary groupby in px #3761
Conversation
Thanks for this PR, it looks like a promising optimization! I certainly like the idea of skipping the grouping when it's all |
In the latest commit I did some minor refactoring (renaming variables) + improved the efficiency of the groupby check that I added (replaced I checked the performance for the following plot; (is this what you intended?) import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame(np.random.randn(5_000, 1), columns=["data"])
df["c"] = np.tile(np.arange(20), 5_000 // 20) # 20 colors
df["s"] = np.tile(np.arange(5), 5_000 // 5) # 5 symbols
fig = px.scatter(df, y="data", color="c", symbol="s") The checks that I added for grouping (i.e., creating the %lprun -f px._core.make_figure px.scatter(df, y="data", color="c", symbol="s") |
# no groups to group by, or when the group has only one (i.e., the same) value | ||
grouper = [g for g in grouper_ if g is not one_group] | ||
assert len(grouper) <= 1 | ||
# -> create orders, sorted_group_names equivalent to those of get_ordings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the optimization where we don't group if the groupers are all one_group
but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol
but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.
I think that most of the value of this PR comes from the "all one_group
" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.
# no groups to group by, or when the group has only one (i.e., the same) value | ||
grouper = [g for g in grouper_ if g is not one_group] | ||
assert len(grouper) <= 1 | ||
# -> create orders, sorted_group_names equivalent to those of get_ordings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the optimization where we don't group if the groupers are all one_group
but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol
but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.
I think that most of the value of this PR comes from the "all one_group
" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In digging into this a bit more, I see/remember that we actually call .uniques()
on all non-one_group
groups in the awkward get_orderings()
function anyway, so if we did want to do this other optimization, maybe (re-)inlining the code from get_orderings()
into make_figure()
and interleaving this check would make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like your implementation in #3765?
Superseded by #3765 |
This PR enables significant speed-ups for
plotly.express
when no groupby operation is required. This PR stems from the observations I made in #1743; current behavior always performs at least one groupby, even when this is not necessary (e.g.,px.scatter
of a 1D array).Performing such a groupby is a rather expensive operation + the
.get_group
and theget_ordenings
operations are also rather time-consuming. This PR checks via theall_same
variable whether grouping is required. If no grouping is necessary, no groupby will be performed, resulting in significant performance gains.Illustration of the speed-up
plotly 5.8.0
this PR
Can you validate that this speed-up is correct?
(FYI: all pytest tests succeed locally)
Code PR checklist
plotly.graph_objects
, my modifications concern thecodegen
files and not generated files.