Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚡ avoid expensive & unnecessary groupby in px #3761

Closed
wants to merge 3 commits into from

Conversation

jvdd
Copy link
Contributor

@jvdd jvdd commented Jun 6, 2022

This PR enables significant speed-ups for plotly.express when no groupby operation is required. This PR stems from the observations I made in #1743; current behavior always performs at least one groupby, even when this is not necessary (e.g., px.scatter of a 1D array).

Performing such a groupby is a rather expensive operation + the .get_group and the get_ordenings operations are also rather time-consuming. This PR checks via the all_same variable whether grouping is required. If no grouping is necessary, no groupby will be performed, resulting in significant performance gains.

Illustration of the speed-up

plotly 5.8.0
image

this PR
image

Can you validate that this speed-up is correct?
(FYI: all pytest tests succeed locally)

Code PR checklist

  • I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
  • I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
  • For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
  • I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
  • For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

@nicolaskruchten
Copy link
Contributor

Thanks for this PR, it looks like a promising optimization!

I certainly like the idea of skipping the grouping when it's all one_group... When plotting, say, a 5000-row dataframe with 20 colors and 5 symbols or something, what is the performance hit here of the nunique call?

@jvdd
Copy link
Contributor Author

jvdd commented Jun 7, 2022

Hi @nicolaskruchten

In the latest commit I did some minor refactoring (renaming variables) + improved the efficiency of the groupby check that I added (replaced .nunique with the suggestion from here https://stackoverflow.com/a/54405767)

I checked the performance for the following plot; (is this what you intended?)

import pandas as pd
import numpy as np
import plotly.express as px

df = pd.DataFrame(np.random.randn(5_000, 1), columns=["data"])
df["c"] = np.tile(np.arange(20), 5_000 // 20)  # 20 colors
df["s"] = np.tile(np.arange(5), 5_000 // 5)  # 5 symbols

fig = px.scatter(df, y="data", color="c", symbol="s")

The checks that I added for grouping (i.e., creating the all_same_group variable) take 0.1-0.2% of the total time of make_figure.

%lprun -f px._core.make_figure px.scatter(df, y="data", color="c", symbol="s")

# no groups to group by, or when the group has only one (i.e., the same) value
grouper = [g for g in grouper_ if g is not one_group]
assert len(grouper) <= 1
# -> create orders, sorted_group_names equivalent to those of get_ordings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.

I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.

# no groups to group by, or when the group has only one (i.e., the same) value
grouper = [g for g in grouper_ if g is not one_group]
assert len(grouper) <= 1
# -> create orders, sorted_group_names equivalent to those of get_ordings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.

I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In digging into this a bit more, I see/remember that we actually call .uniques() on all non-one_group groups in the awkward get_orderings() function anyway, so if we did want to do this other optimization, maybe (re-)inlining the code from get_orderings() into make_figure() and interleaving this check would make more sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like your implementation in #3765?

@nicolaskruchten
Copy link
Contributor

Superseded by #3765

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants