⚡ avoid expensive & unnecessary groupby in px #3761

jvdd · 2022-06-06T18:08:59Z

This PR enables significant speed-ups for plotly.express when no groupby operation is required. This PR stems from the observations I made in #1743; current behavior always performs at least one groupby, even when this is not necessary (e.g., px.scatter of a 1D array).

Performing such a groupby is a rather expensive operation + the .get_group and the get_ordenings operations are also rather time-consuming. This PR checks via the all_same variable whether grouping is required. If no grouping is necessary, no groupby will be performed, resulting in significant performance gains.

Illustration of the speed-up

plotly 5.8.0

this PR

Can you validate that this speed-up is correct?
(FYI: all pytest tests succeed locally)

Code PR checklist

I have read through the contributing notes and understand the structure of the package. In particular, if my PR modifies code of plotly.graph_objects, my modifications concern the codegen files and not generated files.
I have added tests (if submitting a new feature or correcting a bug) or modified existing tests.
For a new feature, I have added documentation examples in an existing or new tutorial notebook (please see the doc checklist as well).
I have added a CHANGELOG entry if fixing/changing/adding anything substantial.
For a new feature or a change in behaviour, I have updated the relevant docstrings in the code to describe the feature or behaviour (please see the doc checklist as well).

nicolaskruchten · 2022-06-06T21:21:30Z

Thanks for this PR, it looks like a promising optimization!

I certainly like the idea of skipping the grouping when it's all one_group... When plotting, say, a 5000-row dataframe with 20 colors and 5 symbols or something, what is the performance hit here of the nunique call?

jvdd · 2022-06-07T08:52:01Z

Hi @nicolaskruchten

In the latest commit I did some minor refactoring (renaming variables) + improved the efficiency of the groupby check that I added (replaced .nunique with the suggestion from here https://stackoverflow.com/a/54405767)

I checked the performance for the following plot; (is this what you intended?)

import pandas as pd
import numpy as np
import plotly.express as px

df = pd.DataFrame(np.random.randn(5_000, 1), columns=["data"])
df["c"] = np.tile(np.arange(20), 5_000 // 20)  # 20 colors
df["s"] = np.tile(np.arange(5), 5_000 // 5)  # 5 symbols

fig = px.scatter(df, y="data", color="c", symbol="s")

The checks that I added for grouping (i.e., creating the all_same_group variable) take 0.1-0.2% of the total time of make_figure.

%lprun -f px._core.make_figure px.scatter(df, y="data", color="c", symbol="s")

nicolaskruchten · 2022-06-08T20:23:44Z

packages/python/plotly/plotly/express/_core.py

+        # no groups to group by, or when the group has only one (i.e., the same) value
+        grouper = [g for g in grouper_ if g is not one_group]
+        assert len(grouper) <= 1
+        # -> create orders, sorted_group_names equivalent to those of get_ordings


I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.

I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.

nicolaskruchten · 2022-06-08T20:23:44Z

packages/python/plotly/plotly/express/_core.py

+        # no groups to group by, or when the group has only one (i.e., the same) value
+        grouper = [g for g in grouper_ if g is not one_group]
+        assert len(grouper) <= 1
+        # -> create orders, sorted_group_names equivalent to those of get_ordings


I love the optimization where we don't group if the groupers are all one_group but I'm worried about the duplication of logic here... If we just do the optimization when no grouping is requested at all (i.e. leave out the case when we do have e.g. symbol but there's only one value) then this PR can be less invasive/have less duplication. I've done an implementation of that over here #3765 if you want to take a look at it.

I think that most of the value of this PR comes from the "all one_group" case, and that leaving out the case where there happens to only be one group would be OK for now as this is likely not all that common, but I'm not dead set on it.

In digging into this a bit more, I see/remember that we actually call .uniques() on all non-one_group groups in the awkward get_orderings() function anyway, so if we did want to do this other optimization, maybe (re-)inlining the code from get_orderings() into make_figure() and interleaving this check would make more sense?

Do you mean like your implementation in #3765?

nicolaskruchten · 2022-06-10T15:09:06Z

Superseded by #3765

jvdd added 2 commits June 6, 2022 19:35

⚡ avoid expensive & unnecessary groupby in px

fa24c1a

🙈 fix formatting

9e93ac8

⚡ faster check for all same values + refactor code

958014a

nicolaskruchten mentioned this pull request Jun 8, 2022

PX: Avoid groupby when possible and access groups more efficiently #3765

Merged

nicolaskruchten reviewed Jun 8, 2022

View reviewed changes

nicolaskruchten mentioned this pull request Jun 9, 2022

♻️ check for all same groups #3767

Merged

nicolaskruchten closed this Jun 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡ avoid expensive & unnecessary groupby in px #3761

⚡ avoid expensive & unnecessary groupby in px #3761

jvdd commented Jun 6, 2022 •

edited

Loading

nicolaskruchten commented Jun 6, 2022

jvdd commented Jun 7, 2022 •

edited

Loading

nicolaskruchten Jun 8, 2022

nicolaskruchten Jun 8, 2022

nicolaskruchten Jun 9, 2022

jvdd Jun 9, 2022

nicolaskruchten commented Jun 10, 2022

⚡ avoid expensive & unnecessary groupby in px #3761

⚡ avoid expensive & unnecessary groupby in px #3761

Conversation

jvdd commented Jun 6, 2022 • edited Loading

Illustration of the speed-up

Code PR checklist

nicolaskruchten commented Jun 6, 2022

jvdd commented Jun 7, 2022 • edited Loading

nicolaskruchten Jun 8, 2022

Choose a reason for hiding this comment

nicolaskruchten Jun 8, 2022

Choose a reason for hiding this comment

nicolaskruchten Jun 9, 2022

Choose a reason for hiding this comment

jvdd Jun 9, 2022

Choose a reason for hiding this comment

nicolaskruchten commented Jun 10, 2022

jvdd commented Jun 6, 2022 •

edited

Loading

jvdd commented Jun 7, 2022 •

edited

Loading