
perf: optimize DataFrame.describe by presorting columns #13822

Merged

Conversation

taki-mekhalfa
Contributor

This PR addresses #9368.

Presorting columns makes describe much faster, especially when the user passes many percentiles.

The presort happens once; after that, every quantile/min/max on a numerical column is an O(1) lookup.
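
The effect can be illustrated with a minimal pure-Python sketch (illustrative only, not the Polars implementation): sort once at O(n log n), and every subsequent quantile, min, and max becomes an O(1) index lookup.

```python
# Pure-Python sketch of the presorting idea; the helper name and the
# "nearest rank" interpolation are illustrative, not Polars' actual strategy.
def quantile_on_sorted(sorted_vals: list[float], q: float) -> float:
    # On already-sorted data a quantile is a single index lookup.
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]

values = sorted([7.0, 1.0, 5.0, 3.0, 9.0])  # presort once: O(n log n)
stats = {
    "min": values[0],                        # O(1)
    "50%": quantile_on_sorted(values, 0.5),  # O(1)
    "max": values[-1],                       # O(1)
}
```

With many requested percentiles, the one-time sort amortizes across all of them, which is why the speedup grows with the number of percentiles.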

Here are some benchmarks on real sales data I am working on:

In [18]: df
Out[18]:
shape: (28_061_078, 10)
┌───────────────┬───────────────┬───────────────┬───┬──────────┬──────────┐
│ aaaaa_aaaa_aa ┆ bbbbb_bbbb_bb ┆ ccccc_cccc_cc ┆ … ┆ ddddd_ddd┆ eeee_eee │
│ ---           ┆ ---           ┆ ---           ┆   ┆ ---      ┆ ---      │
│ f64           ┆ f64           ┆ f64           ┆   ┆ f64      ┆ f64      │
╞═══════════════╪═══════════════╪═══════════════╪═══╪══════════╪══════════╡
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 33.38    ┆ 0.0      │
│ null          ┆ null          ┆ 0.409015      ┆ … ┆ 0.0      ┆ 15.55    │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 18.5     ┆ 0.0      │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 0.0      ┆ 17.96    │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 17.36    ┆ 0.0      │
│ …             ┆ …             ┆ …             ┆ … ┆ …        ┆ …        │
│ 77549.54      ┆ 10954.0       ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ 37.44         ┆ 36.0          ┆ null          ┆ … ┆ null     ┆ null     │
└───────────────┴───────────────┴───────────────┴───┴──────────┴──────────┘

In [19]: len(df.columns)
Out[19]: 10

In [12]: %timeit df.describe()
**1.77 s ± 121 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
**5.23 s ± 85 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.select(pl.all().sort()).describe()
**729 ms ± 44.2 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit df.select(pl.all().sort()).describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
**696 ms ± 25.4 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

@github-actions bot added the performance (Performance issues or improvements), python (Related to Python Polars), and rust (Related to Rust Polars) labels on Jan 18, 2024
@stinodego
Member

Thanks! The benchmarks don't really say much without a before/after though - could you do a timing with and without your change? Make sure to use a release build, e.g. make build-release

@alexander-beedie
Collaborator

alexander-beedie commented Jan 19, 2024

Also: what if the user selects no percentiles? :)

(Presorting seems likely to be a good idea though; if I recall correctly @orlp was thinking about this during discussions about the addition of n_unique to the default statistics).

@taki-mekhalfa
Contributor Author

> Thanks! The benchmarks don't really say much without a before/after though - could you do a timing with and without your change? Make sure to use a release build, e.g. make build-release

Here it is after a native release build:

In [9]: %timeit df.describe()
1.34 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
1.26 s ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to polars 0.20.5:

In [5]: %timeit df.describe()
1.57 s ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
5.14 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@taki-mekhalfa
Contributor Author

> Also: what if the user selects no percentiles? :)
>
> (This seems likely to be a good idea though; if I recall correctly @orlp was thinking about this during discussions about the potential addition of n_unique to the default stats).

In [11]: %timeit df.describe(percentiles=[])
743 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to polars 0.20.5:

In [8]: %timeit df.describe(percentiles=[])
267 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Maybe it's good to not sort if the user did not give any percentiles?

@alexander-beedie
Collaborator

> Maybe it's good to not sort if the user did not give any percentiles?

That was my suspicion; easy to fix though 😉👍

@orlp
Collaborator

orlp commented Jan 19, 2024

I think this is sensible - eventually we will be able to get rid of it again once I've gotten to my planned qcut/quantile rework. Then Polars will be able to efficiently compute multiple quantiles at once.

Please change it so that it only sorts if more than 1 quantile is requested though. And add a TODO note indicating it should be removed again once Polars properly supports getting multiple quantiles at once.
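
The requested guard can be sketched in plain Python (function and parameter names here are hypothetical, not the actual Polars internals):

```python
# Illustrative sketch of the guard being requested; names are hypothetical.
def maybe_presort(values: list[float], num_quantiles: int) -> list[float]:
    # TODO: remove once Polars properly supports getting multiple
    # quantiles at once without a full sort.
    if num_quantiles > 1:
        # Pay O(n log n) once so each quantile is an O(1) lookup.
        return sorted(values)
    # Zero or one quantile: skipping the sort avoids the regression seen
    # in the percentiles=[] benchmark above.
    return values
```

The threshold of more than one quantile matches the benchmarks: with no percentiles the sort is pure overhead, while with many percentiles its cost amortizes across all the lookups.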

@taki-mekhalfa
Contributor Author

I made the changes; can you please check?

In [10]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
1.26 s ± 73.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.describe(percentiles=[])
268 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Commit: By presorting numerical columns, quantiles/min/max will be O(1)
@alexander-beedie
Collaborator

alexander-beedie commented Jan 19, 2024

> I made the changes; can you please check?

Can you add a new test that adds coverage for percentiles=[]?
Looks like that case isn't currently validated.

See: py-polars/tests/unit/dataframe/test_describe.py

@taki-mekhalfa
Contributor Author

taki-mekhalfa commented Jan 19, 2024

> > I made the changes; can you please check?
>
> Can you add a new test that adds coverage for percentiles=[]? Looks like that case isn't currently validated.
>
> See: py-polars/tests/unit/dataframe/test_describe.py

There is already a test like that:

@pytest.mark.parametrize("pcts", [None, []])
def test_df_describe_no_percentiles(pcts: list[float] | None) -> None:
    ...

@alexander-beedie
Collaborator

> There is already a test like that:

Ahh, well spotted; that would cover it.

@ritchie46
Member

Is this one good to go @alexander-beedie ?

@alexander-beedie
Collaborator

alexander-beedie commented Jan 23, 2024

> Is this one good to go @alexander-beedie ?

@ritchie46: Yup, I don't mind rebasing my own current describe patch (which we need you to referee, lol) on top of it afterwards ;)

@alexander-beedie alexander-beedie merged commit 35e0223 into pola-rs:main Jan 23, 2024
11 checks passed
r-brink pushed a commit to r-brink/polars that referenced this pull request Jan 24, 2024

Co-authored-by: taki <taki.mekhalfa@dataimpact.io>