
perf: optimize DataFrame.describe by presorting columns #13822

Merged

Conversation

taki-mekhalfa
Contributor

This PR addresses #9368.

Presorting columns makes describe much faster, especially when the user passes many percentiles.

The presort happens once; after that, every quantile/min/max on a numerical column is an O(1) lookup.
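
The effect can be illustrated with a minimal pure-Python sketch (illustrative only, not the Polars implementation): sort once at O(n log n), and every subsequent quantile, min, and max becomes an O(1) index lookup.

```python
# Pure-Python sketch of the presorting idea; the helper name and the
# "nearest rank" interpolation are illustrative, not Polars' actual strategy.
def quantile_on_sorted(sorted_vals: list[float], q: float) -> float:
    # On already-sorted data a quantile is a single index lookup.
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]

values = sorted([7.0, 1.0, 5.0, 3.0, 9.0])  # presort once: O(n log n)
stats = {
    "min": values[0],                        # O(1)
    "50%": quantile_on_sorted(values, 0.5),  # O(1)
    "max": values[-1],                       # O(1)
}
```

With many requested percentiles, the one-time sort amortizes across all of them, which is why the speedup grows with the number of percentiles.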

Here are some benchmarks on real sales data I am working on:

In [18]: df
Out[18]:
shape: (28_061_078, 10)
┌───────────────┬───────────────┬───────────────┬───┬──────────┬──────────┐
│ aaaaa_aaaa_aa ┆ bbbbb_bbbb_bb ┆ ccccc_cccc_cc ┆ … ┆ ddddd_ddd┆ eeee_eee │
│ ---           ┆ ---           ┆ ---           ┆   ┆ ---      ┆ ---      │
│ f64           ┆ f64           ┆ f64           ┆   ┆ f64      ┆ f64      │
╞═══════════════╪═══════════════╪═══════════════╪═══╪══════════╪══════════╡
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 33.38    ┆ 0.0      │
│ null          ┆ null          ┆ 0.409015      ┆ … ┆ 0.0      ┆ 15.55    │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 18.5     ┆ 0.0      │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 0.0      ┆ 17.96    │
│ null          ┆ null          ┆ 0.0           ┆ … ┆ 17.36    ┆ 0.0      │
│ …             ┆ …             ┆ …             ┆ … ┆ …        ┆ …        │
│ 77549.54      ┆ 10954.0       ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ null          ┆ null          ┆ null          ┆ … ┆ null     ┆ null     │
│ 37.44         ┆ 36.0          ┆ null          ┆ … ┆ null     ┆ null     │
└───────────────┴───────────────┴───────────────┴───┴──────────┴──────────┘

In [19]: len(df.columns)
Out[19]: 10

In [12]: %timeit df.describe()
**1.77 s ± 121 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
**5.23 s ± 85 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit df.select(pl.all().sort()).describe()
**729 ms ± 44.2 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit df.select(pl.all().sort()).describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
**696 ms ± 25.4 ms** per loop (mean ± std. dev. of 7 runs, 1 loop each)

@github-actions bot added the performance (Performance issues or improvements), python (Related to Python Polars), and rust (Related to Rust Polars) labels on Jan 18, 2024
@stinodego
Member

Thanks! The benchmarks don't really say much without a before/after though - could you do a timing with and without your change? Make sure to use a release build, e.g. make build-release

@alexander-beedie
Collaborator

alexander-beedie commented Jan 19, 2024

Also: what if the user selects no percentiles? :)

(Presorting seems likely to be a good idea though; if I recall correctly @orlp was thinking about this during discussions about the addition of n_unique to the default statistics).

@taki-mekhalfa
Contributor Author

> Thanks! The benchmarks don't really say much without a before/after though - could you do a timing with and without your change? Make sure to use a release build, e.g. make build-release

Here it is after a native release build:

In [9]: %timeit df.describe()
1.34 s ± 105 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
1.26 s ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to polars 0.20.5:

In [5]: %timeit df.describe()
1.57 s ± 38.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
5.14 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@taki-mekhalfa
Contributor Author

> Also: what if the user selects no percentiles? :)
>
> (This seems likely to be a good idea though; if I recall correctly @orlp was thinking about this during discussions about the potential addition of n_unique to the default stats).

In [11]: %timeit df.describe(percentiles=[])
743 ms ± 57.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to polars 0.20.5:

In [8]: %timeit df.describe(percentiles=[])
267 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Maybe it's good to not sort if the user did not give any percentiles?

@alexander-beedie
Collaborator

> Maybe it's good to not sort if the user did not give any percentiles?

That was my suspicion; easy to fix though 😉👍

@orlp
Collaborator

orlp commented Jan 19, 2024

I think this is sensible - eventually we will be able to get rid of it again once I've gotten to my planned qcut/quantile rework. Then Polars will be able to efficiently compute multiple quantiles at once.

Please change it so that it only sorts if more than 1 quantile is requested though. And add a TODO note indicating it should be removed again once Polars properly supports getting multiple quantiles at once.
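
The requested guard can be sketched in plain Python (function and parameter names here are hypothetical, not the actual Polars internals):

```python
# Illustrative sketch of the guard being requested; names are hypothetical.
def maybe_presort(values: list[float], num_quantiles: int) -> list[float]:
    # TODO: remove once Polars properly supports getting multiple
    # quantiles at once without a full sort.
    if num_quantiles > 1:
        # Pay O(n log n) once so each quantile is an O(1) lookup.
        return sorted(values)
    # Zero or one quantile: skipping the sort avoids the regression seen
    # in the percentiles=[] benchmark above.
    return values
```

The threshold of more than one quantile matches the benchmarks: with no percentiles the sort is pure overhead, while with many percentiles its cost amortizes across all the lookups.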

@taki-mekhalfa
Contributor Author

I made the changes; can you please check?

In [10]: %timeit df.describe(percentiles=[0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99])
1.26 s ± 73.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit df.describe(percentiles=[])
268 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Commit: By presorting numerical columns, quantiles/min/max will be O(1)
@alexander-beedie
Collaborator

alexander-beedie commented Jan 19, 2024

> I made the changes; can you please check?

Can you add a new test that adds coverage for percentiles=[]?
Looks like that case isn't currently validated.

See: py-polars/tests/unit/dataframe/test_describe.py

@taki-mekhalfa
Contributor Author

taki-mekhalfa commented Jan 19, 2024

> > I made the changes; can you please check?
>
> Can you add a new test that adds coverage for percentiles=[]? Looks like that case isn't currently validated.
>
> See: py-polars/tests/unit/dataframe/test_describe.py

There is already a test like that:

@pytest.mark.parametrize("pcts", [None, []])
def test_df_describe_no_percentiles(pcts: list[float] | None) -> None:
    ...

@alexander-beedie
Collaborator

> There is already a test like that:

Ahh, well spotted; that would cover it.

@ritchie46
Member

Is this one good to go @alexander-beedie ?

@alexander-beedie
Collaborator

alexander-beedie commented Jan 23, 2024

> Is this one good to go @alexander-beedie ?

@ritchie46: Yup, I don't mind rebasing my own current describe patch (which we need you to referee, lol) on top of it afterwards ;)

@alexander-beedie alexander-beedie merged commit 35e0223 into pola-rs:main Jan 23, 2024
11 checks passed
r-brink pushed a commit to r-brink/polars that referenced this pull request Jan 24, 2024

Co-authored-by: taki <taki.mekhalfa@dataimpact.io>