perf: optimize DataFrame.describe
by presorting columns
#13822
Thanks! The benchmarks don't really say much without a before/after though - could you do a timing with and without your change? Make sure to use a release build, e.g. |
Also: what if the user selects no percentiles? :) (Presorting seems likely to be a good idea though; if I recall correctly @orlp was thinking about this during discussions about the addition of |
Here it is after a native release build:
Compared to polars
|
Compared to polars
Maybe it's good to not sort if the user did not give any percentiles? |
That was my suspicion; easy to fix though 😉👍 |
I think this is sensible - eventually we will be able to get rid of it again once I've gotten to my planned qcut/quantile rework. Then Polars will efficiently be able to support getting multiple quantiles at once. Please change it so that it only sorts if more than 1 quantile is requested though. And add a |
I made the changes; can you please check?
|
bb3500e to 6077838
By presorting numerical columns, quantiles/min/max will be O(1)
6077838 to db9a0bb
Can you add a new test that adds coverage for See: |
There is already a test like that:
@pytest.mark.parametrize("pcts", [None, []])
def test_df_describe_no_percentiles(pcts: list[float] | None) -> None:
    ...
|
Ahh, well spotted; that would cover it. |
Is this one good to go @alexander-beedie ? |
@ritchie46: Yup, I don't mind rebasing my own current |
This PR works on #9368
Presorting columns makes describe much faster, especially when the user requests many percentiles.
The presort happens once per column; after that, each quantile/min/max lookup on a numerical column is O(1).
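To illustrate the idea, here is a minimal sketch in plain Python. It is not the Polars implementation: `describe_sorted` is a hypothetical helper, and the nearest-rank rule used for the quantile index is an assumption made for the example. It only shows why, after one sort, min, max and every requested percentile become index lookups.

```python
import math


def describe_sorted(values, percentiles=(0.25, 0.5, 0.75)):
    """Summarise a numeric column by sorting it once (illustrative only)."""
    # Drop nulls and sort a single time: O(n log n).
    data = sorted(v for v in values if v is not None)
    if not data:
        return {}

    n = len(data)
    # On sorted data, min and max are simply the first and last elements.
    summary = {"min": data[0], "max": data[-1]}

    for p in percentiles:
        # Nearest-rank quantile (an assumption for this sketch):
        # each lookup is an O(1) index into the sorted list.
        idx = min(n - 1, max(0, math.ceil(p * n) - 1))
        summary[f"{round(p * 100)}%"] = data[idx]

    return summary


print(describe_sorted([5, 1, None, 4, 2, 3]))
```

With k percentiles, the cost is one sort plus k constant-time lookups, instead of k separate quantile computations over the unsorted column.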
Here are some benchmarks on real sales data I am working on: