Improve `DataFrame.describe` performance by sorting columns first #9368

braaannigan · 2023-06-14T10:35:35Z

Problem description

With improvements to describe we now calculate multiple statistics that have fast-track paths. If we called something like this inside describe

df.select(pl.all().sort()).describe()

We could use those fast-track paths. For my test it reduced time by half. The only changes in the results were very small changes to std for some columns (which itself is possibly a bug)

ritchie46 · 2023-06-16T08:26:56Z

Yeap, makes sense. This will make min/max/median/quantile after that free. Can you make a PR?

braaannigan · 2023-06-16T08:30:27Z

Sure - is this python only?

braaannigan · 2023-06-16T14:11:55Z

I see now it's python only

alexander-beedie · 2023-06-16T14:31:30Z

Wouldn't a 100 / 1,000 / 10,000 column sort be enormously expensive, counteracting the fast-paths? Or am I misunderstanding the nature of the optimisation? Or should we constrain a forced-sort to a certain maximum number of columns? 😅

braaannigan · 2023-06-16T14:39:47Z

@ritchie46
The string max/min are incorrect in fast-track mode when there are nulls, as the code just picks the first value rather than the first non-null value. For example:

import polars as pl

s = pl.Series("a", ["usd", "eur", None])
s.min()
# Gives "eur" (correct)
s.sort().min()
# Gives None, but should be "eur"
s.max()
# Gives "usd" (correct)
s.sort().max()
# Gives "usd" (correct)
# but...
s.sort(descending=True).max()
# Gives None, but should be "usd"

I think this is the underlying function in Rust:

    pub(crate) fn min_str(&self) -> Option<&str> {
        match self.is_sorted_flag() {
            IsSorted::Ascending => self.get(0),
            IsSorted::Descending => self.get(self.len() - 1),
            IsSorted::Not => self
                .downcast_iter()
                .filter_map(compute::aggregate::min_string)
                .fold_first_(|acc, v| if acc < v { acc } else { v }),
        }
    }

This obviously needs a first_non_null lookup. I've had a go at using the code from the non-string min and max, but I'm getting an error in the tests. I'll push what I've got as a draft PR.
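To illustrate the fix described above, here is a minimal Python sketch of a sorted fast path that skips nulls. It is not the Polars implementation: the helper names (`first_non_null`, `last_non_null`, `sorted_min`, `sorted_max`) are hypothetical, and plain lists with `None` stand in for a sorted string column.

```python
from typing import Optional, Sequence


def first_non_null(values: Sequence[Optional[str]]) -> Optional[str]:
    """Return the first value that is not None, scanning from the front."""
    for v in values:
        if v is not None:
            return v
    return None


def last_non_null(values: Sequence[Optional[str]]) -> Optional[str]:
    """Return the last value that is not None, scanning from the back."""
    for v in reversed(values):
        if v is not None:
            return v
    return None


def sorted_min(values: Sequence[Optional[str]], descending: bool = False) -> Optional[str]:
    # Ascending data holds its minimum at the front, descending at the back;
    # the null block at either end is skipped instead of being returned.
    return last_non_null(values) if descending else first_non_null(values)


def sorted_max(values: Sequence[Optional[str]], descending: bool = False) -> Optional[str]:
    return first_non_null(values) if descending else last_non_null(values)


# Assumed layout: nulls placed first in both sort directions.
ascending = [None, "eur", "usd"]
descending = [None, "usd", "eur"]
print(sorted_min(ascending), sorted_max(descending, descending=True))
```

The scan costs time proportional to the number of nulls at that end; a real implementation could instead use the known null count to jump straight to the right index.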

braaannigan · 2023-06-16T14:40:42Z

See #9400

braaannigan · 2023-06-19T14:12:28Z

> Wouldn't a 100 / 1,000 / 10,000 column sort be enormously expensive, counteracting the fast-paths? Or am I misunderstanding the nature of the optimisation? Or should we constrain a forced-sort to a certain maximum number of columns? 😅

My understanding is that all the percentile operations do a sort first anyway, and at present we sort before doing the median and the percentiles. If we are only doing the median and percentile operations for numeric columns, then we can limit the forced sort to the numeric columns. How does that sound @alexander-beedie?

ritchie46 · 2023-06-20T06:26:04Z

The quantiles do part of a sort, but will be O(1) if we pre-sort. I think that presorting is faster because we compute several quantiles and min/max aggregations. Those will all become O(1).
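The idea can be sketched in plain Python: sort once, then read every order statistic by index. This is only an illustration, not how Polars computes it. The function name `describe_sorted` is hypothetical, and the nearest-rank rule (round half up) is an assumed interpolation choice.

```python
from typing import Dict, Sequence


def describe_sorted(
    values: Sequence[float],
    quantiles: Sequence[float] = (0.25, 0.5, 0.75),
) -> Dict[str, float]:
    """Compute min, max and quantiles from a single sort."""
    if not values:
        raise ValueError("values must not be empty")
    data = sorted(values)  # the only O(n log n) step
    n = len(data)
    stats = {"min": data[0], "max": data[-1]}  # O(1) lookups
    for q in quantiles:
        # Nearest-rank index, rounding half up (an assumption for this sketch).
        idx = int(q * (n - 1) + 0.5)
        stats[f"{q:.0%}"] = data[idx]  # O(1) lookup per quantile
    return stats


print(describe_sorted([5, 1, 4, 2, 3]))
# → {'min': 1, 'max': 5, '25%': 2, '50%': 3, '75%': 4}
```

Each extra quantile adds only an index lookup, which is why the single up-front sort can pay off when several statistics are collected per column.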

alexander-beedie · 2023-06-20T06:34:57Z

Somehow I didn't spot earlier that this referred to describe, where this all makes much more sense (as you are actively collecting related metrics on every column) 🤣

braaannigan · 2023-06-20T08:22:14Z

Performance isn't looking good for this now that I'm testing it again. In my previous examples this approach was much faster; now it's slower. I'll keep looking into it.

stinodego · 2024-01-11T18:15:15Z

I'll accept a PR for this if there is a benchmark showing it's actually faster on realistic data.
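A benchmark along these lines could be sketched as follows. The helper names `best_time` and `bench_describe` are hypothetical, and the uniform random floats are only a placeholder assumption; realistic data would need real column types, nulls and value distributions.

```python
import random
import timeit
from typing import Callable, Dict


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest of `repeats` single-run timings, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeats))


def bench_describe(n_rows: int = 100_000, n_cols: int = 10) -> Dict[str, float]:
    """Time describe() with and without sorting every column first."""
    import polars as pl  # imported lazily so best_time has no dependencies

    rng = random.Random(0)  # fixed seed so runs are comparable
    df = pl.DataFrame(
        {f"c{i}": [rng.random() for _ in range(n_rows)] for i in range(n_cols)}
    )
    return {
        "describe": best_time(lambda: df.describe()),
        "sort_then_describe": best_time(
            lambda: df.select(pl.all().sort()).describe()
        ),
    }
```

Calling `bench_describe()` returns both timings in seconds. On Polars versions that already include #13822 the two numbers may be close, since `describe` would then presort internally.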

alexander-beedie · 2024-04-07T08:21:16Z

I think this was closed by #13822.

braaannigan added the enhancement label Jun 14, 2023

braaannigan mentioned this issue Jun 16, 2023

Attempted to fix string min/max for fast-track, not working #9400

Closed

billylanchantin mentioned this issue Jan 7, 2024

Update Polars to v0.36 elixir-explorer/explorer#797

Closed

stinodego added the performance label Jan 11, 2024

stinodego changed the title ~~Sort all dataframe columns before doing describe~~ Improve describe performance by sorting columns first Jan 11, 2024

stinodego added the accepted label Jan 11, 2024

stinodego added the good first issue label and removed the enhancement label Jan 11, 2024

stinodego changed the title ~~Improve describe performance by sorting columns first~~ Improve DataFrame.describe performance by sorting columns first Jan 11, 2024

taki-mekhalfa mentioned this issue Jan 18, 2024

perf: optimize DataFrame.describe by presorting columns #13822

Merged

alexander-beedie closed this as completed Apr 7, 2024
