Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve DataFrame.describe performance by sorting columns first #9368

Closed
braaannigan opened this issue Jun 14, 2023 · 12 comments
Closed

Improve DataFrame.describe performance by sorting columns first #9368

braaannigan opened this issue Jun 14, 2023 · 12 comments
Labels
accepted Ready for implementation good first issue Good for newcomers performance Performance issues or improvements

Comments

@braaannigan
Copy link
Collaborator

Problem description

With improvements to describe we now calculate multiple statistics that have fast-track paths. If we called something like this inside describe

df.select(pl.all().sort()).describe()

We could use those fast-track paths. For my test it reduced time by half. The only changes in the results were very small changes to std for some columns (which itself is possibly a bug)

@braaannigan braaannigan added the enhancement New feature or an improvement of an existing feature label Jun 14, 2023
@ritchie46
Copy link
Member

Yeap, makes sense. This will make min/max/median/quantile after that free. Can you make a PR?

@braaannigan
Copy link
Collaborator Author

Sure - is this python only?

@braaannigan
Copy link
Collaborator Author

I see now it's python only

@alexander-beedie
Copy link
Collaborator

Wouldn't a 100 / 1,000 / 10,000 column sort be enormously expensive, counteracting the fast-paths? Or am I misunderstanding the nature of the optimisation? Or should we constrain a forced-sort to a certain maximum number of columns? 😅

@braaannigan
Copy link
Collaborator Author

@ritchie46
The string max/min are incorrect in fast-track mode when there are nulls as the code just picks the first value rather than the first non-null value. For example:

# Min should be "eur"
s = pl.Series("a",["usd","eur",None]) 
s.min()
# This gives
# "eur"
s.sort().min
# This gives NoneType
s.max()
# "usd"
s.sort().max()
# "usd" (this is correct)
# but...
s.sort(descending=True).max()
# Gives NoneType

I think this is the underlying function is Rust:

    pub(crate) fn min_str(&self) -> Option<&str> {
        match self.is_sorted_flag() {
            IsSorted::Ascending => self.get(0),
            IsSorted::Descending => self.get(self.len() - 1),
            IsSorted::Not => self
                .downcast_iter()
                .filter_map(compute::aggregate::min_string)
                .fold_first_(|acc, v| if acc < v { acc } else { v }),
        }
    }

And obviously needs a first_non_null. I've had a go with using the code from the non-string min and max, but I'm getting an error in the tests. I'll push what i've got as a draft PR

@braaannigan
Copy link
Collaborator Author

See #9400

@braaannigan
Copy link
Collaborator Author

Wouldn't a 100 / 1,000 / 10,000 column sort be enormously expensive, counteracting the fast-paths? Or am I misunderstanding the nature of the optimisation? Or should we constrain a forced-sort to a certain maximum number of columns? 😅

My understanding is that all the percentile operations do a sort first anyway. However, at present we sort before doing median and the percentiles. However, if we are only doing the median and percentile operations for numeric columns then we can limit the forced-sort to the numeric columns. How does that sound @alexander-beedie?

@ritchie46
Copy link
Member

The quantiles do part of a sort, but will be O(1) if we pre-sort. I think that presorting is faster because we compute several quantiles and min/max aggregations. Those will all become O(1).

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Jun 20, 2023

Somehow I didn't spot earlier that this referred to describe, where this all makes much more sense (as you are actively collecting related metrics on every column) 🤣

@braaannigan
Copy link
Collaborator Author

Performance isn't look good for this now I'm testing it again. In my previous examples this approach was much faster, now it's slower. I'll keep looking into it

@stinodego stinodego added the performance Performance issues or improvements label Jan 11, 2024
@stinodego stinodego changed the title Sort all dataframe columns before doing describe Improve describe performance by sorting columns first Jan 11, 2024
@stinodego stinodego added the accepted Ready for implementation label Jan 11, 2024
@stinodego
Copy link
Member

I'll accept a PR for this if there is a benchmark showing it's actually faster on realistic data.

@stinodego stinodego added good first issue Good for newcomers and removed enhancement New feature or an improvement of an existing feature labels Jan 11, 2024
@stinodego stinodego changed the title Improve describe performance by sorting columns first Improve DataFrame.describe performance by sorting columns first Jan 11, 2024
@alexander-beedie
Copy link
Collaborator

I think this was closed by #13822.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation good first issue Good for newcomers performance Performance issues or improvements
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants