Improve DataFrame.describe performance by sorting columns first #9368
Comments
Yeap, makes sense. This will make |
Sure - is this python only? |
I see now it's python only |
Wouldn't a 100 / 1,000 / 10,000 column sort be enormously expensive, counteracting the fast-paths? Or am I misunderstanding the nature of the optimisation? Or should we constrain a forced-sort to a certain maximum number of columns? 😅 |
@ritchie46
# Min should be "eur"
s = pl.Series("a", ["usd", "eur", None])
s.min()
# This gives
# "eur"
s.sort().min()
# This gives NoneType
s.max()
# "usd"
s.sort().max()
# "usd" (this is correct)
# but...
s.sort(descending=True).max()
# Gives NoneType
I think this is the underlying function in Rust:
pub(crate) fn min_str(&self) -> Option<&str> {
    match self.is_sorted_flag() {
        IsSorted::Ascending => self.get(0),
        IsSorted::Descending => self.get(self.len() - 1),
        IsSorted::Not => self
            .downcast_iter()
            .filter_map(compute::aggregate::min_string)
            .fold_first_(|acc, v| if acc < v { acc } else { v }),
    }
}
And obviously needs a |
See #9400 |
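To make the null problem above concrete, here is a minimal pure-Python sketch of a sorted fast path that skips nulls. It is not the Polars implementation or the fix from #9400. The helper names `min_sorted` and `max_sorted`, the explicit `null_count` argument, and the assumption that nulls are placed first in both sort directions are all hypothetical, chosen to match the outputs reported in the comment.

```python
from typing import Optional, Sequence


def min_sorted(
    values: Sequence[Optional[str]], null_count: int, descending: bool = False
) -> Optional[str]:
    """Minimum of an already-sorted sequence in O(1).

    Assumption: nulls (None) sit at the front in both sort directions,
    so the non-null values occupy indices null_count .. len(values) - 1.
    """
    n = len(values)
    if null_count == n:
        # Empty or all-null: there is no minimum.
        return None
    if descending:
        # Smallest non-null value is the last element.
        return values[n - 1]
    # Ascending: values[0] may be None, so skip past the nulls.
    return values[null_count]


def max_sorted(
    values: Sequence[Optional[str]], null_count: int, descending: bool = False
) -> Optional[str]:
    """Maximum of an already-sorted sequence in O(1), same assumptions."""
    n = len(values)
    if null_count == n:
        return None
    if descending:
        # Largest non-null value comes right after the leading nulls.
        return values[null_count]
    return values[n - 1]


ascending = [None, "eur", "usd"]
descending = [None, "usd", "eur"]
print(min_sorted(ascending, null_count=1))
print(max_sorted(descending, null_count=1, descending=True))
```

Reading index 0 unconditionally, as the ascending branch of the quoted function does, would return the null instead of the smallest value whenever nulls sort first.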
My understanding is that all the percentile operations do a sort first anyway, and at present we sort before doing the median and the percentiles. If we are only doing the median and percentile operations for numeric columns, then we can limit the forced sort to the numeric columns. How does that sound @alexander-beedie? |
The quantiles do part of a sort, but will be O(1) if we pre-sort. I think that presorting is faster because we compute several quantiles and min/max aggregations. Those will all become O(1). |
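As a rough illustration of why pre-sorting turns these aggregations into O(1) lookups, here is a small pure-Python sketch. The function names are hypothetical, and linear interpolation is an assumption made for the example, not a statement about which interpolation Polars uses.

```python
from typing import Dict, Sequence


def quantile_sorted(sorted_vals: Sequence[float], q: float) -> float:
    """Quantile of already-sorted data by direct indexing (no scan, no sort).

    Assumption: linear interpolation between the two neighbouring ranks.
    """
    if not sorted_vals:
        raise ValueError("need at least one value")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    pos = q * (len(sorted_vals) - 1)   # fractional rank
    lo = int(pos)                      # floor, since pos >= 0
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def summary_from_sorted(sorted_vals: Sequence[float]) -> Dict[str, float]:
    """After one sort, every statistic below is a constant-time lookup."""
    return {
        "min": sorted_vals[0],
        "25%": quantile_sorted(sorted_vals, 0.25),
        "50%": quantile_sorted(sorted_vals, 0.50),
        "75%": quantile_sorted(sorted_vals, 0.75),
        "max": sorted_vals[-1],
    }


data = sorted([5, 1, 4, 2, 3])  # the single O(n log n) step
print(summary_from_sorted(data))
```

The sort is paid once per column; whether that beats several partial sorts plus two scans depends on the data, which is what a benchmark would have to show.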
Somehow I didn't spot earlier that this referred to |
Performance isn't looking good for this now that I'm testing it again. In my previous examples this approach was much faster; now it's slower. I'll keep looking into it. |
I'll accept a PR for this if there is a benchmark showing it's actually faster on realistic data. |
I think this was closed by #13822. |
Problem description
With improvements to describe we now calculate multiple statistics that have fast-track paths. If we called something like this inside describe
We could use those fast-track paths. In my test it cut the runtime in half. The only differences in the results were very small changes to std for some columns (which is itself possibly a bug).