-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Quantile function in SQL #18047
feat: Quantile function in SQL #18047
Conversation
967108d
to
bb4cf5a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18047 +/- ##
==========================================
- Coverage 80.46% 79.78% -0.68%
==========================================
Files 1496 1531 +35
Lines 197234 208445 +11211
Branches 2820 2913 +93
==========================================
+ Hits 158700 166310 +7610
- Misses 38012 41584 +3572
- Partials 522 551 +29 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: there is no QUANTILE
function in PostgreSQL (or ANSI SQL), which is what we usually aim to match. Can you reformulate this to expose one (or both) of the standard PostgreSQL functions, PERCENTILE_DISC
and/or PERCENTILE_CONT
1 instead? I think you will also need to integrate parsing support for WITHIN GROUP
to conform to the expected syntax, eg:
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales)
FROM transactions
This would be a great addition to our SQL interface and I'd be very happy to see this capability merged! 😎👍
Looking at DuckDB they expose QUANTILE_DISC
and QUANTILE_CONT
2 in addition to the two percentile functions above, so we should probably match that API for quantile funcs (we have a few other "friendly" DuckDB functions integrated, such as COLUMNS
3, so there is precedent).
So, ideally...
PERCENTILE_DISC
PERCENTILE_CONT
QUANTILE_DISC
QUANTILE_CONT
...but if adding the two PERCENTILE_*
funcs (and WITHIN GROUP
) feels a bit much then I'd also approve just the two QUANTILE_*
funcs if they conform to DuckDB syntax (which doesn't require WITHIN GROUP
and looks close to your existing implementation).
Footnotes
-
PostgreSQL Ordered-Set Aggregate Functions:
https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE ↩ -
DuckDB Quantile Functions:
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_contx-pos
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_discx-pos ↩ -
Added
COLUMNS
SQL Support:
https://github.com/pola-rs/polars/pull/17295 ↩
I believe I need to gain a deeper understanding of the SQL engine used to implement something more complex than a function itself (required for Related question: I was wondering if it's worth to leave my proposed API as-is (or only leave the |
No problem! And I agree, adding support for the extra syntax is a larger PR 😆
No, because it has no analog in any standard SQL syntax; we aren't just using the SQL interface to expose custom Polars functions with a custom API, because otherwise that's not really SQL. Since there are SQL examples of Note: if there's any disagreement between the DuckDB and Databend syntax then I'd favour DuckDB, as we already have a few of their functions. That way we standardise on PostgreSQL (+ DuckDB if PostgreSQL does not have the desired functions), which helps clarify the expected SQL syntax for users. Footnotes
|
Is it ok now? I have an internal feature in the company depending on this and I would like to at least establish the interface to avoid refactoring client's code later. I was also wondering what's the release policy of the rust crates and whether I should stay on git dependencies for the near future? P.S. I've looked through sqlparser and it seems like WITH GROUP syntax is supported out of the box, so this next PR with percentiles might be coming sooner that I though. |
The syntax looks good to me. I just gave it a quick sanity-check and it's returning the results I expect; will give it a more thorough test tomorrow (edge-cases, behaviour with nulls, etc). One thing I spotted - at the moment Example: df.sql(
"""
SELECT
QUANTILE_DISC(Sales, 0.10) as q10,
QUANTILE_DISC(Sales, 0.25) as q25,
QUANTILE_DISC(Sales, 0.40) as q40,
QUANTILE_DISC(Sales, 0.55) as q55,
QUANTILE_DISC(Sales, 0.70) as q70,
QUANTILE_DISC(Sales, 0.85) as q85
FROM self
"""
)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ q10 ┆ q25 ┆ q40 ┆ q55 ┆ q70 ┆ q85 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │ << should all be int
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ 1000.0 ┆ 2000.0 ┆ 3000.0 ┆ 3000.0 ┆ 4000.0 ┆ 5000.0 │
# └────────┴────────┴────────┴────────┴────────┴────────┘ If you can fix that (and further testing look good tomorrow) then this should be good merge shortly 👍
Yes, it's available in |
Do you suggest casting the result series to the source series datatype explicitly? That sure comes with some performance penalty, is it acceptable in this case? |
I believe casting to the original series dtype is blocked by #4982, I've found no way to check the original dtype as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe casting to the original series dtype is blocked by #4982, I've found no way to check the original dtype as well.
Ok; let's not worry about that for now - we can merge as float and update if necessary.
I've found that quite a few don't seem to line up with DuckDB's implementation (for the DISC variant; CONT seems good), which needs some explanation (and/or a fix).
For example:
import polars as pl
import duckdb as dd
df = pl.DataFrame({"n": [
1.523029,
0.767434,
0.647688,
0.496714,
]})
sql_disc = "SELECT QUANTILE_DISC(n, 0.60) AS q FROM df"
print(dd.sql(sql_disc))
# ┌──────────┐
# │ n │
# │ double │
# ├──────────┤
# │ 0.767434 │
# └──────────┘
print(pl.sql(sql_disc).collect())
# shape: (1, 1)
# ┌──────────┐
# │ n │
# │ --- │
# │ f64 │
# ╞══════════╡
# │ 0.647688 │
# └──────────┘
About half of the quantiles I tried for QUANTILE_DISC
(looping over values from 0.01 to 0.99, in increments of 0.01) returned a different result, using the small example above; potentially an off-by-one error?
I believe DuckDB is using |
Sounds good; I think it's worth returning the same thing here by default 👌 |
After a brief vacation, back to this PR. My conformance test is currently set up the following way: both systems have an integer column df! {
"Year" => [2018, 2018, 2019, 2019, 2020, 2020],
"Country" => ["US", "UK", "US", "UK", "US", "UK"],
"Sales" => [1000, 2000, 3000, 4000, 5000, 6000]
}
DuckDB thinks that 0.7-th discrete quantile of the table is
Whereas my implementation results in DuckDB docs state that the function should return My suggestion is to commit the "lower" interpolation method with my current tests and merge it to the upstream. |
@alexander-beedie wdyt? |
Sounds good to me; have you reported the issue to the DuckDB folks? Doesn't stop us moving forward here, but would be interested in their take on it 👍 |
I've pushed my conformance tests (and also support for 0 and 1 quantiles as they're parsed as integers). I'll open an issue for DuckDB in a few days when I have time to do a more proper test (from DuckDB master branch) and mention this PR in that issue. As for now -- I suggest merging after the final review. |
Can we merge this in the meantime? |
I've also reported the issue to duckdb: duckdb/duckdb#14144 |
@ritchie46 @orlp can anyone look at this in the meantime? |
@pomo-mondreganto duckdb's documentation was wrong. I just fixed it, so you can look up the correct formula in a few minutes, but intuitively it's quite clear what |
That sounds quite reasonable, thanks for the explanation. The formula you provided is indeed correct. However, the polars backend lacks the required "quantile interpolation" method in its execution backend here: let float_idx = ((length - null_count) as f64 - 1.0) * quantile + null_count as f64;
let mut base_idx = match interpol {
QuantileInterpolOptions::Nearest => {
let idx = float_idx.round() as usize;
return (float_idx.round() as usize, 0.0, idx);
},
QuantileInterpolOptions::Lower
| QuantileInterpolOptions::Midpoint
| QuantileInterpolOptions::Linear => float_idx as usize,
QuantileInterpolOptions::Higher => float_idx.ceil() as usize,
};
base_idx = base_idx.clamp(0, length - 1);
let top_idx = f64::ceil(float_idx) as usize; It's easy to see that it'll not be compatible with duckdb's implementation in any case. The problem above was with I could change how this function works in polars (e.g. use the |
@pomo-mondreganto: Apologies for the delay, I was travelling, then sick, and then massively busy at work 😓 So, just to clarify the current state:
If you wanted to break the PR into two pieces (so we don't delay merging the part that works any longer) then we could approve/merge |
Yeah, let's go with a new option. I'll update this PR today to exclude |
Done, |
Frankly, polars should consider changing its discrete interpolation options. They seem to be born out of the same misunderstanding that bore the previously incorrect duckdb documentation (which I had written myself, so no blame here): That discretization should obviously just mean snapping the continuous index to an integer index. In reality, however, none of the currently offered options achieve the one thing you'd most certainly expect from a discrete quantile: That each element is returned with equal chance, for a randomly drawn If you want true quantiles, the only choice you have is what to return for the |
I've actually finished implementing quantile_disc using the new quantile interpolation method "bucket", but waiting for this PR to be merged to have less merge conflicts before opening a new one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One (very!) minor update needed, then this looks good to me 👍
@pomo-mondreganto: All good now - many thanks for this! Based on #18047 (comment) it sounds like we should probably discuss/review the current quantile implementation before moving on to (@soerenwolfers, thanks for the feedback 👌) |
This implements a feature requested in #7227: quantile functions in SQL. I also went ahead and added median function tests as those were missing.