feat: Quantile function in SQL #18047

pomo-mondreganto · 2024-08-05T10:44:53Z

This implements a feature requested in #7227: quantile functions in SQL. I also went ahead and added median function tests as those were missing.

codecov · 2024-08-05T11:14:34Z

Codecov Report

Attention: Patch coverage is 57.89474% with 8 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (fd00ee6) to head (8df8dfe).
Report is 461 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-sql/src/functions.rs	57.89%	8 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #18047      +/-   ##
==========================================
- Coverage   80.46%   79.78%   -0.68%     
==========================================
  Files        1496     1531      +35     
  Lines      197234   208445   +11211     
  Branches     2820     2913      +93     
==========================================
+ Hits       158700   166310    +7610     
- Misses      38012    41584    +3572     
- Partials      522      551      +29


alexander-beedie

FYI: there is no QUANTILE function in PostgreSQL (or ANSI SQL), which is what we usually aim to match. Can you reformulate this to expose one (or both) of the standard PostgreSQL functions, PERCENTILE_DISC and/or PERCENTILE_CONT¹ instead? I think you will also need to integrate parsing support for WITHIN GROUP to conform to the expected syntax, eg:

SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales)
FROM transactions

This would be a great addition to our SQL interface and I'd be very happy to see this capability merged! 😎👍

Looking at DuckDB they expose QUANTILE_DISC and QUANTILE_CONT² in addition to the two percentile functions above, so we should probably match that API for quantile funcs (we have a few other "friendly" DuckDB functions integrated, such as COLUMNS³, so there is precedent).

So, ideally...

PERCENTILE_DISC
PERCENTILE_CONT
QUANTILE_DISC
QUANTILE_CONT

...but if adding the two PERCENTILE_* funcs (and WITHIN GROUP) feels a bit much then I'd also approve just the two QUANTILE_* funcs if they conform to DuckDB syntax (which doesn't require WITHIN GROUP and looks close to your existing implementation).

PostgreSQL Ordered-Set Aggregate Functions:
https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE ↩
DuckDB Quantile Functions:
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_contx-pos
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_discx-pos ↩
Added COLUMNS SQL Support:
https://github.com/pola-rs/polars/pull/17295 ↩

pomo-mondreganto · 2024-08-05T13:04:55Z

I believe I need to gain a deeper understanding of the SQL engine used to implement something more complex than a function itself (required for WITHIN GROUP support), so I'll focus on QUANTILE_* functions in this PR. I also might open a new PR for PERCENTILE_* in the near future.

Related question: I was wondering whether it's worth leaving my proposed API as-is (or leaving only QUANTILE(value, interpolation)) to expose all the interpolation algorithms Polars has, in addition to the _disc (nearest?) and _cont (linear?) quantiles.

alexander-beedie · 2024-08-05T13:25:14Z

> I believe I need to gain a deeper understanding of the SQL engine used to implement something more complex than a function itself (required for WITHIN GROUP support), so I'll focus on QUANTILE_* functions in this PR. I also might open a new PR for PERCENTILE_* in the near future.

No problem! And I agree, adding support for the extra syntax is a larger PR 😆

> Related question: I was wondering whether it's worth leaving my proposed API as-is (or leaving only QUANTILE(value, interpolation)) to expose all the interpolation algorithms Polars has, in addition to the _disc (nearest?) and _cont (linear?) quantiles.

No, because it has no analog in any standard SQL syntax; we aren't just using the SQL interface to expose custom Polars functions with a custom API, because otherwise that's not really SQL. Since there are SQL examples of QUANTILE_DISC and QUANTILE_CONT (in both DuckDB¹ and Databend², for example) then I think it's reasonable to expose the functionality using the established syntax👌

Note: if there's any disagreement between the DuckDB and Databend syntax then I'd favour DuckDB, as we already have a few of their functions. That way we standardise on PostgreSQL (+ DuckDB if PostgreSQL does not have the desired functions), which helps clarify the expected SQL syntax for users.

DuckDB Quantile Functions:
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_contx-pos
https://duckdb.org/docs/sql/functions/aggregates.html#quantile_discx-pos ↩
Databend Quantile Funcs:
https://docs.databend.com/sql/sql-functions/aggregate-functions/aggregate-quantile-cont
https://docs.databend.com/sql/sql-functions/aggregate-functions/aggregate-quantile-disc ↩

pomo-mondreganto · 2024-08-06T14:25:48Z

Is it OK now? I have an internal feature at my company that depends on this, and I would like to at least settle the interface to avoid refactoring client code later. I was also wondering what the release policy for the Rust crates is, and whether I should stay on git dependencies for the near future.

P.S. I've looked through sqlparser and it seems like WITHIN GROUP syntax is supported out of the box, so the next PR with percentiles might be coming sooner than I thought.

alexander-beedie · 2024-08-06T18:18:13Z

> Is it OK now? I have an internal feature at my company that depends on this, and I would like to at least settle the interface to avoid refactoring client code later. I was also wondering what the release policy for the Rust crates is, and whether I should stay on git dependencies for the near future.

The syntax looks good to me. I just gave it a quick sanity-check and it's returning the results I expect; will give it a more thorough test tomorrow (edge-cases, behaviour with nulls, etc).

One thing I spotted - at the moment QUANTILE_DISC returns float values even if the input is integer; as the results are guaranteed to be discrete values from the original data then it should preserve the original dtype, eg if the input is i32 the result should also be i32 (whereas QUANTILE_CONT should indeed always be float).

Example:

df.sql(
  """
  SELECT  
    QUANTILE_DISC(Sales, 0.10) as q10,
    QUANTILE_DISC(Sales, 0.25) as q25,
    QUANTILE_DISC(Sales, 0.40) as q40,
    QUANTILE_DISC(Sales, 0.55) as q55,
    QUANTILE_DISC(Sales, 0.70) as q70,
    QUANTILE_DISC(Sales, 0.85) as q85
  FROM self 
  """
)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ q10    ┆ q25    ┆ q40    ┆ q55    ┆ q70    ┆ q85    │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    │  << should all be int
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ 1000.0 ┆ 2000.0 ┆ 3000.0 ┆ 3000.0 ┆ 4000.0 ┆ 5000.0 │
# └────────┴────────┴────────┴────────┴────────┴────────┘

If you can fix that (and further testing looks good tomorrow) then this should be good to merge shortly 👍
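The dtype argument can be illustrated with a minimal pure-Python sketch. It is not the Polars implementation: the helper names are hypothetical, the data is assumed to be the six Sales values 1000 to 6000, and the discrete variant is assumed to use the "lower" rule. A discrete quantile always returns one of the input elements, so an integer input can stay an integer, while the continuous variant interpolates and therefore yields a float.

```python
import math


def quantile_disc_lower(values, q):
    """Hypothetical helper: discrete quantile, 'lower' rule (always an input element)."""
    ordered = sorted(values)
    idx = math.floor(q * (len(ordered) - 1))
    return ordered[idx]


def quantile_cont_linear(values, q):
    """Hypothetical helper: continuous quantile via linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return float(ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo]))


sales = [1000, 2000, 3000, 4000, 5000, 6000]  # assumed input column
disc = quantile_disc_lower(sales, 0.25)   # an element of `sales`, still an int
cont = quantile_cont_linear(sales, 0.25)  # interpolated between 2000 and 3000
print(type(disc).__name__, disc)
print(type(cont).__name__, cont)
```

Under these assumptions the discrete result is the integer 2000 and the continuous result is the float 2250.0.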

> P.S. I've looked through sqlparser and it seems like WITHIN GROUP syntax is supported out of the box, so the next PR with percentiles might be coming sooner than I thought.

Yes, it's available in sqlparser-rs, we just don't have any integration on our side yet.

pomo-mondreganto · 2024-08-06T20:00:36Z

Do you suggest casting the result series to the source series datatype explicitly? That surely comes with some performance penalty; is it acceptable in this case?

pomo-mondreganto · 2024-08-07T07:19:58Z

I believe casting to the original series dtype is blocked by #4982; I've also found no way to check the original dtype.

alexander-beedie

> I believe casting to the original series dtype is blocked by #4982; I've also found no way to check the original dtype.

Ok; let's not worry about that for now - we can merge as float and update if necessary.

I've found that quite a few results don't seem to line up with DuckDB's implementation (for the DISC variant; CONT seems good), which needs some explanation (and/or a fix).

For example:

import polars as pl
import duckdb as dd

df = pl.DataFrame({"n": [
    1.523029,
    0.767434,
    0.647688,
    0.496714,
]})

sql_disc = "SELECT QUANTILE_DISC(n, 0.60) AS q FROM df"
print(dd.sql(sql_disc))
# ┌──────────┐
# │    n     │
# │  double  │
# ├──────────┤
# │ 0.767434 │
# └──────────┘

print(pl.sql(sql_disc).collect())
# shape: (1, 1)
# ┌──────────┐
# │ n        │
# │ ---      │
# │ f64      │
# ╞══════════╡
# │ 0.647688 │
# └──────────┘

About half of the quantiles I tried for QUANTILE_DISC (looping over values from 0.01 to 0.99, in increments of 0.01) returned a different result, using the small example above; potentially an off-by-one error?
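The sweep described above can be approximated without either library. This is a sketch under two assumptions: that the PR's QUANTILE_DISC used the "lower" rule at this point (as stated further down the thread), and that DuckDB picks the element at index ceil(q * n) - 1, which follows from the equal-width interval description given later in this conversation. The function names are hypothetical.

```python
from fractions import Fraction
import math

data = sorted([1.523029, 0.767434, 0.647688, 0.496714])
n = len(data)


def lower_index(q):
    # "lower" rule: truncate the continuous index q * (n - 1)
    return math.floor(q * (n - 1))


def bucket_index(q):
    # equal-width, left-open/right-closed buckets (assumed DuckDB behaviour)
    return max(math.ceil(q * n) - 1, 0)


# Exact fractions avoid floating-point surprises at bucket boundaries.
quantiles = [Fraction(i, 100) for i in range(1, 100)]
mismatches = [q for q in quantiles if lower_index(q) != bucket_index(q)]
print(len(mismatches), "of", len(quantiles), "quantiles disagree")
print(data[lower_index(Fraction(60, 100))], data[bucket_index(Fraction(60, 100))])
```

With these two rules, 48 of the 99 quantiles disagree, and q = 0.60 selects 0.647688 under the "lower" rule but 0.767434 under the bucket rule, which is consistent with the two outputs shown above.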

pomo-mondreganto · 2024-08-08T11:55:16Z

I believe DuckDB is using the nearest quantile, whereas my implementation uses lower. That shouldn't be a problem; I'll change it to 'nearest' and add some compliance test cases then.

alexander-beedie · 2024-08-09T04:47:10Z

> I believe DuckDB is using the nearest quantile, whereas my implementation uses lower. That shouldn't be a problem; I'll change it to 'nearest' and add some compliance test cases then.

Sounds good; I think it's worth returning the same thing here by default 👌

pomo-mondreganto · 2024-09-05T08:04:27Z

After a brief vacation, back to this PR. My conformance test is currently set up the following way: both systems have an integer column Sales consisting of 6 numbers 1000 * (i + 1):

df! {
  "Year" => [2018, 2018, 2019, 2019, 2020, 2020],
  "Country" => ["US", "UK", "US", "UK", "US", "UK"],
  "Sales" => [1000, 2000, 3000, 4000, 5000, 6000]
}

D select * from kek;
┌───────┐
│ sales │
│ int32 │
├───────┤
│  1000 │
│  2000 │
│  3000 │
│  4000 │
│  5000 │
│  6000 │
└───────┘

DuckDB thinks that the 0.7 discrete quantile of the table is 5000:

D select quantile_disc(sales, 0.7) from kek;
┌───────────────────────────┐
│ quantile_disc(sales, 0.7) │
│           int32           │
├───────────────────────────┤
│                      5000 │
└───────────────────────────┘

My implementation, by contrast, returns 4000 with the lower interpolation method.

DuckDB's docs state that the function should return the floor(pos * (n_nonnull_values - 1))th (zero-indexed) element. In our case that is floor(0.7 * (6 - 1)) = 3, so it should return the element at zero-based index 3, which is 4000 and also what my implementation returns. I believe this to be a bug in DuckDB.

My suggestion is to commit the "lower" interpolation method with my current tests and merge it upstream.
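The arithmetic above can be checked in a few lines. This only evaluates the formula as quoted from the DuckDB docs at the time; it says nothing about what DuckDB actually computes.

```python
from fractions import Fraction
import math

sales = [1000, 2000, 3000, 4000, 5000, 6000]
q = Fraction(7, 10)  # exact 0.7, avoids floating-point rounding

# Formula as quoted: floor(pos * (n_nonnull_values - 1)), zero-indexed
idx = math.floor(q * (len(sales) - 1))  # floor(3.5) = 3
print(idx, sales[idx])
```

This prints index 3 and the value 4000, matching the "lower" result rather than the 5000 that DuckDB returned.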

pomo-mondreganto · 2024-09-06T06:22:30Z

@alexander-beedie wdyt?

alexander-beedie · 2024-09-07T11:53:46Z

> @alexander-beedie wdyt?

Sounds good to me; have you reported the issue to the DuckDB folks? Doesn't stop us moving forward here, but would be interested in their take on it 👍

pomo-mondreganto · 2024-09-09T11:36:16Z

I've pushed my conformance tests (and also support for 0 and 1 quantiles, as they're parsed as integers). I'll open an issue for DuckDB in a few days, when I have time to do a more thorough test (against the DuckDB master branch), and mention this PR in that issue. For now, I suggest merging after the final review.
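The point about 0 and 1 is that a quantile written without a decimal point arrives as an integer literal, so the argument handling has to accept both integer and float forms. A rough Python sketch of that idea follows; `parse_quantile` is a hypothetical helper for illustration and does not mirror the actual Rust code in polars-sql.

```python
def parse_quantile(literal: str) -> float:
    """Hypothetical helper: accept '0', '1', '0.25', ... and return a float in [0, 1]."""
    try:
        value = float(literal)  # handles both integer and decimal literals
    except ValueError:
        raise ValueError(f"quantile must be a numeric literal, got {literal!r}") from None
    # NaN fails this comparison too, so it is rejected as well.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"quantile must be between 0 and 1, got {value}")
    return value


print(parse_quantile("0"), parse_quantile("1"), parse_quantile("0.25"))
```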

pomo-mondreganto · 2024-09-16T15:32:14Z

Can we merge this in the meantime?

pomo-mondreganto · 2024-09-19T07:09:40Z

@alexander-beedie :(

pomo-mondreganto · 2024-09-27T07:49:06Z

I've also reported the issue to duckdb: duckdb/duckdb#14144

pomo-mondreganto · 2024-09-27T07:50:31Z

@ritchie46 @orlp can anyone look at this in the meantime?

soerenwolfers · 2024-09-27T16:56:22Z

@pomo-mondreganto DuckDB's documentation was wrong. I just fixed it, so you can look up the correct formula in a few minutes, but intuitively it's quite clear what quantile_disc(x, q) is trying to do: split the interval [0, 1] into as many equal-length intervals as you have elements, then return the ith element if q is in the ith interval. In your case, the intervals are [0, 1/6], (1/6, 2/6], (2/6, 3/6], (3/6, 4/6], (4/6, 5/6], (5/6, 1], so without going into detail on how edge cases are resolved (left-open, right-closed intervals), it's clear that 0.7 is in the penultimate interval and the answer therefore has to be the penultimate item, 5000.
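The interval description above translates into a short sketch: with n elements and left-open, right-closed intervals of width 1/n, q falls in the interval with zero-based index ceil(q * n) - 1 (clamped to 0 for q = 0). The function name is hypothetical, and this is an illustration of the description rather than DuckDB's code.

```python
from fractions import Fraction
import math


def quantile_disc_bucket(values, q):
    """Hypothetical helper: return the element whose equal-width bucket contains q."""
    ordered = sorted(values)
    n = len(ordered)
    q = Fraction(str(q))  # exact arithmetic, so boundaries such as 3/6 behave predictably
    idx = max(math.ceil(q * n) - 1, 0)  # left-open, right-closed buckets
    return ordered[idx]


sales = [1000, 2000, 3000, 4000, 5000, 6000]
print(quantile_disc_bucket(sales, 0.7))
```

For the six-element example this returns 5000 for q = 0.7, since 0.7 lies in (4/6, 5/6].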

pomo-mondreganto · 2024-09-29T10:15:33Z

That sounds quite reasonable, thanks for the explanation. The formula you provided is indeed correct. However, Polars lacks the required quantile interpolation method in its execution backend here:

    let float_idx = ((length - null_count) as f64 - 1.0) * quantile + null_count as f64;
    let mut base_idx = match interpol {
        QuantileInterpolOptions::Nearest => {
            let idx = float_idx.round() as usize;
            return (float_idx.round() as usize, 0.0, idx);
        },
        QuantileInterpolOptions::Lower
        | QuantileInterpolOptions::Midpoint
        | QuantileInterpolOptions::Linear => float_idx as usize,
        QuantileInterpolOptions::Higher => float_idx.ceil() as usize,
    };

    base_idx = base_idx.clamp(0, length - 1);
    let top_idx = f64::ceil(float_idx) as usize;

It's easy to see that none of these options is compatible with DuckDB's implementation in every case. The problem above was with q=0.7 and QuantileInterpolOptions::Lower, and switching to Nearest or Higher would break the q=0.1 case: (6 - 1) * 0.1 = 0.5, so both round and ceil return 1, whereas DuckDB picks index 0.

I could change how this function works in Polars (e.g. use ((length - null_count) as f64) * quantile - 1.0 for float_idx instead), but that would be a backward-incompatible change in core, and I would certainly need approval from a code owner beforehand. What do you think?
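To make the incompatibility concrete, here is a small Python port of the index selection from the Rust excerpt above, compared with the equal-width rule attributed to DuckDB. Assumptions: no nulls (null_count = 0), the function names are hypothetical, and round is modelled as "half away from zero", which is how Rust's f64::round behaves (Python's built-in round would give 0 for 0.5).

```python
from fractions import Fraction
import math


def base_index(q, n, method):
    """Index chosen by each existing option, assuming null_count = 0."""
    float_idx = Fraction(q) * (n - 1)
    if method == "nearest":
        return math.floor(float_idx + Fraction(1, 2))  # round half away from zero
    if method == "lower":
        return math.floor(float_idx)
    if method == "higher":
        return math.ceil(float_idx)
    raise ValueError(f"unknown method: {method}")


def duckdb_index(q, n):
    """Equal-width bucket rule (assumed from the description in this thread)."""
    return max(math.ceil(Fraction(q) * n) - 1, 0)


n = 6
for q in ("0.1", "0.7"):
    row = {m: base_index(q, n, m) for m in ("lower", "nearest", "higher")}
    print(q, row, "duckdb:", duckdb_index(q, n))
```

Under these assumptions, "lower" agrees with the bucket rule at q = 0.1 but not at q = 0.7, while "nearest" and "higher" agree at q = 0.7 but not at q = 0.1.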

alexander-beedie · 2024-09-30T12:13:04Z

@pomo-mondreganto: Apologies for the delay, I was travelling, then sick, and then massively busy at work 😓

So, just to clarify the current state:

Would we need a new option in QuantileInterpolOptions to match DuckDB's implementation, or is a more substantial change required?
If the former, that doesn't sound unreasonable, as we could expose the new option everywhere.
The QUANTILE_CONT function matches the expected values, only QUANTILE_DISC needs a decision.

If you wanted to break the PR into two pieces (so we don't delay merging the part that works any longer) then we could approve/merge QUANTILE_CONT, and make a separate PR/Issue for QUANTILE_DISC? 👍

pomo-mondreganto · 2024-10-01T10:06:52Z

Yeah, let's go with a new option. I'll update this PR today to exclude quantile_disc to merge quantile_cont and return with quantile_disc when it's ready. Seems like I'll need to change a ton of code to handle the new option, will take a while.

pomo-mondreganto · 2024-10-01T11:52:53Z

Done, QUANTILE_CONT should be ready to merge now, @alexander-beedie

soerenwolfers · 2024-10-02T09:13:47Z

Frankly, polars should consider changing its discrete interpolation options. They seem to be born out of the same misunderstanding that bore the previously incorrect duckdb documentation (which I had written myself, so no blame here): That discretization should obviously just mean snapping the continuous index to an integer index. In reality, however, none of the currently offered options achieve the one thing you'd most certainly expect from a discrete quantile: That each element is returned with equal chance, for a randomly drawn q in [0, 1]. In other words, they fail to achieve the main goal of quantiles: To partition the values of the given distribution into equally "large" subgroups. In fact, the Lower and Higher options essentially never return one of the elements, whereas the Nearest option returns the two extreme elements half as frequently as the other elements.

If you want true quantiles, the only choice you have is what to return for the q-values "at the internal boundaries" of the partition of [0, 1].
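The frequency claim can be checked numerically with a sketch like the one below. It sweeps q over an evenly spaced grid on [0, 1] for six elements and counts how often each index is selected. The option formulas are assumptions: the three existing options are modelled as floor, ceil and round-half-up of q * (n - 1), and the equal-width rule as ceil(q * n) - 1, here labelled "bucket" for convenience.

```python
from collections import Counter
from fractions import Fraction
import math

N_ELEMENTS = 6
GRID = 600  # q takes the values 0/600, 1/600, ..., 600/600


def index_for(q, n, method):
    pos = q * (n - 1)
    if method == "lower":
        return math.floor(pos)
    if method == "higher":
        return math.ceil(pos)
    if method == "nearest":
        return math.floor(pos + Fraction(1, 2))
    if method == "bucket":
        return max(math.ceil(q * n) - 1, 0)
    raise ValueError(f"unknown method: {method}")


def frequencies(method):
    """How many grid points select each of the n element indices."""
    counts = Counter(
        index_for(Fraction(i, GRID), N_ELEMENTS, method) for i in range(GRID + 1)
    )
    return [counts[k] for k in range(N_ELEMENTS)]


for method in ("lower", "higher", "nearest", "bucket"):
    print(f"{method:8s}", frequencies(method))
```

On this grid, "lower" selects the last element only at q = 1 and "higher" selects the first only at q = 0; "nearest" selects the two extremes about half as often as the others; only the equal-width rule spreads the selections evenly (apart from the single boundary point q = 0).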

pomo-mondreganto · 2024-10-02T13:45:18Z

I've actually finished implementing quantile_disc using a new quantile interpolation method, "bucket", but I'm waiting for this PR to be merged before opening a new one, so that there are fewer merge conflicts.

alexander-beedie

One (very!) minor update needed, then this looks good to me 👍

crates/polars-sql/src/functions.rs

alexander-beedie · 2024-10-08T09:02:18Z

@pomo-mondreganto: All good now - many thanks for this!

Based on #18047 (comment) it sounds like we should probably discuss/review the current quantile implementation before moving on to QUANTILE_DISC - sounds like you have already prepared a PR that could improve things, which would be great (and I can ask some of the other devs to take a look at a new PR) 🤔

(@soerenwolfers, thanks for the feedback 👌)

pomo-mondreganto requested review from ritchie46, orlp, c-peters and alexander-beedie as code owners August 5, 2024 10:44

github-actions bot added the title needs formatting label Aug 5, 2024

pomo-mondreganto changed the title ~~Implement quantile function in SQL~~ Quantile function in SQL Aug 5, 2024

pomo-mondreganto added 2 commits August 5, 2024 13:53

Implement quantile function in SQL

b211cd1

fix lint

bb4cf5a

pomo-mondreganto force-pushed the feature/sql-quantiles branch from 967108d to bb4cf5a Compare August 5, 2024 10:53

coastalwhite changed the title ~~Quantile function in SQL~~ feat: Quantile function in SQL Aug 5, 2024

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Aug 5, 2024

alexander-beedie requested changes Aug 5, 2024

alexander-beedie added the A-sql Area: Polars SQL functionality label Aug 5, 2024

pomo-mondreganto added 2 commits August 6, 2024 09:58

Rewrite quantile function to quantile_disc and quantile_cont

f7f3f52

Remove quantile interpol options impl

c2e1e9e

pomo-mondreganto requested a review from alexander-beedie August 6, 2024 07:30

alexander-beedie requested changes Aug 8, 2024

Add conformance tests, support integer quantiles (0 and 1)

a2cceac

pomo-mondreganto requested a review from alexander-beedie September 9, 2024 14:29

Remove quantile_disc implementation

c369f9a

alexander-beedie requested changes Oct 7, 2024

Fix review issues

8df8dfe

pomo-mondreganto requested a review from alexander-beedie October 8, 2024 08:26

alexander-beedie approved these changes Oct 8, 2024

alexander-beedie merged commit 9dada18 into pola-rs:main Oct 8, 2024
20 checks passed

pomo-mondreganto mentioned this pull request Oct 8, 2024

feat: New quantile interpolation method & QUANTILE_DISC function in SQL #19139

Merged

pomo-mondreganto deleted the feature/sql-quantiles branch October 8, 2024 14:58
