Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Quantile function in SQL #18047

Merged

Conversation

pomo-mondreganto
Copy link
Contributor

@pomo-mondreganto pomo-mondreganto commented Aug 5, 2024

This implements a feature requested in #7227: quantile functions in SQL. I also went ahead and added median function tests as those were missing.

@pomo-mondreganto pomo-mondreganto changed the title Implement quantile function in SQL Quantile function in SQL Aug 5, 2024
Copy link

codecov bot commented Aug 5, 2024

Codecov Report

Attention: Patch coverage is 57.89474% with 8 lines in your changes missing coverage. Please review.

Project coverage is 79.78%. Comparing base (fd00ee6) to head (8df8dfe).
Report is 461 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-sql/src/functions.rs 57.89% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18047      +/-   ##
==========================================
- Coverage   80.46%   79.78%   -0.68%     
==========================================
  Files        1496     1531      +35     
  Lines      197234   208445   +11211     
  Branches     2820     2913      +93     
==========================================
+ Hits       158700   166310    +7610     
- Misses      38012    41584    +3572     
- Partials      522      551      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coastalwhite coastalwhite changed the title Quantile function in SQL feat: Quantile function in SQL Aug 5, 2024
@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Aug 5, 2024
Copy link
Collaborator

@alexander-beedie alexander-beedie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: there is no QUANTILE function in PostgreSQL (or ANSI SQL), which is what we usually aim to match. Can you reformulate this to expose one (or both) of the standard PostgreSQL functions, PERCENTILE_DISC and/or PERCENTILE_CONT1 instead? I think you will also need to integrate parsing support for WITHIN GROUP to conform to the expected syntax, eg:

SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales)
FROM transactions

This would be a great addition to our SQL interface and I'd be very happy to see this capability merged! 😎👍

Looking at DuckDB they expose QUANTILE_DISC and QUANTILE_CONT2 in addition to the two percentile functions above, so we should probably match that API for quantile funcs (we have a few other "friendly" DuckDB functions integrated, such as COLUMNS3, so there is precedent).

So, ideally...

  • PERCENTILE_DISC
  • PERCENTILE_CONT
  • QUANTILE_DISC
  • QUANTILE_CONT

...but if adding the two PERCENTILE_* funcs (and WITHIN GROUP) feels a bit much then I'd also approve just the two QUANTILE_* funcs if they conform to DuckDB syntax (which doesn't require WITHIN GROUP and looks close to your existing implementation).

Footnotes

  1. PostgreSQL Ordered-Set Aggregate Functions:
    https://www.postgresql.org/docs/current/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE

  2. DuckDB Quantile Functions:
    https://duckdb.org/docs/sql/functions/aggregates.html#quantile_contx-pos
    https://duckdb.org/docs/sql/functions/aggregates.html#quantile_discx-pos

  3. Added COLUMNS SQL Support:
    https://github.com/pola-rs/polars/pull/17295

@alexander-beedie alexander-beedie added the A-sql Area: Polars SQL functionality label Aug 5, 2024
@pomo-mondreganto
Copy link
Contributor Author

I believe I need to gain a deeper understanding of the SQL engine used to implement something more complex than a function itself (required for WITHIN GROUP support), so I'll focus on QUANTILE_* functions in this PR. I also might open a new PR for PERCENTILE_* in the near future.

Related question: I was wondering if it's worth to leave my proposed API as-is (or only leave the QUANTILE(value, interpolation)) to expose all the available interpolation algorithms polars has in addition to _disc (nearest?) and _cont (linear?) quantiles.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Aug 5, 2024

I believe I need to gain a deeper understanding of the SQL engine used to implement something more complex than a function itself (required for WITHIN GROUP support), so I'll focus on QUANTILE_* functions in this PR. I also might open a new PR for PERCENTILE_* in the near future.

No problem! And I agree, adding support for the extra syntax is a larger PR 😆

Related question: I was wondering if it's worth to leave my proposed API as-is (or only leave the QUANTILE(value, interpolation)) to expose all the available interpolation algorithms polars has in addition to _disc (nearest?) and _cont (linear?) quantiles.

No, because it has no analog in any standard SQL syntax; we aren't just using the SQL interface to expose custom Polars functions with a custom API, because otherwise that's not really SQL. Since there are SQL examples of QUANTILE_DISC and QUANTILE_CONT (in both DuckDB1 and Databend2, for example) then I think it's reasonable to expose the functionality using the established syntax👌

Note: if there's any disagreement between the DuckDB and Databend syntax then I'd favour DuckDB, as we already have a few of their functions. That way we standardise on PostgreSQL (+ DuckDB if PostgreSQL does not have the desired functions), which helps clarify the expected SQL syntax for users.

Footnotes

  1. DuckDB Quantile Functions:
    https://duckdb.org/docs/sql/functions/aggregates.html#quantile_contx-pos
    https://duckdb.org/docs/sql/functions/aggregates.html#quantile_discx-pos

  2. Databend Quantile Funcs:
    https://docs.databend.com/sql/sql-functions/aggregate-functions/aggregate-quantile-cont
    https://docs.databend.com/sql/sql-functions/aggregate-functions/aggregate-quantile-disc

@pomo-mondreganto
Copy link
Contributor Author

pomo-mondreganto commented Aug 6, 2024

Is it ok now? I have an internal feature in the company depending on this and I would like to at least establish the interface to avoid refactoring client's code later. I was also wondering what's the release policy of the rust crates and whether I should stay on git dependencies for the near future?

P.S. I've looked through sqlparser and it seems like WITH GROUP syntax is supported out of the box, so this next PR with percentiles might be coming sooner that I though.

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Aug 6, 2024

Is it ok now? I have an internal feature in the company depending on this and I would like to at least establish the interface to avoid refactoring client's code later. I was also wondering what's the release policy of the rust crates and whether I should stay on git dependencies for the near future?

The syntax looks good to me. I just gave it a quick sanity-check and it's returning the results I expect; will give it a more thorough test tomorrow (edge-cases, behaviour with nulls, etc).

One thing I spotted - at the moment QUANTILE_DISC returns float values even if the input is integer; as the results are guaranteed to be discrete values from the original data then it should preserve the original dtype, eg if the input is i32 the result should also be i32 (whereas QUANTILE_CONT should indeed always be float).

Example:

df.sql(
  """
  SELECT  
    QUANTILE_DISC(Sales, 0.10) as q10,
    QUANTILE_DISC(Sales, 0.25) as q25,
    QUANTILE_DISC(Sales, 0.40) as q40,
    QUANTILE_DISC(Sales, 0.55) as q55,
    QUANTILE_DISC(Sales, 0.70) as q70,
    QUANTILE_DISC(Sales, 0.85) as q85
  FROM self 
  """
)
# ┌────────┬────────┬────────┬────────┬────────┬────────┐
# │ q10    ┆ q25    ┆ q40    ┆ q55    ┆ q70    ┆ q85    │
# │ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
# │ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    ┆ f64    │  << should all be int
# ╞════════╪════════╪════════╪════════╪════════╪════════╡
# │ 1000.0 ┆ 2000.0 ┆ 3000.0 ┆ 3000.0 ┆ 4000.0 ┆ 5000.0 │
# └────────┴────────┴────────┴────────┴────────┴────────┘

If you can fix that (and further testing look good tomorrow) then this should be good merge shortly 👍

P.S. I've looked through sqlparser and it seems like WITH GROUP syntax is supported out of the box, so this next PR with percentiles might be coming sooner that I though.

Yes, it's available in sqlparser-rs, we just don't have any integration on our side yet.

@pomo-mondreganto
Copy link
Contributor Author

Do you suggest casting the result series to the source series datatype explicitly? That sure comes with some performance penalty, is it acceptable in this case?

@pomo-mondreganto
Copy link
Contributor Author

I believe casting to the original series dtype is blocked by #4982, I've found no way to check the original dtype as well.

Copy link
Collaborator

@alexander-beedie alexander-beedie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe casting to the original series dtype is blocked by #4982, I've found no way to check the original dtype as well.

Ok; let's not worry about that for now - we can merge as float and update if necessary.

I've found that quite a few don't seem to line up with DuckDB's implementation (for the DISC variant; CONT seems good), which needs some explanation (and/or a fix).

For example:

import polars as pl
import duckdb as dd

df = pl.DataFrame({"n": [
    1.523029,
    0.767434,
    0.647688,
    0.496714,
]})
sql_disc = "SELECT QUANTILE_DISC(n, 0.60) AS q FROM df"
print(dd.sql(sql_disc))
# ┌──────────┐
# │    n     │
# │  double  │
# ├──────────┤
# │ 0.767434 │
# └──────────┘

print(pl.sql(sql_disc).collect())
# shape: (1, 1)
# ┌──────────┐
# │ n        │
# │ ---      │
# │ f64      │
# ╞══════════╡
# │ 0.647688 │
# └──────────┘

About half of the quantiles I tried for QUANTILE_DISC (looping over values from 0.01 to 0.99, in increments of 0.01) returned a different result, using the small example above; potentially an off-by-one error?

@pomo-mondreganto
Copy link
Contributor Author

I believe DuckDB is using nearest quantile where's my impl uses lower. That shouldn't be a problem, I'll change it to 'nearest' and add some compliance test cases then.

@alexander-beedie
Copy link
Collaborator

I believe DuckDB is using nearest quantile where's my impl uses lower. That shouldn't be a problem, I'll change it to 'nearest' and add some compliance test cases then.

Sounds good; I think it's worth returning the same thing here by default 👌

@pomo-mondreganto
Copy link
Contributor Author

After a brief vacation, back to this PR. My conformance test is currently set up the following way: both systems have an integer column Sales consisting of 6 numbers 1000 * (i + 1):

df! {
  "Year" => [2018, 2018, 2019, 2019, 2020, 2020],
  "Country" => ["US", "UK", "US", "UK", "US", "UK"],
  "Sales" => [1000, 2000, 3000, 4000, 5000, 6000]
}
D select * from kek;
┌───────┐
│ sales │
│ int32 │
├───────┤
│  1000 │
│  2000 │
│  3000 │
│  4000 │
│  5000 │
│  6000 │
└───────┘

DuckDB thinks that 0.7-th discrete quantile of the table is 5000:

D select quantile_disc(sales, 0.7) from kek;
┌───────────────────────────┐
│ quantile_disc(sales, 0.7) │
│           int32           │
├───────────────────────────┤
│                      5000 │
└───────────────────────────┘

Whereas my implementation results in 4000 with lower interpolation method.

DuckDB docs state that the function should return the floor(pos * (n_nonnull_values - 1))th (zero-indexed) element, so for our case it's floor(0.7 * (6 - 1)), which is 3, so it should return 3rd 0-indexed element, which is 4000 and also what my implementation does. I believe this to be a bug in DuckDB.

My suggestion is to commit the "lower" interpolation method with my current tests and merge it to the upstream.

@pomo-mondreganto
Copy link
Contributor Author

@alexander-beedie wdyt?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 7, 2024

@alexander-beedie wdyt?

Sounds good to me; have you reported the issue to the DuckDB folks? Doesn't stop us moving forward here, but would be interested in their take on it 👍

@pomo-mondreganto
Copy link
Contributor Author

I've pushed my conformance tests (and also support for 0 and 1 quantiles as they're parsed as integers). I'll open an issue for DuckDB in a few days when I have time to do a more proper test (from DuckDB master branch) and mention this PR in that issue. As for now -- I suggest merging after the final review.

@pomo-mondreganto
Copy link
Contributor Author

Can we merge this in the meantime?

@pomo-mondreganto
Copy link
Contributor Author

@alexander-beedie :(

@pomo-mondreganto
Copy link
Contributor Author

I've also reported the issue to duckdb: duckdb/duckdb#14144

@pomo-mondreganto
Copy link
Contributor Author

@ritchie46 @orlp can anyone look at this in the meantime?

@soerenwolfers
Copy link

@pomo-mondreganto duckdb's documentation was wrong. I just fixed it, so you can look up the correct formula in a few minutes, but intuitively it's quite clear what quantile_disc(x, q) is trying to do: Split up the interval [0, 1] in as many equal-length intervals as you have elements, then return the ith element if q is in the ith interval. In your case, the intervals are [0,1/6], (1/6, 2/6], (2/6, 3/6],(3/6, 4/6],(4/6, 5/6], (5/6, 1] so without going into detail how edge cases are resolved (left open right closed intervals) it's clear that 0.7 is in the penultimate interval and the answer therefore has to be the penultimate item, 5000.

@pomo-mondreganto
Copy link
Contributor Author

pomo-mondreganto commented Sep 29, 2024

That sounds quite reasonable, thanks for the explanation. The formula you provided is indeed correct. However, the polars backend lacks the required "quantile interpolation" method in its execution backend here:

    let float_idx = ((length - null_count) as f64 - 1.0) * quantile + null_count as f64;
    let mut base_idx = match interpol {
        QuantileInterpolOptions::Nearest => {
            let idx = float_idx.round() as usize;
            return (float_idx.round() as usize, 0.0, idx);
        },
        QuantileInterpolOptions::Lower
        | QuantileInterpolOptions::Midpoint
        | QuantileInterpolOptions::Linear => float_idx as usize,
        QuantileInterpolOptions::Higher => float_idx.ceil() as usize,
    };

    base_idx = base_idx.clamp(0, length - 1);
    let top_idx = f64::ceil(float_idx) as usize;

It's easy to see that it'll not be compatible with duckdb's implementation in any case. The problem above was with q=0.7 and QuantileInterpolOptions::Linear, and switching to Lower or Higher would break the q=0.1 case as (6 - 1) * 0.1 = 0.5, both round and ceil would return 1, whereas duckdb returns 0.

I could change how this function works in polars (e.g. use the ((length - null_count) as f64) * quantile - 1.0 for float_idx instead, but that would mean a backward-incompatible change in core and I sure would need an approval from any code owner beforehand. Please respond, what do you think?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Sep 30, 2024

@pomo-mondreganto: Apologies for the delay, I was travelling, then sick, and then massively busy at work 😓

So, just to clarify the current state:

  • We would need a new option in QuantileInterpolOptions to match DuckDB's implementation? Or a more substantial change is required?
  • If the former, that doesn't sound unreasonable, as we could expose the new option everywhere.
  • The QUANTILE_CONT function matches the expected values, only QUANTILE_DISC needs a decision.

If you wanted to break the PR into two pieces (so we don't delay merging the part that works any longer) then we could approve/merge QUANTILE_CONT, and make a separate PR/Issue for QUANTILE_DISC? 👍

@pomo-mondreganto
Copy link
Contributor Author

Yeah, let's go with a new option. I'll update this PR today to exclude quantile_disc to merge quantile_cont and return with quantile_disc when it's ready. Seems like I'll need to change a ton of code to handle the new option, will take a while.

@pomo-mondreganto
Copy link
Contributor Author

Done, QUANTILE_CONT should be ready to merge now, @alexander-beedie

@soerenwolfers
Copy link

soerenwolfers commented Oct 2, 2024

Frankly, polars should consider changing its discrete interpolation options. They seem to be born out of the same misunderstanding that bore the previously incorrect duckdb documentation (which I had written myself, so no blame here): That discretization should obviously just mean snapping the continuous index to an integer index. In reality, however, none of the currently offered options achieve the one thing you'd most certainly expect from a discrete quantile: That each element is returned with equal chance, for a randomly drawn q in [0, 1]. In other words, they fail to achieve the main goal of quantiles: To partition the values of the given distribution into equally "large" subgroups. In fact, the Lower and Higher options essentially never return one of the elements, whereas the Nearest option returns the two extreme elements half as frequently as the other elements.

If you want true quantiles, the only choice you have is what to return for the q-values "at the internal boundaries" of the partition of [0, 1].

@pomo-mondreganto
Copy link
Contributor Author

I've actually finished implementing quantile_disc using the new quantile interpolation method "bucket", but waiting for this PR to be merged to have less merge conflicts before opening a new one.

Copy link
Collaborator

@alexander-beedie alexander-beedie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One (very!) minor update needed, then this looks good to me 👍

crates/polars-sql/src/functions.rs Outdated Show resolved Hide resolved
@alexander-beedie alexander-beedie merged commit 9dada18 into pola-rs:main Oct 8, 2024
20 checks passed
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented Oct 8, 2024

@pomo-mondreganto: All good now - many thanks for this!

Based on #18047 (comment) it sounds like we should probably discuss/review the current quantile implementation before moving on to QUANTILE_DISC - sounds like you have already prepared a PR that could improve things, which would be great (and I can ask some of the other devs to take a look at a new PR) 🤔

(@soerenwolfers, thanks for the feedback 👌)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql Area: Polars SQL functionality enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants