Sort in Aggregation causes incorrect results #5314
Comments
As pointed out on Discord by ghuls, the line
There might still be an issue with the
In [53]: df = pl.DataFrame({"A": ["a", "a", "a", "b", "b", "a"], "B": [1, 2, 3, 4, 5, 6]})
In [54]: df
Out[54]:
shape: (6, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 6 │
└─────┴─────┘
In [55]: df.lazy().groupby("A").agg([
    ...:     pl.col("B").alias("b"),
    ...:     pl.col("B").sum().alias("b_sum"),
    ...:     pl.col("B").min().alias("b_min"),
    ...:     pl.col("B").max().alias("b_max"),
    ...:     pl.col("B").list().alias("b_list"),
    ...:     pl.col("B").list().sum().alias("b_list_sum"),
    ...:     pl.col("B").list().min().alias("b_list_min"),
    ...:     pl.col("B").list().max().alias("b_list_max"),
    ...: ]).collect()
Out[55]:
shape: (2, 9)
┌─────┬───────────────┬───────┬───────┬───────┬───────────────┬────────────┬────────────┬────────────┐
│ A ┆ b ┆ b_sum ┆ b_min ┆ b_max ┆ b_list ┆ b_list_sum ┆ b_list_min ┆ b_list_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════════╪═══════╪═══════╪═══════╪═══════════════╪════════════╪════════════╪════════════╡
│ b ┆ [4, 5] ┆ 9 ┆ 4 ┆ 5 ┆ [4, 5] ┆ 5 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ [1, 2, ... 6] ┆ 12 ┆ 1 ┆ 6 ┆ [1, 2, ... 6] ┆ 16 ┆ 1 ┆ 6 │
└─────┴───────────────┴───────┴───────┴───────┴───────────────┴────────────┴────────────┴────────────┘
In [56]: df.lazy().groupby("A").agg([
    ...:     pl.col("B").alias("b"),
    ...:     pl.col("B").sum().alias("b_sum"),
    ...:     pl.col("B").min().alias("b_min"),
    ...:     pl.col("B").max().alias("b_max"),
    ...:     pl.col("B").list().alias("b_list"),
    ...:     pl.col("B").list().sum().alias("b_list_sum"),
    ...:     pl.col("B").list().min().alias("b_list_min"),
    ...:     pl.col("B").list().max().alias("b_list_max"),
    ...: ]).collect(type_coercion=False, predicate_pushdown=False, projection_pushdown=False, no_optimization=True, slice_pushdown=False, common_subplan_elimination=False)
Out[56]:
shape: (2, 9)
┌─────┬───────────────┬───────┬───────┬───────┬───────────────┬────────────┬────────────┬────────────┐
│ A ┆ b ┆ b_sum ┆ b_min ┆ b_max ┆ b_list ┆ b_list_sum ┆ b_list_min ┆ b_list_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════════╪═══════╪═══════╪═══════╪═══════════════╪════════════╪════════════╪════════════╡
│ a ┆ [1, 2, ... 6] ┆ 12 ┆ 1 ┆ 6 ┆ [1, 2, ... 6] ┆ 11 ┆ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [4, 5] ┆ 9 ┆ 4 ┆ 5 ┆ [4, 5] ┆ 10 ┆ 4 ┆ 6 │
└─────┴───────────────┴───────┴───────┴───────┴───────────────┴────────────┴────────────┴────────────┘
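For comparison, here is a minimal plain-Python sketch of the per-group values one would expect for the data in In [53]. It assumes that the b_list_sum, b_list_min and b_list_max columns are meant to agree with b_sum, b_min and b_max; that expectation is my reading of the outputs above, not something the Polars documentation is quoted on here. The helper name reference_aggregates is hypothetical and is not part of Polars.

```python
from collections import defaultdict

# Same data as the DataFrame built in In [53].
A = ["a", "a", "a", "b", "b", "a"]
B = [1, 2, 3, 4, 5, 6]


def reference_aggregates(keys, values):
    """Group values by key (keeping input order) and compute sum/min/max per group."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {
        key: {"list": vals, "sum": sum(vals), "min": min(vals), "max": max(vals)}
        for key, vals in groups.items()
    }


reference = reference_aggregates(A, B)
for key in sorted(reference):
    print(key, reference[key])
# a {'list': [1, 2, 3, 6], 'sum': 12, 'min': 1, 'max': 6}
# b {'list': [4, 5], 'sum': 9, 'min': 4, 'max': 5}
```

These reference values match the b_sum, b_min and b_max columns in both Out[55] and Out[56], while the b_list_sum, b_list_min and b_list_max columns differ from them and from each other between the two runs.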
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
Using a sort expression in an aggregation leads to random and incorrect results.
This sounds similar to this issue, but as far as I can see the fix is not yet published on PyPI. Also, the fix is titled "fix[rust]: unset sorted flag on mutation", which intuitively does not seem to be the problem I've run into here, since in my case the dataframe being aggregated is not sorted.
Reproducible example
This results sometimes in
and sometimes in
Expected behavior
The result should be
Installed versions