Sort in Aggregation causes incorrect results #5314

Closed
2 tasks done
syntonym opened this issue Oct 24, 2022 · 2 comments · Fixed by #5317
Labels
bug (Something isn't working), python (Related to Python Polars)


syntonym commented Oct 24, 2022

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Using a sort expression in an aggregation leads to random and incorrect results.

This sounds similar to this issue, but as far as I can see the fix is not yet published on PyPI. Also, the fix is titled fix[rust]: unset sorted flag on mutation, which intuitively does not seem to be the problem I've run into here, since in my case the dataframe being aggregated is not sorted.

Reproducible example

import polars as pl

df = pl.DataFrame({"A": ["a", "a", "a", "b", "b", "a"], "B": [1, 2, 3, 4, 5, 6]})
df.groupby("A").agg(pl.col("B").list().sort(reverse=True))

This sometimes results in

A B
a [5, 3, 2, 1]
b [6, 4]

and sometimes in

A B
b [3, 2]
a [6, 5, 4, 1]

Expected behavior

The result should be

A B
b [5, 4]
a [6, 3, 2, 1]
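
The expected result can be cross-checked without Polars. The following is a minimal pure-Python sketch, not part of the original report; the helper name `expected_groups` is made up for illustration. It groups the values of B by the key in A and sorts each group in descending order, which is what the aggregation is expected to do.

```python
from collections import defaultdict

# Same data as in the reproducible example.
A = ["a", "a", "a", "b", "b", "a"]
B = [1, 2, 3, 4, 5, 6]


def expected_groups(keys, values):
    """Group values by key, then sort each group in descending order.

    Hypothetical helper for illustration only; it does not use Polars.
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: sorted(vals, reverse=True) for key, vals in groups.items()}


expected = expected_groups(A, B)
print(expected)  # {'a': [6, 3, 2, 1], 'b': [5, 4]}
```

Row order between groups is not significant here; only the contents of each group's list matter.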

Installed versions

---Version info---
Polars: 0.14.22
Index type: UInt32
Platform: Linux-5.19.12-arch1-1-x86_64-with-glibc2.36
Python: 3.10.7 (main, Sep  6 2022, 21:22:27) [GCC 12.2.0]
---Optional dependencies---
pyarrow: 6.0.1
pandas: 1.4.1
numpy: 1.22.4
fsspec: 2022.5.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.5.1
syntonym added the bug and python labels Oct 24, 2022
@syntonym
Author

As pointed out on Discord by ghuls, the line df.groupby("A").agg(pl.col("B").sort(reverse=True)) produces the correct result.

@ghuls
Collaborator

ghuls commented Oct 24, 2022

There might still be an issue with the list() operator anyway. Also, it produces different results depending on which optimizations are enabled.

In [53]: df = pl.DataFrame({"A": ["a", "a", "a", "b", "b", "a"], "B": [1, 2, 3, 4, 5, 6]})

In [54]: df
Out[54]: 
shape: (6, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a   ┆ 6   │
└─────┴─────┘

In [55]: df.lazy().groupby("A").agg([
    ...:     pl.col("B").alias("b"),
    ...:     pl.col("B").sum().alias("b_sum"),
    ...:     pl.col("B").min().alias("b_min"),
    ...:     pl.col("B").max().alias("b_max"),
    ...:     pl.col("B").list().alias("b_list"),
    ...:     pl.col("B").list().sum().alias("b_list_sum"),
    ...:     pl.col("B").list().min().alias("b_list_min"),
    ...:     pl.col("B").list().max().alias("b_list_max"),
    ...: ]).collect()
Out[55]: 
shape: (2, 9)
┌─────┬───────────────┬───────┬───────┬───────┬───────────────┬────────────┬────────────┬────────────┐
│ A   ┆ b             ┆ b_sum ┆ b_min ┆ b_max ┆ b_list        ┆ b_list_sum ┆ b_list_min ┆ b_list_max │
│ --- ┆ ---           ┆ ---   ┆ ---   ┆ ---   ┆ ---           ┆ ---        ┆ ---        ┆ ---        │
│ str ┆ list[i64]     ┆ i64   ┆ i64   ┆ i64   ┆ list[i64]     ┆ i64        ┆ i64        ┆ i64        │
╞═════╪═══════════════╪═══════╪═══════╪═══════╪═══════════════╪════════════╪════════════╪════════════╡
│ b   ┆ [4, 5]        ┆ 9     ┆ 4     ┆ 5     ┆ [4, 5]        ┆ 5          ┆ 2          ┆ 3          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a   ┆ [1, 2, ... 6] ┆ 12    ┆ 1     ┆ 6     ┆ [1, 2, ... 6] ┆ 16         ┆ 1          ┆ 6          │
└─────┴───────────────┴───────┴───────┴───────┴───────────────┴────────────┴────────────┴────────────┘

In [56]: df.lazy().groupby("A").agg([
    ...:     pl.col("B").alias("b"),
    ...:     pl.col("B").sum().alias("b_sum"),
    ...:     pl.col("B").min().alias("b_min"),
    ...:     pl.col("B").max().alias("b_max"),
    ...:     pl.col("B").list().alias("b_list"),
    ...:     pl.col("B").list().sum().alias("b_list_sum"),
    ...:     pl.col("B").list().min().alias("b_list_min"),
    ...:     pl.col("B").list().max().alias("b_list_max"),
    ...: ]).collect(type_coercion=False, predicate_pushdown=False, projection_pushdown=False, no_optimization=True, slice_pushdown=False, common_subplan_elimination=False)
Out[56]: 
shape: (2, 9)
┌─────┬───────────────┬───────┬───────┬───────┬───────────────┬────────────┬────────────┬────────────┐
│ A   ┆ b             ┆ b_sum ┆ b_min ┆ b_max ┆ b_list        ┆ b_list_sum ┆ b_list_min ┆ b_list_max │
│ --- ┆ ---           ┆ ---   ┆ ---   ┆ ---   ┆ ---           ┆ ---        ┆ ---        ┆ ---        │
│ str ┆ list[i64]     ┆ i64   ┆ i64   ┆ i64   ┆ list[i64]     ┆ i64        ┆ i64        ┆ i64        │
╞═════╪═══════════════╪═══════╪═══════╪═══════╪═══════════════╪════════════╪════════════╪════════════╡
│ a   ┆ [1, 2, ... 6] ┆ 12    ┆ 1     ┆ 6     ┆ [1, 2, ... 6] ┆ 11         ┆ 1          ┆ 5          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ [4, 5]        ┆ 9     ┆ 4     ┆ 5     ┆ [4, 5]        ┆ 10         ┆ 4          ┆ 6          │
└─────┴───────────────┴───────┴───────┴───────┴───────────────┴────────────┴────────────┴────────────┘
