DataFrame.sort() raises PanicException: 'arg_sort operation not supported for dtype list[i64]' #10047

Closed
2 tasks done
vinhloc30796 opened this issue Jul 24, 2023 · 11 comments · Fixed by #20768
Labels
enhancement New feature or an improvement of an existing feature

Comments

@vinhloc30796

vinhloc30796 commented Jul 24, 2023

Problem description

Using df.sort() with a list[i64] column raises an error pointing to:

https://github.com/pola-rs/polars/blob/master/polars/polars-core/src/series/series_trait.rs#L390-L399

Also mentioned as a part of #7777

Thanks!

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

>>> import polars as pl
>>> df = pl.from_records(
...     [
...         {"a": []},
...         {"a": [1, 2]},
...         {"a": [1]},
...         {"a": [2, 3, 5]},
...         {"a": [0]},
...         {"a": [1, 1]},
...         {"a": [1, 0]},
...     ],
...     schema={"a": pl.List(inner=int)},
... )
>>> df
shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [1, 2]    │
│ [1]       │
│ [2, 3, 5] │
│ [0]       │
│ [1, 1]    │
│ [1, 0]    │
└───────────┘
>>> df.sort("a")
thread '<unnamed>' panicked at '`sort_with` operation not supported for dtype `list[i64]`', /home/runner/work/polars/polars/polars/polars-core/src/series/series_trait.rs:364:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vinhloc30796/.pyenv/versions/3.10.8/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3992, in sort
    self.lazy()
  File "/home/vinhloc30796/.pyenv/versions/3.10.8/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1508, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: `sort_with` operation not supported for dtype `list[i64]

Expected behavior

shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [0]       │
│ [1]       │
│ [1, 0]    │
│ [1, 1]    │
│ [1, 2]    │
│ [2, 3, 5] │
└───────────┘
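
The expected order above is plain lexicographic order, which happens to be how Python's built-in list comparison works. A minimal sketch in pure Python (no Polars involved, so it only models the requested ordering and says nothing about how Polars would implement it):

```python
# Rows from the reproducible example, in their original order.
rows = [[], [1, 2], [1], [2, 3, 5], [0], [1, 1], [1, 0]]

# Python compares lists element by element; a list that is a strict
# prefix of another sorts first, so [] < [1] < [1, 0].
expected = sorted(rows)

for row in expected:
    print(row)
```

The printed rows follow the same sequence as the "Expected behavior" table.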

@vinhloc30796 added the enhancement label on Jul 24, 2023

@sjt-motif

Just wanted to mention that fixing this would be very helpful for a use case that I have.

@cmdlineluser
Contributor

@sjt-motif You could convert it to a struct as a temporary workaround.

df.select(pl.col("a").sort_by(pl.col("a").list.to_struct("max_width")))

# shape: (7, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ []        │
# │ [0]       │
# │ [1]       │
# │ [1, 0]    │
# │ [1, 1]    │
# │ [1, 2]    │
# │ [2, 3, 5] │
# └───────────┘

@sjt-motif

@cmdlineluser Thanks! It works great except the descending=True version exhibits some weird behavior:

>>> df.select(pl.col("a").sort_by(pl.col("a").list.to_struct("max_width"), descending=True))

shape: (7, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [2, 3, 5] │
│ [1]       │
│ [1, 2]    │
│ [1, 1]    │
│ [1, 0]    │
│ [0]       │
└───────────┘

That's quite easy for me to work around though. Thanks!

@cmdlineluser
Contributor

Hm yeah, not sure why [1] comes first there, there's also no nulls_last= for .sort_by

I guess this is closer to what you want:

(df.with_columns(sort_by = pl.col("a").list.to_struct("max_width"))
   .sort("sort_by", descending=True, nulls_last=True))

# shape: (7, 2)
# ┌───────────┬──────────────────┐
# │ a         ┆ sort_by          │
# │ ---       ┆ ---              │
# │ list[i64] ┆ struct[3]        │
# ╞═══════════╪══════════════════╡
# │ [2, 3, 5] ┆ {2,3,5}          │
# │ [1, 2]    ┆ {1,2,null}       │
# │ [1, 1]    ┆ {1,1,null}       │
# │ [1, 0]    ┆ {1,0,null}       │
# │ [1]       ┆ {1,null,null}    │
# │ [0]       ┆ {0,null,null}    │
# │ []        ┆ {null,null,null} │
# └───────────┴──────────────────┘
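
Why the padded struct gives the right descending order can be modelled in plain Python. This is only a sketch of the idea (pad each list with None up to the longest length, then sort descending with missing values last); `pad` and `descending_nulls_last_key` are hypothetical helper names, not Polars functions:

```python
from typing import Optional

rows = [[], [1, 2], [1], [2, 3, 5], [0], [1, 1], [1, 0]]


def pad(row: list, width: int) -> list:
    """Pad a row with None up to `width`, like the null struct fields above."""
    return row + [None] * (width - len(row))


def descending_nulls_last_key(padded: list) -> tuple:
    # Each field becomes (is_null, -value): real values sort before None,
    # and among real values the larger one sorts first.
    return tuple((v is None, 0 if v is None else -v) for v in padded)


width = max(len(r) for r in rows)
ordered = sorted(rows, key=lambda r: descending_nulls_last_key(pad(r, width)))

for row in ordered:
    print(row)
```

The rows print in the same order as the `a` column in the output above.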

@vinhloc30796
Author

Thanks @cmdlineluser, I basically did that, but dumber (manually find the max length with agg.max(), then lambda s: list(s) + [None] * max_length, then finally explode the columns and sort by all).

Your way is wayyyy shorter lol.

@trinebrockhoff

If this one gets fixed, then assert_frame_equal will also work for pl.List cols:

import polars as pl
from polars.testing import assert_frame_equal

df1 = pl.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": ["H", "E", "L", "L", "O"],
    "C": [[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]],
})
df2 = df1.sort("B")

assert_frame_equal(df1, df2, check_row_order=False)

# ComputeError: cannot sort column of dtype `list[i64]`
# InvalidAssert: cannot set 'check_row_order=False' on frame with unsortable columns
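
Until list columns are sortable, one possible stopgap is to compare the rows as multisets outside of Polars. The sketch below is pure Python and assumes the rows have already been pulled out as tuples (for example via `df.rows()`); `freeze` and `rows_equal_ignoring_order` are hypothetical helper names, not part of polars.testing:

```python
from collections import Counter


def freeze(value):
    """Recursively turn lists and tuples into tuples so rows become hashable."""
    if isinstance(value, (list, tuple)):
        return tuple(freeze(v) for v in value)
    return value


def rows_equal_ignoring_order(left, right) -> bool:
    # Counter treats each side as a multiset, so duplicate rows still count.
    return Counter(freeze(r) for r in left) == Counter(freeze(r) for r in right)


# Same data as df1 above, written out as row tuples.
rows1 = list(zip([1, 2, 3, 4, 5], ["H", "E", "L", "L", "O"], [[0, 0]] * 5))
rows2 = sorted(rows1, key=lambda r: r[1])  # same rows, ordered by column "B"

print(rows_equal_ignoring_order(rows1, rows2))
```

This avoids sorting altogether, at the cost of materialising every row in Python, so it is only reasonable for small test frames.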

@cmdlineluser
Contributor

@trinebrockhoff Even though it's related, perhaps that deserves its own issue?

That particular use-case seems like something that warrants a higher priority.

I also didn't realize at the time of my previous comment that you can pass expressions directly to .sort() - so the .with_columns() wasn't actually needed.

df.sort(
    pl.col("a").list.to_struct("max_width"),
    descending=True,
    nulls_last=True,
)

# shape: (7, 1)
# ┌───────────┐
# │ a         │
# │ ---       │
# │ list[i64] │
# ╞═══════════╡
# │ [2, 3, 5] │
# │ [1, 2]    │
# │ [1, 1]    │
# │ [1, 0]    │
# │ [1]       │
# │ [0]       │
# │ []        │
# └───────────┘

@maxzw

maxzw commented Apr 15, 2024

It also doesn't seem to be supported for lists of structs:

schema = {"value": pl.List(pl.Struct({"a": pl.Int32}))}
data = {"value": [[{"a": 1}, {"a": 2}]]}
df = pl.DataFrame(data, schema=schema)
df.sort("value")

Resulting in

InvalidOperationError: `sort_with` operation not supported for dtype `list[struct[1]]`

It would be really nice if sort would support pl.List datatypes, especially since the assert_frame_equal function with check_row_order=False will fail now.
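
For lists of structs, one plausible ordering is to compare each struct by its field values in schema order, and then compare the lists lexicographically. Whether Polars would adopt exactly this order is an assumption; the sketch below is pure Python with made-up sample rows, and `struct_list_key` is a hypothetical helper:

```python
# Illustrative rows only; each inner dict stands in for a struct with one field.
rows = [
    [{"a": 2}],
    [{"a": 1}, {"a": 2}],
    [{"a": 1}],
]
fields = ["a"]  # assumed field order of the struct


def struct_list_key(row):
    # Dicts are not orderable in Python, so map every struct to a tuple of
    # its field values; the resulting list of tuples compares lexicographically.
    return [tuple(struct[f] for f in fields) for struct in row]


ordered = sorted(rows, key=struct_list_key)

for row in ordered:
    print(row)
```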

@cmdlineluser
Contributor

@maxzw Yeah, it seems that example also causes .group_by to panic.

df.group_by("value").all()
thread '<unnamed>' panicked at crates/polars-core/src/frame/group_by/into_groups.rs:296:52:
PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("cannot sort column of dtype `list[struct[1]]`"))

@failable

failable commented Jul 9, 2024

Similar issue?

polars.exceptions.InvalidOperationError: `arg_sort_multiple` operation not supported for dtype `list[str]`

@fbscarel

fbscarel commented Nov 2, 2024

This seems similar:

polars.exceptions.InvalidOperationError: cannot sort column of dtype `list[struct[7]]`

Input dataframe is like this, from .glimpse():

$ lines              <list[struct[7]]> [{'description': 'Parts and Supplies', 'unitAmount': 88.33, 'quantity': 1, 'taxRateRef': {'id': 'NON'}, 'inventoryRef': {'id': '75', 'name': 'Sales-Products-Hardware'}, 'id': '1', 'amount': 88.33}], (...)
