Keep original order of rows for polars.cut() #4286

tzeitim · 2022-08-06T07:38:56Z

It would be very useful to be able to keep the order of rows from the original series/column when using pl.cut()

cbilot · 2022-08-06T23:19:14Z

Let me see if I can help. Are you suggesting something like a maintain_order keyword on cut?

For demonstration, let's decorate the existing polars.cut to add a maintain_order keyword:

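from typing import Optional

import polars


def cut(
    s: polars.Series,
    bins: list[float],
    labels: Optional[list[str]] = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    maintain_order: bool = False,
) -> polars.DataFrame:
    # Wrapper around the existing polars.cut that can restore the input order.
    if maintain_order:
        # argsort() gives, for each position in sorted order, the index that
        # value had in the original Series.
        _arg_sort = polars.Series(name="_arg_sort", values=s.argsort())

    # polars.cut returns its rows sorted by the values of `s`.
    result = polars.cut(s, bins, labels, break_point_label, category_label)

    if maintain_order:
        # Attach the original indices, sort on them to undo the reordering,
        # then drop the helper column.
        result = (
            result
            .select([
                polars.all(),
                _arg_sort,
            ])
            .sort('_arg_sort')
            .drop('_arg_sort')
        )

    return result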

Now, if we start with a series like this:

my_series = polars.Series(
    name="my_series",
    values=[4.0, 1, 3, 4, 4, 1],
)
my_series

shape: (6,)
Series: 'my_series' [f64]
[
        4.0
        1.0
        3.0
        4.0
        4.0
        1.0
]

We could maintain the original order of the Series with:

>>> cut(my_series, [2, 4], maintain_order=True)
shape: (6, 3)
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
└───────────┴─────────────┴─────────────┘

I could see where the above would be helpful when the Series was derived from a large DataFrame. If cut can restore the original order, then hstack can be used to add the categorical variable created by cut directly back to the original DataFrame.
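To make the arg-sort trick easier to follow outside of polars, here is a minimal pure-Python sketch of the same idea. It is only an illustration: `interval_label`, `cut_sorted`, and `cut_maintain_order` are hypothetical helpers written for this example, not polars functions, and `cut_sorted` merely imitates a cut that hands back its rows sorted by value.

```python
from bisect import bisect_left


def interval_label(value, bins):
    """Return the right-closed interval of `bins` that contains `value`."""
    edges = [float("-inf"), *bins, float("inf")]
    i = bisect_left(bins, value)  # index of the first break point >= value
    return f"({edges[i]}, {edges[i + 1]}]"


def cut_sorted(values, bins):
    """Stand-in (assumption) for a cut that returns rows sorted by value."""
    return [(v, interval_label(v, bins)) for v in sorted(values)]


def cut_maintain_order(values, bins):
    """Restore the input order by sorting on the arg-sort indices."""
    # arg_sort[i] is the original position of the i-th smallest value.
    arg_sort = sorted(range(len(values)), key=values.__getitem__)
    rows = cut_sorted(values, bins)
    # Pair each sorted row with its original position, then sort on it.
    restored = sorted(zip(arg_sort, rows))
    return [row for _, row in restored]


values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
for value, category in cut_maintain_order(values, [2.0, 4.0]):
    print(value, category)
```

Because the restored rows line up one-to-one with the input, they can be stacked next to the original columns by position.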

(Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

And for those who don't want the additional overhead of restoring the original order:

cut(my_series, [2, 4])

┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
└───────────┴─────────────┴─────────────┘

If the above is suitable, I would politely recommend that maintain_order=False be the default, due to the additional overhead of restoring the original order to the data. As an example, polars.cut is being used to create histograms for exploratory data analysis #4240.

(
    cut(my_series, [2, 4])
    .groupby('category')
    .count()
    .sort('category')
)

shape: (2, 2)
┌─────────────┬───────┐
│ category    ┆ count │
│ ---         ┆ ---   │
│ cat         ┆ u32   │
╞═════════════╪═══════╡
│ (-inf, 2.0] ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ (2.0, 4.0]  ┆ 4     │
└─────────────┴───────┘

In the above case, restoring the original order does nothing for the histogram and only adds a performance penalty.
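The point that a histogram does not depend on row order can be checked with a small pure-Python sketch. This is not the polars API; `interval_label` and `histogram` are hypothetical helpers that only mimic right-closed binning and a per-category count.

```python
from bisect import bisect_left
from collections import Counter


def interval_label(value, bins):
    """Return the right-closed interval of `bins` that contains `value`."""
    edges = [float("-inf"), *bins, float("inf")]
    i = bisect_left(bins, value)  # index of the first break point >= value
    return f"({edges[i]}, {edges[i + 1]}]"


def histogram(values, bins):
    """Count how many values fall into each interval."""
    return Counter(interval_label(v, bins) for v in values)


original = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
counts_original = histogram(original, [2.0, 4.0])
counts_sorted = histogram(sorted(original), [2.0, 4.0])

# The counts are identical whether or not the rows were reordered first.
print(counts_original == counts_sorted)
```

Since the counts are the same either way, any work spent restoring the order is wasted for this use case.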

tzeitim · 2022-08-07T09:11:33Z

Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).

I agree that having maintain_order=False as the default makes sense, so that the most performant variant of the function comes first, especially when it is invoked as a standalone function (e.g. pl.cut()). I am not so sure in the context of an expression (if at some point cut gets to that level), e.g.:

df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))

And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.

In any case, a keyword option to maintain order would be really useful.

hpux735 · 2022-09-04T15:19:00Z

When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

ritchie46 · 2022-09-04T18:50:47Z

> When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

I think we can now actually replace the python function with your work. ;)

PierreSnell · 2022-11-03T14:52:44Z

> Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).
>
> I agree that having maintain_order=False as a default makes sense in order to have the most performant variation of the function on top, especially when invoked as a standalone function (e.g. pl.cut()) but I am not so sure in the context of an expression (if at some point cut gets to that level), e.g :
>
> df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))
>
> And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.
>
> In any case, a keyword option to maintain order would be really useful.

This is exactly what I was trying to achieve, and it took me some time before I landed here.
I agree that it would be really nice behavior, so that we can "binarize" a column and stack the result.

> (Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

However, doing this in my use case worked perfectly (yes, it's slower, but for my work that was OK).

Maybe the documentation could include a note that the function does not keep the original order (something beginners like me unfortunately assume)?

Thanks for your work!

Hoeze · 2022-11-12T00:17:29Z

Just hit the same issue! pl.col().cut() would be highly appreciated :)

ArthurJ · 2022-12-04T11:42:35Z

I would like to add that it would be nice to have an option to "autocut" as well, where we just say how many bins we want and it decides the break points using a uniform distribution based on count.
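One way to read the "autocut" idea is to place the break points at equally spaced quantiles, so that each bin receives roughly the same number of rows. The sketch below uses only the Python standard library; `auto_break_points` is a hypothetical helper name, and nothing here is an existing polars option.

```python
from bisect import bisect_left
from statistics import quantiles


def auto_break_points(values, n_bins):
    """Hypothetical helper: n_bins - 1 break points at equally spaced quantiles."""
    return quantiles(values, n=n_bins)


values = [float(v) for v in range(1, 101)]
break_points = auto_break_points(values, 4)

# Count how many values land in each right-closed bin.
counts = [0] * 4
for v in values:
    counts[bisect_left(break_points, v)] += 1

print(break_points)
print(counts)
```

The resulting break points could then be passed as the bins argument of an ordinary cut.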

a-reich · 2023-01-26T03:51:03Z

I think aside from the main topic of having a feature to maintain order, it’s also important to make clear in the docs that the current version does not. It’s all too easy to assume wrong and get incorrect results without noticing, like I did.
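To illustrate how quietly this goes wrong, here is a small pure-Python sketch. It is an assumption-laden stand-in, not polars code: `label` hard-codes the break point at 2.0 from the earlier example, and the sorted list imitates a cut that returns its rows ordered by value.

```python
values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]


def label(v):
    """Simplified two-bin labelling for the example data (break point 2.0)."""
    return "(-inf, 2.0]" if v <= 2.0 else "(2.0, 4.0]"


# Stand-in for a cut that returns its rows sorted by value.
sorted_labels = [label(v) for v in sorted(values)]

# Positional stacking pairs row i of the original with row i of the result.
stacked = list(zip(values, sorted_labels))

# Rows whose stacked label differs from the label they should have received.
mismatched = [(v, lab) for v, lab in stacked if label(v) != lab]
print(mismatched)
```

No error is raised: the stacked column has the right length and dtype, and only some rows carry the wrong category.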

tzeitim · 2023-02-28T12:07:27Z

Be aware of #7058

s-banach · 2023-06-26T20:01:44Z

I just discovered that maintain_order=False is the default.
If the order is not maintained, the only way to use the cut column is to join it back onto the original dataframe.
In that case, cut should only return unique rows.
The current behavior doesn't make any sense.
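The join-back argument can be sketched in plain Python as follows. This only illustrates the reasoning: a dict plays the role of the unique-rows lookup table, and `interval_label` is a hypothetical helper, not a polars function.

```python
from bisect import bisect_left


def interval_label(value, bins):
    """Return the right-closed interval of `bins` that contains `value`."""
    edges = [float("-inf"), *bins, float("inf")]
    i = bisect_left(bins, value)  # index of the first break point >= value
    return f"({edges[i]}, {edges[i + 1]}]"


values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
bins = [2.0, 4.0]

# One entry per distinct value: the "unique rows" a join would need.
lookup = {v: interval_label(v, bins) for v in set(values)}

# Joining on the value keeps the original row order without any sorting.
joined = [(v, lookup[v]) for v in values]
print(joined)
```

With a join, duplicate rows in the cut result add nothing, since one row per distinct value already determines the category.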

zundertj added the feature label Aug 8, 2022

tzeitim added a commit to tzeitim/ogtk that referenced this issue Nov 24, 2022

included polars.cut with maintain order from pola-rs/polars#4286 (com…

142dca6

…ment)

ritchie46 mentioned this issue Mar 23, 2023

feat(rust, python): add maintain_order option to Series.cut #7723

Merged

ritchie46 closed this as completed in #7723 Mar 23, 2023
