Keep original order of rows for polars.cut() #4286

Closed
tzeitim opened this issue Aug 6, 2022 · 10 comments · Fixed by #7723

Comments

@tzeitim

tzeitim commented Aug 6, 2022

It would be very useful to be able to keep the order of rows from the original series/column when using pl.cut()

@cbilot

cbilot commented Aug 6, 2022

Let me see if I can help. Are you suggesting something like a maintain_order keyword on cut?

For demonstration, let's decorate the existing polars.cut to add a maintain_order keyword:

from typing import Optional

import polars


def cut(
    s: polars.Series,
    bins: list[float],
    labels: Optional[list[str]] = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    maintain_order: bool = False,
) -> polars.DataFrame:
    # Wrapper around the existing polars.cut that can restore the input order.

    if maintain_order:
        # For each position in sorted order, argsort holds the original row index.
        _arg_sort = polars.Series(name="_arg_sort", values=s.argsort())

    # The existing polars.cut returns its rows sorted by the values of s.
    result = polars.cut(s, bins, labels, break_point_label, category_label)

    if maintain_order:
        # Attach the original row indices, sort by them, then drop the helper column.
        result = (
            result
            .select([
                polars.all(),
                _arg_sort,
            ])
            .sort('_arg_sort')
            .drop('_arg_sort')
        )

    return result

Now, if we start with a series like this:

my_series = polars.Series(
    name="my_series",
    values=[4.0, 1, 3, 4, 4, 1],
)
my_series
shape: (6,)
Series: 'my_series' [f64]
[
        4.0
        1.0
        3.0
        4.0
        4.0
        1.0
]

We could maintain the original order of the Series with:

cut(my_series, [2, 4], maintain_order=True)
shape: (6, 3)
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
└───────────┴─────────────┴─────────────┘
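To illustrate why attaching the argsort to the sorted output and then sorting by it restores the input order, here is a minimal plain-Python sketch. It does not call polars at all: `interval_label` is a hypothetical helper, and `sorted_rows` is only a stand-in for the sorted frame that cut returns.

```python
from bisect import bisect_left


def interval_label(value, bins):
    """Label the right-closed interval of `bins` that contains `value` (hypothetical helper)."""
    edges = [float("-inf"), *bins, float("inf")]
    i = bisect_left(bins, value)  # index of the first break point >= value
    return f"({edges[i]}, {edges[i + 1]}]"


values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
bins = [2.0, 4.0]

# arg_sort[i] is the original position of the i-th smallest value.
arg_sort = sorted(range(len(values)), key=values.__getitem__)

# Stand-in for the sorted output of cut: one (value, label) row per input row.
sorted_rows = [(values[j], interval_label(values[j], bins)) for j in arg_sort]

# Pair each sorted row with its original position, then sort by that position.
restored = [row for _, row in sorted(zip(arg_sort, sorted_rows), key=lambda p: p[0])]
```

After this, `restored` lists the rows in the same order as `values`, which is the property the `maintain_order=True` branch relies on.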

I can see the above being helpful when the Series is derived from a large DataFrame. If cut can restore the original order, then hstack can be used to add the categorical variable created by cut directly back to the original DataFrame.

(Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

And for those who don't want the additional overhead of restoring the original order:

cut(my_series, [2, 4])
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
└───────────┴─────────────┴─────────────┘

If the above is suitable, I would politely recommend that maintain_order=False be the default, because restoring the original order adds overhead. As an example, polars.cut is being used to create histograms for exploratory data analysis (see #4240).

(
    cut(my_series, [2, 4])
    .groupby('category')
    .count()
    .sort('category')
)
shape: (2, 2)
┌─────────────┬───────┐
│ category    ┆ count │
│ ---         ┆ ---   │
│ cat         ┆ u32   │
╞═════════════╪═══════╡
│ (-inf, 2.0] ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ (2.0, 4.0]  ┆ 4     │
└─────────────┴───────┘

In the above case, restoring the original order would not help with the histogram; it would only add a performance penalty.

@tzeitim
Author

tzeitim commented Aug 7, 2022

Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).

I agree that having maintain_order=False as the default makes sense, so that the most performant variant of the function comes first, especially when it is invoked as a standalone function (e.g. pl.cut()). I am less sure in the context of an expression (if cut ever gets to that level), e.g.:

df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))

And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.

In any case, a keyword option to maintain order would be really useful.
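As a rough sketch of what order-preserving binning could look like, the labels can be computed row by row with a binary search over the break points, so no sort is involved. This is plain Python, not polars code: `cut_in_order` is a hypothetical helper, and it assumes `bins` is already sorted in ascending order.

```python
from bisect import bisect_left


def cut_in_order(values, bins, labels=None):
    """Return one bin label per input value, in input order (hypothetical helper)."""
    if labels is None:
        edges = [float("-inf"), *bins, float("inf")]
        labels = [f"({lo}, {hi}]" for lo, hi in zip(edges, edges[1:])]
    if len(labels) != len(bins) + 1:
        raise ValueError("expected len(bins) + 1 labels")
    # bisect_left finds the first break point >= value, i.e. the right-closed bin.
    return [labels[bisect_left(bins, v)] for v in values]


column = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
categories = cut_in_order(column, [2.0, 4.0])
```

Because `categories` lines up with `column` element by element, it could be stacked next to the original data without any join or re-sort.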

@hpux735
Contributor

hpux735 commented Sep 4, 2022

When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

@ritchie46
Member

> When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

I think we can now actually replace the python function with your work. ;)

@PierreSnell

PierreSnell commented Nov 3, 2022

> Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).
>
> I agree that having maintain_order=False as a default makes sense in order to have the most performant variation of the function on top, especially when invoked as a standalone function (e.g. pl.cut()) but I am not so sure in the context of an expression (if at some point cut gets to that level), e.g :
>
> df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))
>
> And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.
>
> In any case, a keyword option to maintain order would be really useful.

This is exactly what I was trying to achieve, and it took me some time to land here.
I agree that it would be a really nice behavior, so we can "binarize" a column and stack the result.

> (Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

However, doing this worked perfectly in my use case (yes, it's slower, but that was fine for my work).

Maybe add a note to the documentation saying that the function does not keep the order (which beginners like me unfortunately assume)?

Thanks for your work!

@Hoeze

Hoeze commented Nov 12, 2022

Just hit the same issue! pl.col().cut() would be highly appreciated :)

@ArthurJ

ArthurJ commented Dec 4, 2022

I would like to add that it would be nice to have an option to "autocut" as well, where we just say how many bins we want and it decides the break points using a uniform distribution based on count.
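One possible reading of "autocut" is quantile-based break points, so that each bin ends up with roughly the same number of rows. The sketch below uses only the Python standard library; `auto_breaks` is a hypothetical name, not a polars function, and the choice of `statistics.quantiles` with its default method is an assumption.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles


def auto_breaks(values, n_bins):
    """Pick n_bins - 1 break points so each bin holds roughly the same count (hypothetical helper)."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    # quantiles returns the n - 1 cut points that split the data into n groups.
    return quantiles(values, n=n_bins)


data = list(range(1, 101))
breaks = auto_breaks(data, 4)

# Count how many values fall into each right-closed bin defined by the breaks.
counts = Counter(bisect_left(breaks, v) for v in data)
```

The resulting `breaks` could then be passed as the `bins` argument of a cut-style function.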

@a-reich

a-reich commented Jan 26, 2023

I think aside from the main topic of having a feature to maintain order, it’s also important to make clear in the docs that the current version does not. It’s all too easy to assume wrong and get incorrect results without noticing, like I did.

@tzeitim
Author

tzeitim commented Feb 28, 2023

Be aware of #7058

@s-banach
Contributor

I just discovered that maintain_order=False is the default.
If the order is not maintained, the only way to use the cut column is to join it back onto the original dataframe.
In that case, cut should only return unique rows.
The current behavior doesn't make any sense.
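To make the join-back argument concrete, here is a small plain-Python sketch in which the cut result holds one row per distinct value and is then joined back onto the original rows by key. The dictionary `lookup` is only a stand-in for such a de-duplicated result; nothing here is polars API.

```python
from bisect import bisect_left

values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
bins = [2.0, 4.0]
edges = [float("-inf"), *bins, float("inf")]

# One row per distinct value, as a sorted, de-duplicated cut result would look.
lookup = {}
for v in sorted(set(values)):
    i = bisect_left(bins, v)  # index of the first break point >= v
    lookup[v] = f"({edges[i]}, {edges[i + 1]}]"

# "Join" the categories back onto the original rows by key.
joined = [(v, lookup[v]) for v in values]
```

With unique keys the join yields exactly one category per original row; with duplicated keys in the lookup side, a real join would multiply rows instead.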
