Keep original order of rows for polars.cut() #4286
Let me see if I can help. Are you suggesting something like a `maintain_order` keyword? For demonstration, let's decorate the existing `polars.cut` function:
from typing import Optional
import polars as polars


def cut(
    s: polars.internals.series.Series,
    bins: list[float],
    labels: Optional[list[str]] = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    maintain_order: bool = False,
) -> polars.internals.frame.DataFrame:
    if maintain_order:
        # Remember the original position of each value before cut() sorts them.
        _arg_sort = polars.Series(name="_arg_sort", values=s.argsort())
    result = polars.cut(s, bins, labels, break_point_label, category_label)
    if maintain_order:
        # Attach the original positions, sort by them, then drop the helper column.
        result = (
            result
            .select([
                polars.all(),
                _arg_sort,
            ])
            .sort('_arg_sort')
            .drop('_arg_sort')
        )
    return result
Now, if we start with a series like this:
my_series = polars.Series(
    name="my_series",
    values=[4.0, 1, 3, 4, 4, 1],
)
my_series
We could maintain the original order of the Series with:
cut(my_series, [2, 4], maintain_order=True)
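To see why sorting by the saved argsort restores the original order, here is a minimal pure-Python sketch. It does not use polars: `cut_sorted` is a hypothetical stand-in for a cut that returns its rows sorted by value, and the right-closed bin convention is an assumption made for illustration.

```python
import bisect


def argsort(values):
    # Indices that would sort `values` (stable, like a typical argsort).
    return sorted(range(len(values)), key=values.__getitem__)


def cut_sorted(values, bins):
    # Hypothetical stand-in for a cut that returns rows sorted by value.
    # Each row is (value, bin index); bins are assumed right-closed,
    # so bisect_left gives the index of the interval containing v.
    return [(v, bisect.bisect_left(bins, v)) for v in sorted(values)]


values = [4.0, 1, 3, 4, 4, 1]
bins = [2, 4]

order = argsort(values)          # order[i] = original position of the i-th sorted row
sorted_rows = cut_sorted(values, bins)

# Pair every sorted row with its original position, then sort by that position.
restored = [row for _, row in sorted(zip(order, sorted_rows), key=lambda p: p[0])]
```

The wrapper above does the same thing with a temporary `_arg_sort` column, so the extra cost is one argsort plus one additional sort.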
I could see where the above would be helpful when the Series was derived from a large DataFrame. (Another workaround is to sort the original DataFrame by the series used in `cut`.) And for those who don't want the additional overhead of restoring the original order:
cut(my_series, [2, 4])
If the above is suitable, I would politely recommend that `maintain_order` default to `False`. Consider building a histogram:
(
    cut(my_series, [2, 4])
    .groupby('category')
    .count()
    .sort('category')
)
In the above case, restoring the original order does not help with the histogram, but represents a performance penalty.
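The histogram point can be checked with a small pure-Python sketch: per-bin counts are the same whether the rows arrive in original or sorted order. The `bin_index` helper and the right-closed bins are assumptions for illustration, not polars behavior.

```python
import bisect
from collections import Counter

bins = [2, 4]
values = [4.0, 1, 3, 4, 4, 1]


def bin_index(v):
    # Index of the (assumed right-closed) interval that contains v.
    return bisect.bisect_left(bins, v)


# Count rows per bin for the original order and for the sorted order.
counts_original = Counter(bin_index(v) for v in values)
counts_sorted = Counter(bin_index(v) for v in sorted(values))
```

Both counters are equal, so any work spent restoring row order before an aggregation like this is wasted.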
Thanks for the answer! Yes, my suggestion/feature request is to include an argument to `polars.cut()`. I agree that having […]
And well, even in this scenario the default option could still be […]. In any case, a keyword option to maintain order would be really useful.
When this is implemented, please let me know and I'll update the Rust version of `cut`.
I think we can now actually replace the Python function with your work. ;)
This is exactly what I was trying to achieve, and it took me some time before landing here.
However, doing this in my use case worked perfectly (yes, it's slower, but for my work that was OK). Maybe add a note to the documentation indicating that the function does not keep order (as beginners like me unfortunately assume)? Thanks for your work!
Just hit the same issue!
I would like to add that it would be nice to have an option to "autocut" as well, where we just say how many bins we want and the break points are chosen so that the counts are spread uniformly across the bins.
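One way to read the "autocut" idea is quantile-based binning. The sketch below is only an illustration of that reading, using the Python standard library rather than polars; `quantile_break_points` and `assign_bins` are hypothetical helper names, and the sample data is made up.

```python
import bisect
import statistics


def quantile_break_points(values, n_bins):
    # statistics.quantiles returns n_bins - 1 cut points that split the
    # data into groups of roughly equal count.
    return statistics.quantiles(values, n=n_bins, method="inclusive")


def assign_bins(values, break_points):
    # Bin index for each value, in the original row order
    # (bins are treated as right-closed).
    return [bisect.bisect_left(break_points, v) for v in values]


data = [4.0, 1, 3, 4, 4, 1, 7, 9]
breaks = quantile_break_points(data, 4)
bin_ids = assign_bins(data, breaks)
```

Because `assign_bins` walks the values in their given order, the result lines up row by row with the input, which is the behavior this issue asks for.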
I think aside from the main topic of having a feature to maintain order, it's also important to make clear in the docs that the current version does not. It's all too easy to assume wrong and get incorrect results without noticing, like I did.
Be aware of #7058
I just discovered that […]
It would be very useful to be able to keep the order of rows from the original series/column when using pl.cut()