feat: Add cum_count expression function #13478

Merged
merged 5 commits into from
Jan 8, 2024

Conversation

stinodego
Member

@stinodego stinodego commented Jan 6, 2024

Ref #13473
Ref #12420

When used without arguments, this is basically just syntactic sugar for int_range(1, count()+1, 1, dtype=IndexDtype).
I added this on the Rust side because we have no notion of IndexDtype in Python, and I think it's nice to have in Rust anyway.

With arguments, it's syntactic sugar for pl.col(...).cum_count(). It makes sense to have this alongside the other 'shortcut' functions we already have.
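
To make the no-argument form concrete, here is a minimal plain-Python sketch of what int_range(1, count()+1, 1) evaluates to for a frame with a given number of rows. It is only a model of the semantics described above, not the Polars implementation: row_numbers is a hypothetical helper name, and the unsigned index dtype is not modeled.

```python
def row_numbers(n_rows: int) -> list[int]:
    """Plain-Python stand-in for int_range(1, count() + 1, 1).

    Assumption: count() here means the number of rows in the frame.
    """
    # start=1, end=n_rows + 1 (exclusive), step=1
    return list(range(1, n_rows + 1, 1))


print(row_numbers(3))  # [1, 2, 3]
print(row_numbers(0))  # []
```

The exclusive upper bound is why the expression uses count()+1 rather than count().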

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Jan 6, 2024
@stinodego stinodego marked this pull request as ready for review January 6, 2024 06:49
@stinodego stinodego changed the title feat: Add cum_count expression feat: Add cum_count expression function Jan 6, 2024
@stinodego stinodego marked this pull request as draft January 6, 2024 07:48
@cmdlineluser
Contributor

Perhaps it could also be a good time to deprecate the reverse= named arg.

pl.cum_count(reverse=True)
pl.cum_count().reverse()

(Expr.reverse() did not exist when .cumcount() was initially added - hence the need for the named arg at the time.)

@stinodego
Member Author

stinodego commented Jan 6, 2024

> Perhaps it could also be a good time to deprecate the reverse= named arg.

I'm not entirely sure. All cum_* functions have a reverse option, might as well include it in this one.

While it is technically redundant, without it you have to write df.select(pl.col("a").reverse().cum_sum().reverse()), which feels bad.
For the cum_count function, a single reverse will do the trick. But let's be consistent and keep the reverse param for all cumulative functions.
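
A small plain-Python model of the point above, as I read it: for cum_sum, a trailing reverse alone gives a different answer than reverse=True, so you need the reverse/accumulate/reverse sandwich, whereas for the row-numbering cum_count a single reverse is enough. The helper names cum_sum and cum_count_rows are hypothetical stand-ins using Python lists, not Polars calls, and the reverse=True behaviour of the no-argument cum_count is assumed from the comment.

```python
from itertools import accumulate


def cum_sum(values, reverse=False):
    """Running sum; reverse=True accumulates from the last element backwards."""
    if reverse:
        # Equivalent of reverse().cum_sum().reverse()
        return list(accumulate(values[::-1]))[::-1]
    return list(accumulate(values))


def cum_count_rows(n_rows, reverse=False):
    """Model of the no-argument cum_count: 1-based row numbers."""
    numbers = list(range(1, n_rows + 1))
    return numbers[::-1] if reverse else numbers


a = [1, 2, 3]
print(cum_sum(a, reverse=True))          # [6, 5, 3]
print(cum_sum(a)[::-1])                  # [6, 3, 1]  (not the same thing)
print(cum_count_rows(3, reverse=True))   # [3, 2, 1]
print(cum_count_rows(3)[::-1])           # [3, 2, 1]  (a single reverse suffices)
```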

@stinodego stinodego marked this pull request as ready for review January 7, 2024 16:24
@ritchie46 ritchie46 merged commit 5e94252 into main Jan 8, 2024
26 checks passed
@ritchie46 ritchie46 deleted the cum-count branch January 8, 2024 07:15
@rben01
Contributor

rben01 commented Jan 12, 2024

This counts nulls as +1? Seems confusing when df.count() skips over nulls; I would expect the last element in the column represented by pl.col("x").cum_count() to be the same as df.count().get_column("x"). Like this:

┌──────┬───────────┬─────────┐
│ a    ┆ cum_count ┆ cum_len │
│ ---  ┆ ---       ┆ ---     │
│ i64  ┆ i64       ┆ i64     │
╞══════╪═══════════╪═════════╡
│ 1    ┆ 1         ┆ 1       │
│ null ┆ 1         ┆ 2       │
│ 3    ┆ 2         ┆ 3       │
└──────┴───────────┴─────────┘

Not sure if cum_len is a good name — other options include numbered, row_number, row_index1 (or just skip it altogether and use expr.with_row_index() + 1) — but I would definitely consider cum_count misleading.

@stinodego
Member Author

stinodego commented Jan 12, 2024

I don't know what you mean. It functions exactly as you say it does - we made it match df.count(). Both ignore null values now.

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, None, 6]})

print(df.count())
print(df.select(pl.col("a", "b").cum_count()))

shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 1   ┆ 1   │
│ 2   ┆ 1   │
│ 3   ┆ 2   │
└─────┴─────┘

@rben01
Contributor

rben01 commented Jan 12, 2024

Oh, sorry, never mind. I think I got confused by the int_range(1, count()+1, 1, dtype=IndexDtype) part unconditionally returning row numbers. (And by the fact that elsewhere in polars, such as group_by, count means number of rows, not number of non-nulls.)
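
For anyone else tripping over the same distinction, here is a plain-Python sketch of the two behaviours discussed in this thread: a running count of non-null values versus plain row numbers. It reproduces the cum_count and cum_len columns from the table earlier in the conversation. Both functions are illustrative models on Python lists with None standing in for null; cum_len is only the hypothetical name floated above, not an existing Polars function.

```python
def cum_count(values):
    """Running count of non-null values (None is treated as null)."""
    out, seen = [], 0
    for v in values:
        if v is not None:
            seen += 1
        out.append(seen)
    return out


def cum_len(values):
    """1-based row numbers, regardless of nulls (hypothetical name)."""
    return list(range(1, len(values) + 1))


a = [1, None, 3]
print(cum_count(a))  # [1, 1, 2]
print(cum_len(a))    # [1, 2, 3]
```

In this model the last element of cum_count equals the number of non-null values in the column, which is the consistency with df.count() asked for above.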

@c-peters c-peters added the accepted Ready for implementation label Jan 14, 2024