
Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) #12420

Closed
mcrumiller opened this issue Nov 13, 2023 · 31 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@mcrumiller
Contributor

mcrumiller commented Nov 13, 2023

Description

Having a quick expression to grab an integer-based row number is often requested and would be pretty useful. Instead of the usual options:

df.with_row_count("#").select(pl.col("#") < ...).drop("#")  # or:
df.select(pl.int_range(0, pl.count()) < ...)

This would simplify to:

df.select(pl.row_num() < ...)
@mcrumiller mcrumiller added the enhancement New feature or an improvement of an existing feature label Nov 13, 2023
@mcrumiller mcrumiller changed the title Add pl.row() as syntactic sugar for pl.int_range(0, pl.count()) Add pl.row_nbr() as syntactic sugar for pl.int_range(0, pl.count()) Nov 13, 2023
@lyngc

lyngc commented Nov 13, 2023

Prefer pl.row_num() if anything. nbr is not an intuitive abbreviation in this case imo.

@mcrumiller mcrumiller changed the title Add pl.row_nbr() as syntactic sugar for pl.int_range(0, pl.count()) Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) Nov 13, 2023
@cmdlineluser
Contributor

A somewhat related suggestion was for pl.cumcount() to be a top-level function. #7120 (although that is unsigned int)

In trying to see what this is called elsewhere, it seems row_number() is commonly used (SQL, Spark, TidyVerse, ...?) - perhaps it makes sense to line up with those.

@stinodego
Member

stinodego commented Nov 14, 2023

There already is DataFrame.with_row_count:
https://pola-rs.github.io/polars/docs/python/dev/reference/dataframe/api/polars.DataFrame.with_row_count.html

Adding this new expression shortcut seems fine to me. It should either be called pl.row_count() to align with the existing DataFrame method, or both should be renamed to row_number. My preference would be row_number.

The resulting data type should be UInt32, or UInt64 in bigidx Polars.
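
For reference, the unsigned index can already be spelled out today. A minimal sketch, assuming the dtype parameter of int_range is available in the installed version (pl.count() is the spelling at this point in the thread; newer Polars uses pl.len()):

import polars as pl

df = pl.DataFrame({"a": ["x", "y", "z"]})

# Unsigned row index via int_range (UInt32 here; a bigidx build would use UInt64):
df.select(
    pl.int_range(0, pl.count(), dtype=pl.UInt32).alias("row_nr"),
    pl.all(),
)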

@stinodego stinodego added the accepted Ready for implementation label Nov 14, 2023
@deanm0000
Collaborator

I think row_number is "better" but I don't think it's so much better that it's worth renaming with_row_count.

What about naming the expression row_number and deprecating with_row_count? I realize deprecating it is more severe than just renaming it. Personally, I've never wanted with_row_count on a standalone basis. I only ever use it as a precursor to something else in a with_columns. If I can get row_number directly in the context then I, for one, would never use with_row_count.

Apart from how I use it, it does seem odd that it's the only with_* to supplement with_columns. There's no with_summary or with_shifted or anything else, so it may just be good to retire it as out of place.

@cmdlineluser
Contributor

@deanm0000

I think someone pointed out in a previous discussion that with_row_count is useful because it inserts the column at a specified position.

If it were only an expression, you would need to use .select():

df.select(pl.row_number(), pl.all())

(Or, at least, that's what I think the suggestion was.)
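
To make the placement point concrete, a small sketch (pl.row_number() below is the proposed function, not an existing one):

import polars as pl

df = pl.DataFrame({"a": [10, 20], "b": [30, 40]})

# with_row_count puts the index column at the front of the frame:
df.with_row_count("row_nr")  # columns: row_nr, a, b

# An expression-only version would need select() to control placement:
# df.select(pl.row_number(), pl.all())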

@stinodego
Member

stinodego commented Nov 14, 2023

Keep in mind that we don't need any of these methods - they are simply for convenience.

With that in mind, the DataFrame method makes sense because row numbers make a lot of sense in a DataFrame context - people trying to mimic indexes.

I think it's fine to have both.

@deanm0000
Collaborator

Wait a minute. Shouldn't this just be adding pl.cumcount()? Is there any different behavior between, let's say, df.with_columns(i=pl.first().cumcount()) and df.with_columns(i=pl.int_range(0, pl.count()))?

It seems they do the same thing even if pl.first() has some nulls. Of course, if we had pl.cumcount() then having nulls in any column would be moot anyway.

Also, if I do timeit on a 1M row df, then cumcount takes 272 µs ± 30.8 µs per loop whereas int_range takes 1.63 ms ± 30.5 µs per loop.
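
A rough sketch of how that comparison might be reproduced, using the function names current at this point in the thread (cumcount and count were later renamed to cum_count and len):

import timeit
import polars as pl

df = pl.DataFrame({"a": list(range(1_000_000))})

t_cum = timeit.timeit(lambda: df.with_columns(i=pl.first().cumcount()), number=100)
t_rng = timeit.timeit(lambda: df.with_columns(i=pl.int_range(0, pl.count())), number=100)
print(f"cumcount:  {t_cum / 100 * 1e3:.3f} ms/loop")
print(f"int_range: {t_rng / 100 * 1e3:.3f} ms/loop")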

@stinodego stinodego added this to the 1.0.0 milestone Nov 14, 2023
@cmdlineluser
Contributor

They differ in types, .cumcount() is unsigned (but so is .with_row_count())

pl.select(pl.lit('foo').cumcount()).schema
# OrderedDict([('literal', UInt32)])
pl.select(pl.int_range(0, 1)).schema
# OrderedDict([('int', Int64)])
pl.select(pl.lit('foo')).with_row_count().schema
# OrderedDict([('row_nr', UInt32), ('literal', Utf8)])

I guess the idea of wanting Int64 is so you don't have to cast if you want to use it e.g.

.rolling(index_column = pl.foo(), ...

versus

.rolling(index_column = pl.foo().cast(int), ...
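
To make the casting point concrete, a minimal sketch of a rolling aggregation over a row count, assuming rolling wants a signed integer (or temporal) index column, as the .cast(int) above suggests; the column name idx is only for illustration:

import polars as pl

df = pl.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]})

(
    df.with_row_count("idx")                                   # idx: UInt32
    .with_columns(pl.col("idx").cast(pl.Int64).set_sorted())   # cast so rolling accepts it
    .rolling(index_column="idx", period="2i")                  # "2i" = window of 2 rows
    .agg(pl.col("value").sum())
)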

@Wainberg
Contributor

Wainberg commented Jan 6, 2024

+1 for having pl.row_number() and df.with_row_number(), each of which creates a row_number column. Although having pl.row_count() and df.with_row_count() create a row_count column also works, it's most important for all three names to match.

@nameexhaustion
Collaborator

Small nit - if it's called "row number", should it start counting from 1 by default like SQL does instead of 0?

@stinodego
Member

stinodego commented Jan 7, 2024

Small nit - if it's called "row number", should it start counting from 1 by default like SQL does instead of 0?

I considered this but I don't think so. We start counting at 0, right? I find the SQL behavior here strange.

The cum_count function, on the other hand, should start counting at 1.

@lyngc

lyngc commented Jan 7, 2024

I would expect row_number() to behave the same as it does in SQL, so start counting from 1.

@mcrumiller
Contributor Author

If that's the case then I think I would prefer row_index. Having zero-indexed rows is the most common use case.

@stinodego
Member

stinodego commented Jan 7, 2024

I also considered row_index. Maybe that's better indeed...

@orlp
Collaborator

orlp commented Jan 7, 2024

My vote goes to pl.row_index(), starting counting at zero (at least by default). I would also be in support of a rename of with_row_count to with_row_index for consistency.

@mcrumiller
Contributor Author

mcrumiller commented Jan 7, 2024

Would it be too much clutter to include both? A with_row_count for those who want to match SQL, and a with_row_index for those who want zero-indexed behavior?

An alternative is with_row_index(start=0): a single function, 0-indexed by default, that allows easy arbitrary start positions.

Come to think of it, this would be pretty useful in general: if you're building up a series of frames and want the next frame to start at the next index, you could begin at whatever start you want.

@orlp
Collaborator

orlp commented Jan 7, 2024

@mcrumiller with_row_count/with_row_index already support an offset parameter. You can pass offset = 1 and get SQL's behavior.

Thinking about it some more I don't think it'd really be necessary for pl.row_index() since you can just write pl.row_index() + offset.
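
For the "continue where the previous frame left off" case, a rough sketch using the existing offset parameter (the column name idx is only for illustration):

import polars as pl

df1 = pl.DataFrame({"a": [1, 2, 3]})
df2 = pl.DataFrame({"a": [4, 5]})

# Start the second frame's index where the first one ends:
out1 = df1.with_row_count("idx")
out2 = df2.with_row_count("idx", offset=out1.height)
print(pl.concat([out1, out2]))  # idx runs 0..4 across both frames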

@ritchie46
Member


They differ in types, .cumcount() is unsigned (but so is .with_row_count())

pl.select(pl.lit('foo').cumcount()).schema
# OrderedDict([('literal', UInt32)])
pl.select(pl.int_range(0, 1)).schema
# OrderedDict([('int', Int64)])
pl.select(pl.lit('foo')).with_row_count().schema
# OrderedDict([('row_nr', UInt32), ('literal', Utf8)])

I guess the idea of wanting Int64 is so you don't have to cast if you want to use it e.g.

.rolling(index_column = pl.foo(), ...

versus

.rolling(index_column = pl.foo().cast(int), ...

We can accept a dtype in the expression.

I'd like to slow this down a little bit. I don't think we should add an expression for this. We have ranges, a well known construct in many programming languages.

@Wainberg
Contributor

Wainberg commented Jan 7, 2024

We have ranges, a well known construct in many programming languages.

Yes, though I'd argue pl.int_range(0, pl.count()).alias('row_index') is quite a mouthful compared to pl.row_index() for such a common operation!

In practice people would probably just use with_row_count() or whatever it gets renamed to. Though by your argument, shouldn't with_row_count() also be deprecated? After all, we have ranges ;)

@mcrumiller
Contributor Author

mcrumiller commented Jan 7, 2024

Though by your argument, shouldn't with_row_count() also be deprecated? After all, we have ranges ;)

This doesn't really help when trying to bring someone to your point of view....

That being said, I agree that this whole conversation is getting muddled. I think many of us agree that being able to quickly index columns and frames by integer row number is very useful and common enough that it warrants some function. The cleanest solution to me is a single expression pl.row_index(), and alongside that a single DataFrame method df.with_row_index(). I even think pl.index() alone would be nice, but it may cause confusion with pandas indexes.

# default input args shown for clarity
df.select(pl.row_index(start=0, dtype=pl.UInt32))
df.with_row_index(start=0, dtype=pl.UInt32)

@stinodego
Member

I also think it's worth adding a pl.index() function, but if Ritchie isn't on board, we don't need to rush this. I'll add a note to the int_range docs with an example on how to create an index column.

@stinodego stinodego closed this as not planned Jan 8, 2024
@stinodego stinodego removed the accepted Ready for implementation label Jan 8, 2024
@stinodego
Member

stinodego commented Jan 8, 2024

Note that in the next release, you will be able to write pl.int_range(pl.len()) - a shorter syntax omitting the zero (#13530).
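
A quick illustration of the shorter form (requires the release referenced above, where pl.len() and single-argument int_range are available):

import polars as pl

df = pl.DataFrame({"a": ["x", "y", "z"]})

# An index column with the shorter syntax:
df.with_columns(index=pl.int_range(pl.len()))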

@mcrumiller
Contributor Author

I know this is closed, but I really think a pl.row_index() expression would save a lot of grief when people want to deal with row numbers, over having to use pl.int_range(pl.len()). It's a lot clearer what the intent is. I can't tell from the above comments if it was firmly decided "no" we don't want that, or if we wanted to discuss more.

@stinodego
Member

As per Ritchie's comment:

I don't think we should add an expression for this. We have ranges, a well known construct in many programming languages.

So it is marked as "not planned". We may change our tune in the future.

@mcrumiller
Contributor Author

Use pl.int_range(pl.len()):

import polars as pl

df = pl.DataFrame({
    "group": [1, 1, 1, 2, 2],
    "value": [1, None, 2, 3, 4],
})

df.sort('value', descending=True).with_columns(pl.int_range(1, pl.len()+1).over('group').alias('nth'))
shape: (5, 3)
┌───────┬───────┬─────┐
│ group ┆ value ┆ nth │
│ ---   ┆ ---   ┆ --- │
│ i64   ┆ i64   ┆ i64 │
╞═══════╪═══════╪═════╡
│ 1     ┆ null  ┆ 1   │
│ 2     ┆ 4     ┆ 1   │
│ 2     ┆ 3     ┆ 2   │
│ 1     ┆ 2     ┆ 2   │
│ 1     ┆ 1     ┆ 3   │
└───────┴───────┴─────┘

@mkleinbort-ic

Sorry I deleted my comment, I realized I could just use .rank()

@mcrumiller
Contributor Author

Careful though, rank won't give you a rank for nulls.
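
A tiny illustration of that null behaviour:

import polars as pl

# rank() leaves nulls as null, so they never receive a position:
print(pl.Series("value", [3, None, 1]).rank("ordinal"))
# [2, null, 1] -- the null element stays unranked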

@mkleinbort-ic

Thank you.

In my case I don't have nulls so this will be ok.

Can't say I love pl.int_range(1, pl.len()+1), but this is a closed issue and thus not a place to discuss.

@mcrumiller
Contributor Author

@mkleinbort-ic I work around this now by always defining a row_index expression:

import polars as pl

def row_index():
    return pl.int_range(1, pl.len()+1)

df = pl.DataFrame({
    "group": [1, 1, 1, 2, 2],
    "value": [1, None, 2, 3, 4],
})
df.sort('value', descending=True).with_columns(row_index().over('group').alias('nth'))

Doesn't have to be a function but it's more obvious that it's computing something.

@Oreilles

I also really think that pl.row_index() would be a valuable addition to polars.

Using pl.int_range(pl.len()) is more verbose and does not reflect the intent of pl.row_index().

Furthermore, int_range returns pl.Int64 whereas with_row_index() returns pl.UInt32, meaning the two are not even compatible by default if you were to use them as a join key, which is the primary use case for a row index.
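
A small sketch of that dtype mismatch when the two are used as join keys (behaviour varies by Polars version; some versions raise on mismatched key dtypes, so an explicit cast is the safe route):

import polars as pl

left = pl.DataFrame({"a": ["x", "y"]}).with_row_index("idx")   # idx: UInt32
right = pl.DataFrame({"b": [1.0, 2.0]}).with_columns(
    idx=pl.int_range(pl.len())                                 # idx: Int64
)

# Align the key dtypes explicitly before joining:
left.join(right.with_columns(pl.col("idx").cast(pl.UInt32)), on="idx")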

@Oreilles

Oreilles commented Jul 24, 2024

Actually, there is also a significant difference between pl.int_range(pl.len()) and the requested pl.row_index(): int_range doesn't return an elementwise expression, meaning that it cannot be used in a join:

a = pl.DataFrame({"x": ["a","b","c"]})
b = pl.DataFrame({"y": ["d","e"]})

a.join(
    b,
    left_on=pl.int_range(pl.len()).mod(len(b)),
    right_on=pl.int_range(pl.len()),
)
# InvalidOperationError: All join key expressions must be elementwise.
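
One possible workaround under that constraint is to materialize the index columns first and then join on them; a sketch only:

import polars as pl

a = pl.DataFrame({"x": ["a", "b", "c"]})
b = pl.DataFrame({"y": ["d", "e"]})

# Materialize the indices as real columns, then join on them:
a.with_row_index("idx").with_columns(pl.col("idx") % b.height).join(
    b.with_row_index("idx"), on="idx"
)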
