Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arange and date_range expressions differ in dimensions #9019

Closed
stinodego opened this issue May 24, 2023 · 16 comments · Fixed by #10526
Closed

arange and date_range expressions differ in dimensions #9019

stinodego opened this issue May 24, 2023 · 16 comments · Fixed by #10526
Assignees
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Milestone

Comments

@stinodego
Copy link
Contributor

stinodego commented May 24, 2023

Consider the following behaviour for arange:

>>> pl.select(pl.arange(0, 3))
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 0       │
│ 1       │
│ 2       │
└─────────┘

And compare it with the behaviour of date_range:

>>> pl.select(pl.date_range(date(2022,1,1), date(2022,1, 3)))
shape: (1, 1)
┌───────────────────────────────────┐
│ literal                           │
│ ---                               │
│ list[date]                        │
╞═══════════════════════════════════╡
│ [2022-01-01, 2022-01-02, 2022-01… │
└───────────────────────────────────┘

arange creates an expression with length n, while date_range creates a list-type expression with a single item: a list of length n.

To me, the date_range behaviour is surprising. For arange, the eager and lazy variants are similar in dimension, while for date_range, they are completely different (requires explode/implode to convert between the two).

What's the reason date_range behaves this way in lazy? And should we have the same behaviour for arange?

@stinodego stinodego added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 24, 2023
@ritchie46
Copy link
Member

The idea is that you can create a date range per row. Imagine having a column start and column offset. Then you create a range per element.

We tried to infer the behavior earlier, but this turns out to lead to inconsistencies, so I think we should accept a keyword that defines if we want to evaluate a single element or a column of elements.

@MarcoGorelli
Copy link
Collaborator

Imagine having a column start and column offset. Then you create a range per element.

sure but might you also want to create an arange per element?

The current behaviour, based on the last commit, is

In [3]: pl.select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=True))
Out[3]: 
shape: (3, 1)
┌────────────┐
│            │
│ ---        │
│ date       │
╞════════════╡
│ 2022-01-01 │
│ 2022-01-02 │
│ 2022-01-03 │
└────────────┘

In [4]: pl.select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=False))
Out[4]: 
shape: (1, 1)
┌───────────────────────────────────┐
│ literal                           │
│ ---                               │
│ list[date]                        │
╞═══════════════════════════════════╡
│ [2022-01-01, 2022-01-02, 2022-01… │
└───────────────────────────────────┘

In [5]: pl.select(pl.arange(0, 3, eager=True))
Out[5]: 
shape: (3, 1)
┌────────┐
│ arange │
│ ---    │
│ i64    │
╞════════╡
│ 0      │
│ 1      │
│ 2      │
└────────┘

In [6]: pl.select(pl.arange(0, 3, eager=False))
Out[6]: 
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 0       │
│ 1       │
│ 2       │
└─────────┘

For consistency, perhaps the last one should be

shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [0, 1, 2] │
└───────────┘

?

@ritchie46
Copy link
Member

For consistency, perhaps the last one should be

Yeap, let's do that.

@stinodego
Copy link
Contributor Author

stinodego commented May 24, 2023

Yes, that was what I was getting at. But that's not the full story.

We have two behaviours:

  • Vertical (column of length n)
  • Horizontal (list type column)

And we have two compute strategies:

  • Eager
  • Lazy

In your example, you're using the eager/lazy toggle to change the behaviour, and that is undesirable. So an extra keyword for the behaviour would make sense to me, as proposed by Ritchie.

@MarcoGorelli
Copy link
Collaborator

is an extra keyword necessary? can't you implode/explode your way into the output you want

In [31]: pl.select(pl.arange(0, 3, eager=True).implode())
Out[31]:
shape: (1, 1)
┌───────────┐
│ arange    │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [0, 1, 2] │
└───────────┘

In [32]: pl.select(pl.arange(0, 3, eager=True).implode().explode())
Out[32]:
shape: (3, 1)
┌────────┐
│ arange │
│ ---    │
│ i64    │
╞════════╡
│ 0      │
│ 1      │
│ 2      │
└────────┘

?

@stinodego
Copy link
Contributor Author

Your examples are strange to me because you're using eager=True in a select context.

But yes, if you implode the result of eager=True then you get the same output as when running eager=False. But that's the problem - the eager flag changes the behaviour. That flag should only change the compute strategy.

@ritchie46
Copy link
Member

I agree with @stinodego. eager shouldn't change behavior. I think we need an extra argument.

@josh
Copy link
Contributor

josh commented Jun 9, 2023

The other bug is that the lazy dtype is wrong. This also interferes with the implode() workaround.

ldf = pl.LazyFrame().select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=False))
ldf.schema # {'literal': Date}
df = ldf.collect()
df.schema # {'date': List(Date)}

@MarcoGorelli
Copy link
Collaborator

thanks @josh - I'll check when I'm home, but I'd like to think that that would be fixed by #8591


Regarding the extra keyword (say, orientation), what would the expected output be for

>>> df
shape: (3, 2)
┌────────────┬────────────┐
│ startend        │
│ ------        │
│ datedate       │
╞════════════╪════════════╡
│ 2020-01-012020-01-03 │
│ 2020-01-022020-01-04 │
│ 2020-01-032020-01-05 │
└────────────┴────────────┘
>>> df.with_columns(pl.date_range(pl.col('start'), pl.col('end'), orientation='vertical'))

? Should it just raise in that case?

@stinodego
Copy link
Contributor Author

stinodego commented Jun 12, 2023

I've given this some thought - I don't think we really need a keyword argument. I think it should be:

  • Default behaviour: expression output, list type output.
  • Pass eager=True: simply call pl.select(...).to_series() on the output of the default behaviour.
  • If you want your output "vertically": call explode on the output. No additional parameter.

That would be consistent, though we should document this carefully as it's slightly surprising behavior, at least in an eager context.

EDIT: Actually, that might not quite cover everything.. I'll have to look at this again when I'm behind a PC.

@stinodego
Copy link
Contributor Author

stinodego commented Jun 14, 2023

Idea: let's have a pair of functions:

  • date_range: Generate a single date range column
  • date_ranges: Generate a date range for each row of the input columns.

This solves a bunch of things: we now have a single output type per function, and the naming makes it very clear what to expect.

@MarcoGorelli
Copy link
Collaborator

Nice! The output dtype could still change depending on the time_zone argument, but at least it would resolve the List(T) vs T issue. So, 👍 to that

@stinodego stinodego added this to the Python Polars 1.0.0 milestone Jun 14, 2023
@MarcoGorelli
Copy link
Collaborator

@stinodego is this something you're already working on? If not, I have some time today, could give it a go

@stinodego
Copy link
Contributor Author

Just working on it right now, actually! But I am starting with arange - feel free to take up date_range / time_range if you want!

@MarcoGorelli
Copy link
Collaborator

The only use cases I can think of for date_ranges / time_ranges / aranges is when you have a DataFrame where one column is start and another is stop

In which case, I'd suggest that date_ranges / time_ranges / aranges don't have an eager argument, and only return expressions.
If a use case for time_ranges(..., eager=True) emerges, which isn't covered by time_range(...).implode(), then we can always add it later

@stinodego
Copy link
Contributor Author

Actually you can have this:

start = pl.Series([1,2,3])
pl.int_ranges(start, 5, eager=True)  # -> Returns a Series with multiple ranges

So it should have an eager argument.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants