arange and date_range expressions differ in dimensions #9019
Comments
The idea is that you can create a date range per row. Imagine having a column We tried to infer the behavior earlier, but this turns out to lead to inconsistencies, so I think we should accept a keyword that defines if we want to evaluate a single element or a column of elements. |
Sure, but might you also want to create an
The current behaviour, based on the last commit, is
In [3]: pl.select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=True))
Out[3]:
shape: (3, 1)
┌────────────┐
│ │
│ --- │
│ date │
╞════════════╡
│ 2022-01-01 │
│ 2022-01-02 │
│ 2022-01-03 │
└────────────┘
In [4]: pl.select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=False))
Out[4]:
shape: (1, 1)
┌───────────────────────────────────┐
│ literal │
│ --- │
│ list[date] │
╞═══════════════════════════════════╡
│ [2022-01-01, 2022-01-02, 2022-01… │
└───────────────────────────────────┘
In [5]: pl.select(pl.arange(0, 3, eager=True))
Out[5]:
shape: (3, 1)
┌────────┐
│ arange │
│ --- │
│ i64 │
╞════════╡
│ 0 │
│ 1 │
│ 2 │
└────────┘
In [6]: pl.select(pl.arange(0, 3, eager=False))
Out[6]:
shape: (3, 1)
┌─────────┐
│ literal │
│ --- │
│ i64 │
╞═════════╡
│ 0 │
│ 1 │
│ 2 │
└─────────┘
For consistency, perhaps the last one should be
shape: (1, 1)
┌───────────┐
│ literal │
│ --- │
│ list[i64] │
╞═══════════╡
│ [0, 1, 2] │
└───────────┘
? |
Yeap, let's do that. |
Yes, that was what I was getting at. But that's not the full story. We have two behaviours:
And we have two compute strategies:
In your example, you're using the eager/lazy toggle to change the behaviour, and that is undesirable. So an extra keyword for the behaviour would make sense to me, as proposed by Ritchie. |
Is an extra keyword necessary? Can't you implode/explode your way into the output you want
In [31]: pl.select(pl.arange(0, 3, eager=True).implode())
Out[31]:
shape: (1, 1)
┌───────────┐
│ arange │
│ --- │
│ list[i64] │
╞═══════════╡
│ [0, 1, 2] │
└───────────┘
In [32]: pl.select(pl.arange(0, 3, eager=True).implode().explode())
Out[32]:
shape: (3, 1)
┌────────┐
│ arange │
│ --- │
│ i64 │
╞════════╡
│ 0 │
│ 1 │
│ 2 │
└────────┘
? |
Your examples are strange to me because you're using But yes, if you implode the result of |
I agree with @stinodego. |
The other bug is that the lazy dtype is wrong. This also interferes with the
ldf = pl.LazyFrame().select(pl.date_range(date(2022,1,1), date(2022,1, 3), eager=False))
ldf.schema # {'literal': Date}
df = ldf.collect()
df.schema # {'date': List(Date)} |
Thanks @josh - I'll check when I'm home, but I'd like to think that that would be fixed by #8591.
Regarding the extra keyword (say,
>>> df
shape: (3, 2)
┌────────────┬────────────┐
│ start ┆ end │
│ --- ┆ --- │
│ date ┆ date │
╞════════════╪════════════╡
│ 2020-01-01 ┆ 2020-01-03 │
│ 2020-01-02 ┆ 2020-01-04 │
│ 2020-01-03 ┆ 2020-01-05 │
└────────────┴────────────┘
>>> df.with_columns(pl.date_range(pl.col('start'), pl.col('end'), orientation='vertical'))
? Should it just raise in that case? |
I've given this some thought - I don't think we really need a keyword argument. I think it should be:
That would be consistent, though we should document this carefully as it's slightly surprising behavior, at least in an eager context. EDIT: Actually, that might not quite cover everything. I'll have to look at this again when I'm behind a PC. |
Idea: let's have a pair of functions:
This solves a bunch of things: we now have a single output type per function, and the naming makes it very clear what to expect. |
Nice! The output dtype could still change depending on the |
@stinodego is this something you're already working on? If not, I have some time today, could give it a go |
Just working on it right now, actually! But I am starting with |
The only use cases I can think of for In which case, I'd suggest that |
Actually you can have this:
start = pl.Series([1,2,3])
pl.int_ranges(start, 5, eager=True) # -> Returns a Series with multiple ranges
So it should have an |
Consider the following behaviour for arange:
And compare it with the behaviour of date_range:
arange creates an expression with length n, while date_range creates a list-type expression with a single item: a list of length n.
To me, the date_range behaviour is surprising. For arange, the eager and lazy variants are similar in dimension, while for date_range, they are completely different (requires explode/implode to convert between the two).
What's the reason date_range behaves this way in lazy? And should we have the same behaviour for arange?