Inconsistency in constructing dataframes with timezone-aware datatypes #16823
Comments
@MarcoGorelli you stated in #16297 the following:
Apparently this does not always seem to be true!
Another example:
df = pl.DataFrame(
    schema={"value": pl.Struct({"dt": pl.Datetime(time_zone="Europe/Amsterdam")})},
    data=[
        [{"dt": datetime(2021, 1, 1, 12)}],
        [{"dt": datetime(2021, 1, 1, 13)}],
    ],
)
print(df)
shape: (2, 1)
┌────────────────────────────────┐
│ dt                             │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2021-01-01 13:00:00 CET        │
│ 2021-01-01 14:00:00 CET        │
└────────────────────────────────┘
Thanks @maxzw for reporting. This looks quite problematic. Fortunately, 1.0 is just around the corner, so it's the perfect moment to address this.
One solution is to modify the second constructor you brought up, with some extra logic here: polars/crates/polars-core/src/frame/row/av_buffer.rs Lines 96 to 102 in 4bbdac3
like
Alternatively, the reverse could be done: keep that constructor the way it is, and make all the others only ever convert to the given time zone (never replace), with naive datetimes getting converted from UTC. This would be a (partial) departure from pandas, and aligned with PyArrow. The struct constructor seems a lot harder to align to the others' behaviour...
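To make the two behaviours being discussed concrete, here is a minimal sketch using only the Python standard library. It is not how Polars, pandas, or PyArrow implement this; the helper names `replace_semantics` and `convert_semantics` are hypothetical and exist only for illustration.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

AMSTERDAM = ZoneInfo("Europe/Amsterdam")


def replace_semantics(dt: datetime, tz: ZoneInfo) -> datetime:
    """Hypothetical helper: naive values are read as wall-clock time in `tz`;
    tz-aware values are converted to `tz`."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=tz)
    return dt.astimezone(tz)


def convert_semantics(dt: datetime, tz: ZoneInfo) -> datetime:
    """Hypothetical helper: naive values are read as UTC, then everything
    is converted to `tz`."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(tz)


naive = datetime(2020, 1, 1)
print(replace_semantics(naive, AMSTERDAM))  # 2020-01-01 00:00:00+01:00
print(convert_semantics(naive, AMSTERDAM))  # 2020-01-01 01:00:00+01:00
```

The two helpers only disagree on naive input; for a tz-aware datetime both convert to the target zone and return the same instant.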
Right, so to summarise, one constructor follows PyArrow (convert everything to the given time zone, naive datetimes get converted as if from UTC), and the other follows pandas (for naive datetimes replace their time zone with the dtype's one, for tz-aware ones convert to the dtype's one):
In [15]: pl.DataFrame([[datetime(2020,1,1)], [datetime(2020,1,2)]], schema={'a': pl.Datetime('us', 'Europe/Amsterdam')})
Out[15]:
shape: (2, 1)
┌────────────────────────────────┐
│ a                              │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2020-01-01 01:00:00 CET        │
│ 2020-01-02 01:00:00 CET        │
└────────────────────────────────┘
In [16]: pa.table({'a': pa.array([datetime(2020, 1, 1)], type=pa.timestamp('us', 'Europe/Amsterdam'))})['a'][0]
Out[16]: <pyarrow.TimestampScalar: '2020-01-01T01:00:00.000000+0100'>
In [17]: pl.DataFrame({'a': [datetime(2020,1,1), datetime(2020,1,2)]}, schema={'a': pl.Datetime('us', 'Europe/Amsterdam')})
Out[17]:
shape: (2, 1)
┌────────────────────────────────┐
│ a                              │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2020-01-01 00:00:00 CET        │
│ 2020-01-02 00:00:00 CET        │
└────────────────────────────────┘
In [18]: pd.DataFrame([[datetime(2020, 1, 1)]], dtype='datetime64[us, Europe/Amsterdam]')
Out[18]:
                          0
0 2020-01-01 00:00:00+01:00
Well, this sucks. Whichever direction Polars goes in, is going to break someone's code.
I think the simplest and most predictable thing to do might be to:
What an unfortunate situation. I'll try putting together a PR to do the above suggestion, then we'll see how bad it is.
I think this is what I tried to address with #14211, but it's just been pending ¯\_(ツ)_/¯
I think it's not quite the same - this issue is about the
Sorry about that - will get to it after this, it might simplify the logic
Thanks for working this one out!
Agree, this seems the most intuitive to me.
Checks
Reproducible example
Log output
Issue description
Depending on whether the dataframe is initialised as a list of lists or a list of items, it gives different results.
Expected behavior
I would expect both dataframes to be the same.
Installed versions