Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency in constructing dataframes with timezone-aware datatypes #16823

Closed
2 tasks done
maxzw opened this issue Jun 8, 2024 · 8 comments · Fixed by #16828
Closed
2 tasks done

Inconsistency in constructing dataframes with timezone-aware datatypes #16823

maxzw opened this issue Jun 8, 2024 · 8 comments · Fixed by #16828
Labels
A-input-parsing Area: parsing input arguments bug Something isn't working P-medium Priority: medium python Related to Python Polars

Comments

@maxzw
Copy link

maxzw commented Jun 8, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.DataFrame(
    schema={"dt": pl.Datetime(time_zone="Europe/Amsterdam")},
    data=[
        datetime(2021, 1, 1, 12),
        datetime(2021, 1, 1, 13),
    ],
)
print(df)

df = pl.DataFrame(
    schema={"dt": pl.Datetime(time_zone="Europe/Amsterdam")},
    data=[
        [datetime(2021, 1, 1, 12)],
        [datetime(2021, 1, 1, 13)],
    ],
)
print(df)

Log output

shape: (2, 1)
┌────────────────────────────────┐
│ dt                             │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2021-01-01 12:00:00 CET        │
│ 2021-01-01 13:00:00 CET        │
└────────────────────────────────┘
shape: (2, 1)
┌────────────────────────────────┐
│ dt                             │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2021-01-01 13:00:00 CET        │
│ 2021-01-01 14:00:00 CET        │
└────────────────────────────────┘

Issue description

Depending on wether the dataframe is initialised as a list of lists or list of items it gives different results.

Expected behavior

I would expect both dataframes to be the same.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.0 (default, Mar  3 2022, 03:54:28) [Clang 12.0.0 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.1
connectorx:           <not installed>
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               2023.9.0
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.7.2
nest_asyncio:         1.6.0
numpy:                1.26.2
openpyxl:             <not installed>
pandas:               2.1.2
pyarrow:              14.0.1
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.20
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@maxzw maxzw added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 8, 2024
@maxzw
Copy link
Author

maxzw commented Jun 8, 2024

@MarcoGorelli you stated in #16297 the following:

"naive datetime, time zone => parse into given time zone. ✅ same"

Apparently this does not always seem to be true!

@maxzw
Copy link
Author

maxzw commented Jun 8, 2024

Another example:

df = pl.DataFrame(
    schema={"value": pl.Struct({"dt": pl.Datetime(time_zone="Europe/Amsterdam")})},
    data=[
        [{"dt": datetime(2021, 1, 1, 12)}],
        [{"dt": datetime(2021, 1, 1, 13)}],
    ],
)
print(df)
shape: (2, 1)
┌────────────────────────────────┐
│ dt                             │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2021-01-01 13:00:00 CET        │
│ 2021-01-01 14:00:00 CET        │
└────────────────────────────────┘

@MarcoGorelli
Copy link
Collaborator

thanks @maxzw for reporting

this looks quite problematic

fortunately, 1.0 is just around the corner, so it's the perfect moment to address this

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Jun 8, 2024

One solution is to modify the second constructor you brought up, with some extra logic here:

#[cfg(feature = "dtype-datetime")]
(Datetime(builder, tu_l, _), AnyValue::Datetime(v, tu_r, _)) => {
// we convert right tu to left tu
// so we swap.
let v = convert_time_units(v, tu_r, *tu_l);
builder.append_value(v)
},

like

match (tz_l, tz_r) {
   Some(tz_l), None => val.replace_time_zone(Some(tz_l)),
   None, Some(tz_r) => val.convert_time_zone(Some("UTC")),
   None, None => val,
   Some(tz_l), Some(tz_r) => val.convert_time_zone(Some(tz_l))
}

Alternatively, the reverse could be done - keep that constructor the way it is, and make all the others only ever convert to the given time zone (never replace), with naive datetimes getting converted from UTC. This would be a (partial) departure from pandas, and aligned with PyArrow.
Due to the partial behaviour change, I think Polars should also raise a warning if constructing a Series with naive datetimes and a tz-aware dtype


The struct constructor seems a lot harder to align to the others' behaviour...

@MarcoGorelli
Copy link
Collaborator

Right, so to summarise, one constructor follows PyArrow (convert everything to the given time zone, naive datetimes get converted as if from UTC), and the other follows pandas (for naive datetimes replace their time zone with the dtype's one, for tz-aware ones convert to the dtype's one):

In [15]: pl.DataFrame([[datetime(2020,1,1)], [datetime(2020,1,2)]], schema={'a': pl.Datetime('us', 'Europe/Amsterdam')})
Out[15]:
shape: (2, 1)
┌────────────────────────────────┐
│ a                              │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2020-01-01 01:00:00 CET        │
│ 2020-01-02 01:00:00 CET        │
└────────────────────────────────┘

In [16]: pa.table({'a': pa.array([datetime(2020, 1, 1)], type=pa.timestamp('us', 'Europe/Amsterdam'))})['a'][0]
Out[16]: <pyarrow.TimestampScalar: '2020-01-01T01:00:00.000000+0100'>

In [17]: pl.DataFrame({'a': [datetime(2020,1,1), datetime(2020,1,2)]}, schema={'a': pl.Datetime('us', 'Europe/Amsterdam')})
Out[17]:
shape: (2, 1)
┌────────────────────────────────┐
│ a                              │
│ ---                            │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2020-01-01 00:00:00 CET        │
│ 2020-01-02 00:00:00 CET        │
└────────────────────────────────┘

In [18]: pd.DataFrame([[datetime(2020, 1, 1)]], dtype='datetime64[us, Europe/Amsterdam]')
Out[18]:
                          0
0 2020-01-01 00:00:00+01:00

Well, this sucks. Whichever direction Polars goes in, is going to break someone's code

I think the simplest and most predictable thing to do might be to:

  • consistently follow the PyArrow behaviour here, as it's simpler and easier to teach
  • always raise a user warning when constructing a dataframe with naive datetimes and a tz-aware dtype. For nested dtypes this may not be possible without a performance hit, but if it can at least be done in other cases, it should save a lot of headaches

What an unfortunate situation

I'll try putting together a PR to do the above suggestion, then we'll see how bad it is

@deanm0000
Copy link
Collaborator

I think this is what i tried to address with #14211 but it's just been pending ¯_(ツ)_/¯

@MarcoGorelli
Copy link
Collaborator

I think this is what i tried to address with #14211

I think it's not quite the same - this issue is about the replace vs convert inconsistency (even though the result dtype is the same), the linked issues in #14211 are about the resulting dtype (even though they consistently convert to UTC)

but it's just been pending ¯_(ツ)_/¯

Sorry about that - will get to it after this, it might simplify the logic

@stinodego stinodego added the A-input-parsing Area: parsing input arguments label Jun 10, 2024
@stinodego
Copy link
Member

Thanks for working this one out!

consistently follow the PyArrow behaviour here, as it's simpler and easier to teach

Agree, this seems the most intuitive to me.

@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jun 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-input-parsing Area: parsing input arguments bug Something isn't working P-medium Priority: medium python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants