Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Index.astype does not localize to UTC first like Series.astype with datetime64[ns, tz] dtypes #33401

Closed
3 tasks done
mroeschke opened this issue Apr 8, 2020 · 14 comments
Closed
3 tasks done
Labels
API Design Bug Needs Discussion Requires discussion from core team before further action Timezones Timezone data dtype

Comments

@mroeschke
Copy link
Member

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [4]: pd.__version__
Out[4]: '1.1.0.dev0+1202.g1de2cf1ac'

In [5]: pd.Series([pd.Timestamp('2020-01-01 15:00')]).astype('datetime64[ns, EST]')
Out[5]:
0   2020-01-01 10:00:00-05:00
dtype: datetime64[ns, EST]

In [6]: pd.Index([pd.Timestamp('2020-01-01 15:00')]).astype('datetime64[ns, EST]')
Out[6]: DatetimeIndex(['2020-01-01 15:00:00-05:00'], dtype='datetime64[ns, EST]', freq=None)

Problem description

Out[5] is the expected output as documented in https://pandas.pydata.org/docs/user_guide/timeseries.html#time-zone-series-operations. Out[6] should align therefore with Out[5]

Expected Output

In [6]: pd.Index([pd.Timestamp('2020-01-01 15:00')]).astype('datetime64[ns, EST]')
Out[6]: DatetimeIndex(['2020-01-01 10:00:00-05:00'], dtype='datetime64[ns, EST]', freq=None)
@mroeschke mroeschke added Bug Needs Triage Issue that has not been reviewed by a pandas team member Timezones Timezone data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 8, 2020
@mroeschke mroeschke changed the title BUG: BUG: Index.astype does not localize to UTC first like Series.astype with datetime64[ns, tz] dtypes Apr 8, 2020
@jbrockmendel
Copy link
Member

cc @jorisvandenbossche @TomAugspurger fixing this inconsistency would move us a step closer to not-special-casing DatetimeTZBlock

@TomAugspurger
Copy link
Contributor

Cool. Agreed with the expected output.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 13, 2020

I am fully on board with making this consistent; but I am not fully sure I find the output of Series more expected than the output of DatetimeIndex' astype (it's rather subjective though).

The docs say that astype can "localize and convert" tz-naive timestamps, but doesn't explicitly says "localize to UTC and convert".

Basically the two options (expressed with an explicit localize):

In [12]: s_naive = pd.Series(pd.date_range('20130101', periods=3))  

In [13]: s_naive.astype('datetime64[ns, EST]')
Out[13]: 
0   2012-12-31 19:00:00-05:00
1   2013-01-01 19:00:00-05:00
2   2013-01-02 19:00:00-05:00
dtype: datetime64[ns, EST]

In [15]: s_naive.dt.tz_localize("UTC").astype('datetime64[ns, EST]') 
Out[15]: 
0   2012-12-31 19:00:00-05:00
1   2013-01-01 19:00:00-05:00
2   2013-01-02 19:00:00-05:00
dtype: datetime64[ns, EST]

In [16]: s_naive.dt.tz_localize("EST").astype('datetime64[ns, EST]')
Out[16]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, EST]

In the astype code, we also explicitly do a tz_localize with the new timezone, indicating it is likely done on purpose and not an accident (so might be worth looking at the history of this change).

Now, regardless of my comment about the expected result: is this something we want to simply change? (which is a breaking change) Or can we deprecate it first? (I think users can then avoid the warning by using tz_localize explicitly)
Given that this is about DatetimeIndex, the index version might also be used quite a lot.

@jorisvandenbossche
Copy link
Member

On second thought, the "localize and convert" in the docs probably mean "localize to UTC and convert", indeed as @mroeschke suggested. Because otherwise if would just be a "localize" of course (there is nothing to convert anymore when localizing to the target timezone ..)

@jorisvandenbossche
Copy link
Member

So do we actually have arguments for either behaviours? (apart from "because Series is doing it that way")

@jbrockmendel
Copy link
Member

I think the Technically Correct answer may be to raise and require users to use tz_localize

The semantics of astype usually mean "represent the same information in another dtype". When going from tznaive<->tzaware, no matter which convention we use, the result does not represent "the same information"

@jreback
Copy link
Contributor

jreback commented Oct 13, 2020

I recall the initial implementation was pure convenience and in retrospect was a mistake. think we should actually deprecate this entirely. Being explicit about using .tz_localize or .tz_convert is a huge win here.

@jbrockmendel
Copy link
Member

For Series/DatetimeIndex/DatetimeArray there is an easy "do this instead". what about DataFrame.astype?

@jorisvandenbossche
Copy link
Member

I did a bit of history search with git blame to see where the type of "localize" behaviour was introduced. For Series.astype, it was in #11003. For Index.astype, it was in #18937 (so a bit later, but introducing a different behaviour as Series.astype).
But there was not much discussion about it that could be enlightening (in case of Index.astype, it was actually fixing a bug, because before that PR it was completely broken, removing any tz)

This issue was actually also opened in response to #33399, where someone reported the current Series.astype behaviour as confusing (expecting it to localize to the specified tz, and not localizing to UTC and then converting to the specified tz).
Of course if we want to keep this astype ability, we need to choice one of both, and some people will always expect the other behaviour. But just as another point of evidence that this is quite ambiguous.


As another example, compare the Series constructor vs astype:

In [18]: pd.Series([pd.Timestamp("2012-01-01")], dtype='datetime64[ns, EST]')
Out[18]: 
0   2012-01-01 00:00:00-05:00
dtype: datetime64[ns, EST]

In [19]: pd.Series([pd.Timestamp("2012-01-01")]).astype('datetime64[ns, EST]') 
Out[19]: 
0   2011-12-31 19:00:00-05:00
dtype: datetime64[ns, EST]

I think it could also be reasonable to expect the same result.

That actually reminds me of the long discussion we had in #24559 (although that was mostly about how to interpret integers, but it still re-confirmed the interpretation of tz-naive timestamps as "wall-time" in constructors)

@jorisvandenbossche jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action labels Oct 19, 2020
@jbrockmendel
Copy link
Member

@jorisvandenbossche it sounds like you lean towards making [19] match [18]? Thoughts on the just-require-tz_localize approach?

I haven't had any bright ideas on how to handle the DataFrame.astype({key: dtype}) case if we require explicit tz_localize.

@jorisvandenbossche
Copy link
Member

it sounds like you lean towards making [19] match [18]?

I would at least make them consistent. And when looking at both options, I think I personally find the output of [18] more logical than [19]

Simply disallowing would solve any ambiguity, but I think the practicality of having some way to do this with astype (eg to cover the DataFrame case you mention) is for me a good enough reason to keep it.

@jbrockmendel
Copy link
Member

Related: different DTI vs Series constructor behavior, based on test_construction_consistency

ser = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

result = pd.Series(ser.values, dtype=ser.dtype)
alt = pd.Series(pd.DatetimeIndex(ser.values, dtype=ser.dtype))

>>> result
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

>>> alt
0   2013-01-01 05:00:00-05:00
1   2013-01-02 05:00:00-05:00
2   2013-01-03 05:00:00-05:00
dtype: datetime64[ns, US/Eastern]

@jbrockmendel
Copy link
Member

A point in favor of the Series behavior is that it round-trips correctly ser.astype("M8[ns]").astype(ser.dtype).equals(ser) while it does not for DTI. (unless we dont like the astype("M8[ns]") behavior)

@mroeschke
Copy link
Member Author

This was deprecated in #39258 which makes this bug a nonissue. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Needs Discussion Requires discussion from core team before further action Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants