BUG: preserve fold in Timestamp.replace #37644

Merged 9 commits on Nov 8, 2020

AlexKirko (Member) commented Nov 5, 2020

Problem

We currently lose fold information (whether the Timestamp corresponds to the first or second instance of wall clock time in a DST transition) when calling Timestamp.replace.
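
To make the fold distinction concrete, here is a minimal stdlib-only sketch. ToyMoscow2009 is a hypothetical tzinfo written for this illustration (it is not part of pandas or dateutil), and it models only the 2009-10-25 fall-back in Moscow, assuming the clocks moved from 03:00 at UTC+4 back to 02:00 at UTC+3. It shows that the same wall time maps to two different UTC instants depending on fold, which is the information replace should not drop.

```python
from datetime import datetime, timedelta, timezone, tzinfo


class ToyMoscow2009(tzinfo):
    """Hypothetical zone covering only the 2009-10-25 Moscow fall-back.

    Wall times in [02:00, 03:00) occur twice on that day, so `fold` is
    needed to tell the first pass (UTC+4) from the second pass (UTC+3).
    """

    _FOLD_START = datetime(2009, 10, 25, 2, 0)
    _FOLD_END = datetime(2009, 10, 25, 3, 0)

    def utcoffset(self, dt):
        wall = dt.replace(tzinfo=None)  # naive comparison ignores fold
        if wall < self._FOLD_START:
            return timedelta(hours=4)
        if wall < self._FOLD_END:
            # Ambiguous hour: fold=0 is the first pass, fold=1 the second.
            return timedelta(hours=4 if dt.fold == 0 else 3)
        return timedelta(hours=3)

    def dst(self, dt):
        return self.utcoffset(dt) - timedelta(hours=3)

    def tzname(self, dt):
        return "MSD" if self.dst(dt) else "MSK"


tz = ToyMoscow2009()
first = datetime(2009, 10, 25, 2, 30, fold=0, tzinfo=tz)
second = datetime(2009, 10, 25, 2, 30, fold=1, tzinfo=tz)

# Same wall clock reading, two different instants in UTC.
first_utc = first.astimezone(timezone.utc)
second_utc = second.astimezone(timezone.utc)

# The stdlib datetime.replace keeps fold unless it is passed explicitly.
second_replaced = second.replace(microsecond=1)
```

Here first_utc is 22:30 UTC and second_utc is 23:30 UTC on 2009-10-24. The second instant corresponds to the epoch value 1256427000 s that the original test feeds to Timestamp.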

Solution

A simple addition to the code of Timestamp.replace should fix this.

Test

I added a test for the use case I came up with in the issue discussion. The OP example loses fold when removing timezone information with Timestamp.replace, which is not really a bug, but replacing a valid dateutil timezone with itself and losing fold definitely is.

The proposed solution fixes the original example as well. I just don't think we should track it in tests, as it's not clear to me why fold must be preserved in a Timestamp with no timezone information (although the fold section of PEP 495 recommends letting users keep an invalid fold and simply ignoring it).

Some details

IIRC, we ignore fold when it has no effect, so we should be safe preserving fold while replacing tzinfo with None, as in the OP example. I remember this coming up when we introduced fold support: we made sure that the fold-aware functions don't care what fold is outside of a DST transition with a dateutil timezone (this was done to satisfy the requirements of the fold section of PEP 495).
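
The stdlib datetime follows the same convention, which may help as a reference point. The sketch below uses only datetime (not Timestamp) and shows that dropping tzinfo keeps fold, and that a naive value with a leftover fold=1 still compares equal to its fold=0 twin. The variable names are illustrative.

```python
from datetime import datetime, timezone

# fold=1 is accepted even though UTC never has an ambiguous hour.
aware = datetime(2009, 10, 25, 2, 30, fold=1, tzinfo=timezone.utc)

# Removing the timezone keeps fold as-is.
naive = aware.replace(tzinfo=None)

# A naive datetime differing only in fold compares equal.
naive_fold0 = datetime(2009, 10, 25, 2, 30, fold=0)
same_value = naive == naive_fold0
```

With this behavior, keeping fold when tzinfo is replaced with None is harmless: the attribute survives, but nothing that compares naive values depends on it.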

AlexKirko (Member Author) commented:

All green, should be ready for a review.

@@ -1190,3 +1190,13 @@ def test_tz_localize_invalidates_freq():
    dti2 = dti[:1]
    result = dti2.tz_localize("US/Eastern")
    assert result.freq == "H"


def test_replace_preserves_fold():

Contributor commented:

should go here: pandas/tests/scalar/timestamp/test_unary_ops.py (there is another replace test)

Member Author commented:

Done.

jreback (Contributor) commented Nov 5, 2020

cc @pganssle @jbrockmendel

jreback added this to the 1.2 milestone Nov 5, 2020
jreback added the Datetime, Timezones, and Bug labels Nov 5, 2020
    # GH 37610. Check that replace preserves Timestamp fold property
    tz = gettz("Europe/Moscow")

    result = Timestamp(1256427000000000000, tz=tz, unit="ns").replace(tzinfo=tz).fold

pganssle (Contributor) commented Nov 5, 2020

A few things:

  1. It should be much more obvious that this thing started with the fold set correctly. This could fail for reasons unrelated to the thing you are testing.
  2. You should test that it works both ways: fold=0 and fold=1.
  3. When you do replace(tzinfo=tz), you are creating an identical timestamp. I suspect that it would be very reasonable for a future version of Python or Pandas to optimize that case and simply return the original Timestamp/datetime if nothing is changed, making this not an ideal test.
  4. This method-chaining style is not going to be great in terms of tracebacks if the test fails. The same line covers three separate things that could fail: constructing the Timestamp, executing replace and accessing the .fold property. I would do these on separate lines.

Probably something more like this:

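from dateutil.tz import gettz
from pandas import Timestamp

zone = gettz("America/New_York")

# fold is supplied by the test parametrization (0 or 1)
ts = Timestamp(2020, 11, 1, 1, 30, fold=fold, tzinfo=zone)
ts_replaced = ts.replace(microsecond=1)  # the keyword is microsecond (singular)
assert ts_replaced.fold == fold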

Where fold is parameterized over [0, 1].
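
The structure suggested above can be sketched without pandas or pytest. This is a stdlib-only analogue, not the test that was merged: datetime stands in for Timestamp, a plain loop stands in for pytest parametrization, and check_replace_preserves_fold is a hypothetical helper name. Each step sits on its own line so a failure points at one operation.

```python
from datetime import datetime


def check_replace_preserves_fold(fold):
    # Step 1: construct the value and confirm it starts with the expected fold.
    dt = datetime(2020, 11, 1, 1, 30, fold=fold)
    assert dt.fold == fold

    # Step 2: change an unrelated field so the result cannot be the same value.
    replaced = dt.replace(microsecond=1)
    assert replaced != dt

    # Step 3: only now check the property under test.
    assert replaced.fold == fold
    return replaced


# Stand-in for parametrizing over [0, 1].
results = [check_replace_preserves_fold(fold) for fold in (0, 1)]
```

Changing microsecond rather than re-passing the same tzinfo addresses the third point in the list: the replaced value genuinely differs, so an identity shortcut in replace could not make the check pass by accident.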

AlexKirko (Member Author) commented Nov 5, 2020

@pganssle Thanks! I had made changes to the test, instead testing our behavior versus datetime, but your suggestion makes more sense. This way, we don't care whether datetime works properly (on the slim chance that it breaks). Introduced the changes. Hope you don't mind me keeping the OP timezone.

Please take a look at the new version.

AlexeyDmitriev commented:

You can make replacing tzinfo work as a reasonable test if you switch to another timezone whose DST change happens at the same time, e.g., from Moscow to Novosibirsk.

AlexKirko (Member Author) commented Nov 6, 2020

@AlexeyDmitriev That would work, but also make the test a bit less readable (we'd need to comment that we are swapping between two DST-zones, and both times are in a fold). I think leaving the current test should be okay.

    # GH 37610. Check that replace preserves Timestamp fold property
    tz = gettz("Europe/Moscow")

    ts = Timestamp(year=2009, month=10, day=25, hour=2, minute=30, fold=fold, tz=tz)

Contributor commented:

Why tz= instead of tzinfo=? I seem to remember that there are vague plans to remove tz= as redundant or something?

Member commented:

The tz kwarg accepts a string, whereas tzinfo requires a tzinfo object.

Contributor commented:

But that's not what's happening here...

AlexKirko (Member Author) commented Nov 6, 2020

Thanks, agreed. Passing a tzinfo object into the tz argument gives the same result (as all tzinfo gives us is a Cython type-check before we set tz, tzinfo = tzinfo, None), but passing tzinfo is more readable and doc-compliant.
Changed it, please take a look.

AlexKirko (Member Author) commented:

@jreback @jbrockmendel @pganssle
Accepted all suggestions, all green, detailed comments above.

jbrockmendel (Member) commented:

Looks fine to me, but I'm going to agree with whatever @pganssle says here

jreback merged commit 84bee25 into pandas-dev:master Nov 8, 2020

jreback (Contributor) commented Nov 8, 2020

thanks @AlexKirko

AlexKirko deleted the preserve-fold-ts-replace branch November 9, 2020 13:10
jreback added a commit that referenced this pull request Nov 13, 2020
… (#37655)
