Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement assert_tzawareness_compat for DatetimeIndex #18376

Merged
merged 37 commits into from
Jan 6, 2018

Conversation

jbrockmendel
Copy link
Member

closes #18162

ATM comparing tzaware DatetimeIndex with tznaive DTI fails to raise. This PR implements _assert_tzawareness_compat (which currently exists in _Timestamp) in DatetimeIndex to fix that.

That in turn causes breakage in Series.replace. There's a small edit in core.internals to fix that, but I'm not sure that's the best way to make that work.

There is still one remaining test error in tests.series.test_indexing.TestSeriesIndexing.test_getitem_setitem_datetimeindex, which may either represent a bug or a case where tznaive/tzaware rules are relaxed.

N = 50
rng = date_range('1/1/1990', periods=N, freq='H', tz='US/Eastern')
ts = Series(np.random.randn(N), index=rng)

lb = "1990-01-01 04:00:00"
rb = "1990-01-01 07:00:00"

>>> result = ts[(ts.index >= lb) & (ts.index <= rb)]
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects

Later in the same tests it tries to compare ts.index to naive datetime objects. Is this a special case where we are intentionally less strict?

@@ -649,6 +653,20 @@ def _simple_new(cls, values, name=None, freq=None, tz=None,
result._reset_identity()
return result

def _assert_tzawareness_compat(self, other):
# adapted from _Timestamp._assert_tzawareness_compat
other_tz = getattr(other, 'tzinfo', None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is odd to use tzinfo, these are already wrapped scalars, or an index type, so .tz is appropriate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first thought too but other could be a raw datetime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .tz, you can simply wrap other = pd.Timestamp(other)

if isna(s):
return isna(values)
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
if is_datetime64tz_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the right place to fix if this is even an issue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree, could use advice on where is the right place. Without this change there is a test failure in tests.indexing.test_coercion.TestReplaceSeriesCoercion.test_replace_series_datetime64tz with traceback

    def test_replace_series_datetime64tz(self):
        from_key = 'datetime64[ns, US/Eastern]'
        for to_key in self.rep:
>           self._assert_replace_conversion(from_key, to_key, how='dict')

pandas/tests/indexing/test_coercion.py:1317: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/tests/indexing/test_coercion.py:1239: in _assert_replace_conversion
    result = obj.replace(replacer)
pandas/core/generic.py:4514: in replace
    limit=limit, regex=regex)
pandas/core/generic.py:4563: in replace
    regex=regex)
pandas/core/internals.py:3504: in replace_list
    masks = [comp(s) for i, s in enumerate(src_list)]
pandas/core/internals.py:3502: in comp
    operator.eq)
pandas/core/internals.py:4954: in _maybe_compare
    result = op(a, b)
pandas/core/indexes/datetimes.py:122: in wrapper
    self._assert_tzawareness_compat(other)

where the relevant obj.replace(replacer) call has

index = pd.Index([3, 4], name='xxx')
data = [Timestamp('2011-01-01 00:00:00-0500', tz='US/Eastern'), Timestamp('2011-01-03 00:00:00-0500', tz='US/Eastern')]
obj = pd.Series(data, index=index, name='yyy')
replacer = {Timestamp('2011-01-01 00:00:00-0500', tz='US/Eastern'): 1.1, Timestamp('2011-01-03 00:00:00-0500', tz='US/Eastern'): 2.2}

The problem is that in internals replace_list is converting to m8 which drops tz.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the place for this, might need to fixe .replace in the block itself.

@gfyoung gfyoung added Datetime Datetime data dtype Bug Timezones Timezone data dtype and removed Datetime Datetime data dtype labels Nov 20, 2017
@jbrockmendel
Copy link
Member Author

#17920 makes some desig decisions about tzawareness strictness. Conditional on those decisions, comparisons should probably use the same conventions.

@jreback
Copy link
Contributor

jreback commented Nov 22, 2017

#17920 is orthogonal. That is purely string based. When you use Timestamps, they must be the same timezone (or None) or should raise

@jbrockmendel
Copy link
Member Author

#17920 is orthogonal. That is purely string based. When you use Timestamps, they must be the same timezone (or None) or should raise

See the OP. The unresolved error here is in comparing a tzaware DatetimeIndex against naive-like strings.

rng = pd.date_range('1/1/1990', periods=50, freq='H', tz='US/Eastern')
ts = pd.Series(np.random.randn(50), index=rng)

lb = "1990-01-01 04:00:00"
rb = "1990-01-01 07:00:00"

Under the status quo, the following are equivalent:

ts[lb:rb]
ts[(ts.index >= lb) & (ts.index <= rb)]

#17920 affects the behavior of ts[lb:rb] (though the output is unchanged in this case) while this affects the behavior of ts.index >= lb. If we fix the comparison issue and implement #17920, we break the equivalence of slicing methods. Which may be acceptable; I just want to be explicit about it.

@1kastner
Copy link

1kastner commented Nov 23, 2017

@jbrockmendel Honestly I used some parts of pandas for my project and focussed on that part but I do not have deep insight into the internals and wherelse the changes I suggested could be made. Your comparison seems like an edge case to me but people have different tasks to do with this great library. I would rather go for the split between naive and timezoned datetime-like objects and my UserWarning instead of an error was rather for backwards-compability. Whenever we mix naive and timezoned things, it gets nasty. In my opinion assuming UTC is not good and should be done explicitely by the programmers.

@jbrockmendel
Copy link
Member Author

Increasingly I think we should be strict about this, i.e. be Technically Correct. Anything else is going to require keeping track of what rules are loosened in which special cases, and will inevitably lead to headaches down the road.

That said, an idea for a workaround for slicing tzaware indexes: interpret a trailing "TZ" at the end of a string as "interpret this naive datetime-like string with this DatetimeIndex's timezone".

Is #17920 motivated by a pressing use case where you've got a tzaware DatetimeIndex? I expect the large majority of indexing use cases are tznaive.

@1kastner
Copy link

It is motivated by a use case with data from several different data sources. Some are UTC, some are German Winter Time (without daylight saving time, definitely not a standard time), some are CET etc. I believe that mixing different data sources often has timezone issues as a consequence. But a prove I can not serve.

Adding a TZ is not an option in ISO 8601 as far as I have read it. And I would try to comply to that standard for the labels. Your attempt sounds quite experimential to me.

@jbrockmendel
Copy link
Member Author

Let's move the discussion over to #18435. Closing this PR until the design issue is resolved.

@jbrockmendel
Copy link
Member Author

Reopening. If we're going to introduce inconsistencies, might as well go full-speed ahead and get them sorted out sooner rather than later. Will push update shortly.

@jbrockmendel jbrockmendel reopened this Nov 24, 2017
@codecov
Copy link

codecov bot commented Nov 24, 2017

Codecov Report

Merging #18376 into master will decrease coverage by 0.01%.
The diff coverage is 95.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18376      +/-   ##
==========================================
- Coverage   91.36%   91.34%   -0.02%     
==========================================
  Files         163      163              
  Lines       49704    49723      +19     
==========================================
+ Hits        45411    45420       +9     
- Misses       4293     4303      +10
Flag Coverage Δ
#multiple 89.14% <95.83%> (ø) ⬆️
#single 39.66% <33.33%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 95.54% <100%> (+0.05%) ⬆️
pandas/core/internals.py 94.44% <90%> (-0.03%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aec3347...0f2287c. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 24, 2017

Codecov Report

Merging #18376 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18376      +/-   ##
==========================================
+ Coverage   91.53%   91.53%   +<.01%     
==========================================
  Files         148      148              
  Lines       48688    48701      +13     
==========================================
+ Hits        44566    44579      +13     
  Misses       4122     4122
Flag Coverage Δ
#multiple 89.9% <100%> (ø) ⬆️
#single 41.63% <71.42%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimes.py 95.23% <100%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f2d8db1...31b7336. Read the comment docs.

with pytest.raises(TypeError):
op(left, right)

# Check that there isn't a problem aware-aware and naive-naive do not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests should be in series (the latter ones). alternatively a better place might be tin test_base for these, where we already handle series & index tests.

@@ -649,6 +653,20 @@ def _simple_new(cls, values, name=None, freq=None, tz=None,
result._reset_identity()
return result

def _assert_tzawareness_compat(self, other):
# adapted from _Timestamp._assert_tzawareness_compat
other_tz = getattr(other, 'tzinfo', None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use .tz, you can simply wrap other = pd.Timestamp(other)

if isna(s):
return isna(values)
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
if is_datetime64tz_dtype(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not the place for this, might need to fixe .replace in the block itself.

@jbrockmendel
Copy link
Member Author

good if somebody made kind of an overview of the different related issues

That was the idea behind #18435. Feel free to edit if that if the OP doesn't provide enough overview for your tastes.

And I think we certainly should not regard it is as just a bug fix (eg in the whatsnew notes).

How to treat it in the whatsnew notes is above my pay grade. But over the course of this PR's history I've become increasingly convinced that the current comparison behavior is Just Plain Wrong and should be treated like a bug.

The three options on hand are 1) this PR which makes DatetimeIndex comparisons behave like all other datetime-like comparisons, 2) edit the spec to explicitly make string comparisons a special case, 3) do nothing. I'm going to assume away 3 and argue against 2.

AFAICT the main objection to 1) is that in conjunction with #17920 it breaks the equivalence between ser.loc[lower:upper] and ser[(ser.index >= lower) & (ser.index <= upper)]. But consider what other equivalences are broken by 2:

  • We can no longer count on DatetimeIndex comparisons to be transitive. a >= b and b >= c no longer implies a >= c.
  • Comparison-broadcasting becomes inconsistent. (index >= bound)[n] no longer equals index[n] >= bound (unless we then go mess with Timestamp...)
  • Comparison-boxing becomes inconsistent. index >= bound no longer necessarily equals index >= Timestamp(bound).
  • Box-conversion becomes inconsistent. index >= bound no longer necessarily equals index.astype(object) >= bound

... and most of all, I am not remotely confident that this list is complete. How many places across the code-base do comparisons with DatetimeIndex objects in one place or another? I have no idea.

A tz-aware DatetimeIndex is not easy to get by accident. If a user has one, they have made a decision that timezones matter.


Any of the available options introduces an inconsistency somewhere. AFAICT Option1 breaks a convenience equivalency, will do so loudly, and as a result will not snowball into other inconsistencies.

Special-casing string comparisons generates a whole mess of other potential (often silent) problems that can be avoided by enforcing behavior that is already canonical.

@jorisvandenbossche
Copy link
Member

That was the idea behind #18435. Feel free to edit if that if the OP doesn't provide enough overview for your tastes.

Ah, yes, forgot there was already quite some discussion there. Can you put your last comment there as well? Then I will answer there

@jbrockmendel
Copy link
Member Author

Just pushed a commit that lets tzawareness compat slide for strings, enforces it for everything else.

@jbrockmendel
Copy link
Member Author

Closes #12601

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche with the most recent change this allows strings through without changing current behavior, only checks tzawareness-compat for datetime and vectors. i.e. hopefully this fixes the part that we all agree is a bug without taking a stand on the rest.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Should this also be tested for series/series and series/index comparisons?

@@ -821,6 +821,9 @@ def test_replace_series(self, how, to_key, from_key):
if (from_key.startswith('datetime') and to_key.startswith('datetime')):
pytest.xfail("different tz, currently mask_missing "
"raises SystemError")
elif from_key in ['datetime64[ns, US/Eastern]', 'datetime64[ns, UTC]']:
pytest.xfail(reason='GH #18376, tzawareness-compat bug '
'in BlockManager.replace_list')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be discussed before, but since you need to add a xfail here, is this introducing a regression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess? See above:

so replace_list is a giant hack ATM and needs to be fixed. maybe this is the impetus. However, don't really want to add your hack on top. What exactly is failing w/o you touching replace? let's isolate and x-fail those tests for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an explicit code example that works now and will fail after this PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche good catch. Two things here:

  1. Does the fact that this is not immediately obvious by inspection suggest that test parametrization may have been taken a step too far?

  2. In putting together an answer to this question I found that any non-datetime comparison raises, whereas I expect we want DatetimeIndex(...) == "baz" to just be Falsey (following the behavior of Timestamp.__richcmp__ and Timedelta.__richcmp__). So I need to fix this, will answer the original question after that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche An example of a case that xfails after this PR:

index = pd.Index([3, 4], dtype='int64', name=u'xxx')
obj = pd.Series([pd.Timestamp('2011-01-01', tz='US/Eastern'),
                 pd.Timestamp('2011-01-03', tz='US/Eastern')],
                index=index, name='yyy')
replacer = pd.Series([pd.Timedelta(days=1), pd.Timedelta(days=2)],
                     index=obj)

>>> result = obj.replace(replacer)
[...]
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects

@jbrockmendel
Copy link
Member Author

Should this also be tested for series/series and series/index comparisons?

Separately. I've just started looking at Series vs Index arithmetic/comparisons and that's going to be a long process (see #18824). This PR has been a tough slog; I'd really like to get it over with.

@jreback
Copy link
Contributor

jreback commented Dec 21, 2017

this looks fine now that have eliminated the controversial string coercing.

@jorisvandenbossche

@jreback
Copy link
Contributor

jreback commented Dec 23, 2017

rebase

@@ -395,4 +395,5 @@ Categorical
Other
^^^^^

- Fixed bug where comparing :class:`DatetimeIndex` failed to raise ``TypeError`` when attempting to compare timezone-aware and timezone-naive datetimelike objects (:issue:`18162`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to conversaion

@jreback
Copy link
Contributor

jreback commented Jan 5, 2018

whatsnew change, otherwise lgtm.

@jbrockmendel
Copy link
Member Author

Ping

@jreback jreback merged commit 5bd0836 into pandas-dev:master Jan 6, 2018
@jreback
Copy link
Contributor

jreback commented Jan 6, 2018

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DatetimeIndex comparison tzaware vs naive should raise
5 participants