
Recognize timezoned labels when accessing dataframes. #17920

Closed

Conversation


@1kastner 1kastner commented Oct 19, 2017

@1kastner
Author

Ouch, looks like some work for me...

@@ -123,6 +123,30 @@ def test_consistency_with_tz_aware_scalar(self):
result = df[0].at[0]
assert result == expected

def test_access_datetimeindex_with_timezoned_label(self):

Member

Can you add the github issue number here as a comment? (see the test after this one for an example)

Author

Of course

Author

Added now.

@@ -1273,52 +1273,57 @@ def _parsed_string_to_bounds(self, reso, parsed):
lower, upper: pd.Timestamp

"""
if parsed.tzinfo is None:
Contributor

so you need to do something different here. Leave the dates that are generated in the current tz (e.g. self.tz). Then you need to convert it to the parsed tz, BUT then you need to localize back into the original timezone. There are a number of cases; here's an example.

# index in US/Eastern, parsed in UTC
In [22]: pd.Timestamp('20130101', tz='US/Eastern')
Out[22]: Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern')

In [23]: pd.Timestamp('20130101', tz='US/Eastern').tz_convert('UTC')
Out[23]: Timestamp('2013-01-01 05:00:00+0000', tz='UTC')

In [24]: pd.Timestamp('20130101', tz='US/Eastern').tz_convert('UTC').tz_localize(None).tz_localize('US/Eastern')
Out[24]: Timestamp('2013-01-01 05:00:00-0500', tz='US/Eastern')

# index is naive, parsed is UTC, in effect no change here
In [25]: pd.Timestamp('20130101')
Out[25]: Timestamp('2013-01-01 00:00:00')

In [26]: pd.Timestamp('20130101').tz_localize('UTC').tz_localize(None)
Out[26]: Timestamp('2013-01-01 00:00:00')

As I am writing this, it looks overly complicated. I might choose instead to raise if the timezones don't match (they can be same tz or both None).

Author

As I elaborated in #16785, the timezones do not need to match. There are quite common cases with daylight saving time that require some flexibility here. So my idea is to change target_tz = parsed.tzinfo to the kind of conversion you mentioned. Next week I might give it a shot. Dealing only with the timezone of the DatetimeIndex seems reasonable.

Author

I am really tired but tried to find a solution. If it works, great; if not, at least it shows the idea. I will come back to it as soon as possible. Right now I am having some issues with installing the development environment under Windows 10, and all the solutions I found look quite time-consuming. Sorry if I spam you with non-working code.

@jreback added the labels Indexing (Related to indexing on series/frames, not to indexes themselves), Datetime (Datetime data dtype), and Timezones (Timezone data dtype) on Oct 20, 2017
@jreback
Contributor

jreback commented Oct 28, 2017

pls rebase when you can

@pep8speaks

pep8speaks commented Oct 31, 2017

Hello @1kastner! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 26, 2017 at 22:11 UTC

@1kastner
Author

I did a rebase and some more minor adjustments.

jbrockmendel and others added 17 commits November 5, 2017 11:00
closes pandas-dev#17979

Author: sfoo <sfoohei@gmail.com>
Author: Jeff Reback <jeff@reback.net>

Closes pandas-dev#17996 from GuessWhoSamFoo/groupby_tuples and squashes the following commits:

afb0031 [Jeff Reback] TST: separate out grouping-type tests
c52b2a8 [sfoo] Moved notes to 0.22; created is_axis_multiindex var - pending internal use
fb52c1c [sfoo] Added whatsnew; checked match_axis_length
99ebc4e [sfoo] Cast groupby tuple as list when multiindex
@1kastner
Author

@jreback ping on update

@1kastner
Author

The failing test I accidentally got from the master branch.

@jbrockmendel
Member

@jreback pls hold off on this pending resolution of #18435.

@jreback
Contributor

jreback commented Nov 24, 2017

@jbrockmendel your discussion w.r.t. this issue is misplaced. We are talking about how to interpret partial strings when you already know the underlying details about the tz's in question. This is just user convenience and is completely orthogonal to the other issues you have raised.

@jbrockmendel
Member

This is just user convenience and is completely orthogonal to other issues you have raised.

I understand the convenience issue. Consider the two types of slicing/indexing: series[lower:upper] and series[(series.index >= lower) & (series.index <= upper)]. This equivalence breaks the orthogonality. If you want to break the equivalence that's fine, I just want that design decision to be made explicit.
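The equivalence being discussed can be sketched as a small runnable example (the index, bounds, and values here are illustrative, not taken from the PR):

```python
import pandas as pd

# A tz-naive hourly index; the two selection styles should agree.
idx = pd.date_range("2016-01-01", periods=10, freq="h")
s = pd.Series(range(10), index=idx)

lower = pd.Timestamp("2016-01-01 02:00")
upper = pd.Timestamp("2016-01-01 05:00")

# Label-based slicing (inclusive of both end-points) ...
sliced = s[lower:upper]
# ... selects the same rows as an explicit boolean mask.
masked = s[(s.index >= lower) & (s.index <= upper)]

assert sliced.equals(masked)
```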

@1kastner
Author

@jreback Is this pull request ready to go?

@@ -236,7 +236,8 @@ def test_stringified_slice_with_tz(self):
start = datetime.datetime.now()
idx = DatetimeIndex(start=start, freq="1d", periods=10)
df = DataFrame(lrange(10), index=idx)
df["2013-01-14 23:44:34.437768-05:00":] # no exception here
with tm.assert_produces_warning(UserWarning):
df["2013-01-14 23:44:34.437768-05:00":] # no exception here
Contributor

I would remove the # no exception here and add a comment about the warning that is produced

Author

I am not sure whether that is a good idea. That comment is not mine and it is not related to my code. The proposed refactoring is beyond the scope of this pull request.

@1kastner
Author

1kastner commented Nov 26, 2017 via email

@jreback
Contributor

jreback commented Nov 26, 2017

I am not sure whether that is a good idea. That comment is not mine and it is not related to my code. The proposed refactoring is beyond the scope of this pull request.

@1kastner not sure what you are referring to. I had only 1 more comment (small). @jbrockmendel's concerns will be addressed elsewhere.

@jbrockmendel
Member

@jreback I think the comment @1kastner was referring to was the # no exception code comment you asked to have removed. Ambiguity between code comment and GH comment.

I would remove the # no exception here

@jreback
Contributor

jreback commented Nov 26, 2017

if this is what you were referring to

I would remove the # no exception here

then let's add a comparison for that test (to ensure it returns the correct thing).
and add a comment as to why there is a warning.

as to being 'out of scope', well that's just it. Since this code needed to be touched, it also needs updating. Just because there is a comment that is nonsensical doesn't mean we leave technical debt in place.

@1kastner
Author

@jreback This test checks whether no exception is thrown. Actually it is better to completely remove this test because whatever it tests is covered by my new tests as well

@jreback
Contributor

jreback commented Nov 26, 2017

@jreback This test checks whether no exception is thrown. Actually it is better to completely remove this test because whatever it tests is covered by my new tests as well

that would be fine

@@ -557,6 +557,50 @@ We are stopping on the included end-point as it is part of the index
dft2 = dft2.swaplevel(0, 1).sort_index()
dft2.loc[idx[:, '2013-01-05'], :]

.. versionadded:: 0.21.1
Contributor

add a sub-section label here (with a ref), call it something like slicing with timezones.

# GH 6785
# timezone was ignored when string was provided as a label

first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
Contributor

I would really like to parametrize these to avoid the code repetition. So I think you can do it with 2 test functions: one which slices and compares with an expected, and a 2nd which checks for the warnings (you can actually do it with one if you add some more parameters)

something like

@pytest.mark.parametrize("tz, start, end",......)
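A hedged sketch of what such a parametrized test could look like; the test name, timezones, and index layout are illustrative, and Timestamps are used for the bounds so the example does not depend on this PR's string parsing:

```python
import pandas as pd
import pytest

# Hypothetical parametrized test: slice a tz-aware frame and compare
# against an independently constructed 'expected'.
@pytest.mark.parametrize("tz", ["UTC", "US/Eastern"])
def test_slice_with_tz_aware_bounds(tz):
    idx = pd.date_range("2016-01-01", periods=240, freq="min", tz=tz)
    df = pd.DataFrame({"value": range(240)}, index=idx)
    start = pd.Timestamp("2016-01-01 02:00", tz=tz)
    end = pd.Timestamp("2016-01-01 02:03", tz=tz)
    result = df.loc[start:end]  # label slicing is end-inclusive
    expected = df.iloc[120:124]  # same four rows, built positionally
    pd.testing.assert_frame_equal(result, expected)
```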

Author

I think this contradicts the idea of having the df.iloc check suggested by @jorisvandenbossche because that is rather specific. I would rather delete the non-naive UTC test because the CET test shows much more.

Member

You should be able to write the different strings so that they give the same expected frame I think

@jorisvandenbossche (Member) left a comment

I would leave this for 0.22.0 instead of 0.21.1 (since it raises some discussion, I think it is good to have it in master for some time longer)

I think I am in general fine with the changes in this PR (as it is aligning df[lb:up] with df[df.index >= lb & df.index <= up], which already worked with timezones), but I agree with @jbrockmendel that the discussion in #18435 is relevant. I think that we should try to have comparisons, scalar element accessing and partial slicing all work consistently (which is the case in this PR, so I am fine with going forward with this PR if it is for 0.22.0, and we can continue to discuss the broader aspect on the other issue).
Something else related to this:

  • If we add a warning when doing partial string indexing with aware string on naive index, we should do the same for accessing a scalar with a string (eg df.loc["2016-01-01T00:00-02:00"] with your example in the docs)


.. note::

This both works with ``pd.Timestamp`` and strings
Member

I think this is a bit confusing here. This section is about "partial datetime string indexing", so for me it is confusing to mention Timestamp

Author

Please talk to @jreback who suggested to mention it. Actually it also works for datetime.datetime.

.. ipython:: python
:okwarning:

first_january_implicit_utc = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
Member

Can you make this a much shorter index? (you only need the first 10 to show the actual behaviour)
I would also try to use a shorter variable name here (eg idx_naive)

Author

It can be shortened but I would keep it a bit longer than the first 10 because of the comparison in the end.

Member

Which comparison?

Author

I thought (without carefully checking) that maybe in the end I would just compare two empty dataframes which would accidentally happen to be equal. To avoid such a false positive in the test, I thought having a somewhat longer df could be helpful.

Member

These are the docs, not tests. And you fully control what you do in the example, so you can just make it a bit longer than needed for the slicing to see the effect.


df

four_minute_slice = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"]
Member

This is actually not an example of partial datetime string indexing. The dataframe index has a frequency of minutes, and you provide strings with a minute resolution

Author

Yes you are right. What is the consequence in your eyes? I just want the timezones to work, that is my only desire.

@jorisvandenbossche (Member) commented Nov 27, 2017

There is no consequence for the behaviour, so this PR will fix your use case. But for the example in the docs, we should make a clear one. So I would either make this actual partial slicing, or move this section somewhere else.

Author

Then better move it, because the timezones cannot always be parsed; e.g. for months UTC will still be assumed, as that goes through another path.

Member

No, you can just edit the example a little bit. For example, keep the minute resolution and use strings with only hours (instead of the minutes now; that still provides the ability to specify a time zone), or change the resolution of the df to seconds and keep the strings as they are. Note you can use e.g. every 30s to avoid that selecting some minutes results in many rows.
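The first suggested variant (minute-resolution index, hour-resolution strings) could look like this; the index length, timezone, and values are illustrative:

```python
import pandas as pd

# Minute-resolution tz-aware index; the strings below are coarser (hours),
# so this is genuine partial string slicing. Naive partial strings are
# interpreted in the timezone of the index.
idx = pd.date_range("2016-01-01", periods=240, freq="min", tz="CET")
df = pd.DataFrame({"value": range(240)}, index=idx)

# Slice at hour resolution; the end hour "03" is included in full.
res = df["2016-01-01 02":"2016-01-01 03"]
```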

parsed = parsed.tz_convert(utc)
else:
if parsed.tz is None: # treat like in same timezone
parsed = parsed.tz_localize(self.tz)
Member

This case already worked before AFAIK, do you know why this is needed? (although the code seems logical)

Author

The code is not necessarily needed when it is done somewhere else


result = df[
"2016-01-01T00:00-02:00":"2016-01-01T02:03"
]
Member

Can you put this all on a single line?

pd.Timestamp("2016-01-01T02:03")
]

tm.assert_frame_equal(result, expected)
Member

Can you assert both results (with strings or with Timestamps) against an independently constructed one? (e.g. df.iloc[...])

Author

I am not sure what you mean with df.iloc[...] but generally speaking yes

Member

I mean to create 'expected' with something like df.iloc[120:124] (but then with the correct numbers)
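The idea of an independently constructed 'expected' can be sketched like this (tz-naive to stay within behaviour that exists independently of this PR; positions 120:124 are from this sketch, not from the PR's test data):

```python
import pandas as pd

# Minute-resolution naive index.
idx = pd.date_range("2016-01-01", periods=240, freq="min")
df = pd.DataFrame({"value": range(240)}, index=idx)

# String slicing is end-inclusive: minutes 02:00 through 02:03.
result = df["2016-01-01 02:00":"2016-01-01 02:03"]
# 'expected' built independently via positional indexing.
expected = df.iloc[120:124]

pd.testing.assert_frame_equal(result, expected)
```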

@@ -637,3 +637,66 @@ def test_partial_set_empty_frame_empty_consistencies(self):
df.loc[0, 'x'] = 1
expected = DataFrame(dict(x=[1], y=[np.nan]))
tm.assert_frame_equal(df, expected, check_dtype=False)

def test_access_timezoned_datetimeindex_with_timezoned_label_utc(self):
Member

I think this is the incorrect file for the tests, as the 'partial' in test_partial.py is not referring to partial datetime string indexing, but to partial setting or something.
There is an indexes/datetimes/test_partial_slicing.py that seems a better fit.

Author

Possible, this was a suggestion of @jreback.

Member

Yes, but I think it was a typo of @jreback and he forgot the _slicing part. You can look in the current file and see that there is no test related to datetime index slicing, so please make the change


def test_access_timezoned_datetimeindex_with_timezoned_label_utc(self):

# GH 6785
Member

I think this is an incorrect issue number

Author

oh let me have a look how that could happen

@jorisvandenbossche
Member

@1kastner Can you also test with an implicit partial string slice? (I mean a single string that represents a slice, without actually slicing, so eg df["2016-01-01T00-02:00"])

@jbrockmendel
Member

The slicing code goes through a different branch for monotonic vs non-monotonic DatetimeIndexes. Is there a test for this that goes through that path?
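A minimal sketch of what exercising the non-monotonic branch could look like (the dates and values are illustrative): on an unsorted DatetimeIndex, datetime-label lookup cannot use the binary-search slice path and instead returns all matching rows in their original order.

```python
import pandas as pd

# Deliberately unsorted index with a duplicate date.
idx = pd.DatetimeIndex(
    ["2016-01-02", "2016-01-01", "2016-01-03", "2016-01-01"]
)
s = pd.Series([1, 2, 3, 4], index=idx)
assert not idx.is_monotonic_increasing

# Goes through the non-monotonic lookup path: both matching rows,
# in original positional order.
matches = s.loc["2016-01-01"]
```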

@@ -1364,7 +1378,7 @@ def _parsed_string_to_bounds(self, reso, parsed):
st = datetime(parsed.year, parsed.month, parsed.day,
parsed.hour, parsed.minute, parsed.second,
parsed.microsecond)
return (Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz))
return Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz)
Member

Since you're already making edits here, there's a small bug-like issue that might be worth fixing. The day, hour, minute, and second cases don't have tz=self.tz passed to the upper half of the returned tuple.

Author

Thanks for pointing that out! For now the timezone is more or less ignored (or at least not considered well enough) in the later stages, but maybe one day that will change.

@1kastner
Author

As I need to focus on my work for now, I will not continue for a while. I have set that maintainers are allowed to edit the pull request in case of urgent need for action.

@jreback
Contributor

jreback commented Feb 11, 2018

can you rebase. let's see where we are on this.

@1kastner
Author

I just checked and it means a lot of merging. I do not have the resources to deal with this subject even though the problem itself still bugs me.

Labels: Datetime (Datetime data dtype), Indexing (Related to indexing on series/frames, not to indexes themselves), Timezones (Timezone data dtype)
Successfully merging this pull request may close these issues.

ERR: validate partial string indexing with tz-aware end-points