Recognize timezoned labels when accessing dataframes. #17920
Conversation
Ouch, looks like some for me...
@@ -123,6 +123,30 @@ def test_consistency_with_tz_aware_scalar(self):
        result = df[0].at[0]
        assert result == expected

    def test_access_datetimeindex_with_timezoned_label(self):
Can you add the github issue number here as a comment? (see the test after this one for an example)
Of course
Added now.
pandas/core/indexes/datetimes.py
@@ -1273,52 +1273,57 @@ def _parsed_string_to_bounds(self, reso, parsed):
        lower, upper: pd.Timestamp

        """
        if parsed.tzinfo is None:
So you need to do something different here. Leave the dates that are generated in the current tz (e.g. self.tz). Then you need to convert them to the parsed_tz, BUT then you need to localize back into the original timezone. There are a number of cases; here's an example.
# index in US/Eastern, parsed in UTC
In [22]: pd.Timestamp('20130101',tz='US/Eastern')
Out[22]: Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern')
In [23]: pd.Timestamp('20130101',tz='US/Eastern').tz_convert('UTC')
Out[23]: Timestamp('2013-01-01 05:00:00+0000', tz='UTC')
In [24]: pd.Timestamp('20130101',tz='US/Eastern').tz_convert('UTC').tz_localize(None).tz_localize('US/Eastern')
Out[24]: Timestamp('2013-01-01 05:00:00-0500', tz='US/Eastern')
# index is naive, parsed is UTC, in effect no change here
In [25]: pd.Timestamp('20130101')
Out[25]: Timestamp('2013-01-01 00:00:00')
In [26]: pd.Timestamp('20130101').tz_localize('UTC').tz_localize(None)
Out[26]: Timestamp('2013-01-01 00:00:00')
As I am writing this, it looks overly complicated. I might choose instead to raise if the timezones don't match (they can be the same tz or both None).
As I elaborated in #16785, the timezones do not need to match. There are quite common cases with daylight saving time that require some flexibility here. So my idea is to change target_tz = parsed.tzinfo into some kind of conversion, which you mentioned. Next week I might give it a shot. Dealing only with the timezone of the DatetimeIndex seems reasonable.
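For illustration, a minimal sketch of the conversion idea being discussed; the helper name and the exact decision table are assumptions of mine, not pandas internals or necessarily the final behaviour of this PR.

import pandas as pd

def align_label_to_index_tz(parsed, index_tz):
    # hypothetical helper: express a parsed label in the index's timezone
    # before computing slice bounds
    if index_tz is None:
        # aware label on a naive index: drop the offset
        # (this is the case the PR warns about)
        if parsed.tzinfo is not None:
            return parsed.tz_localize(None)
        return parsed
    if parsed.tzinfo is None:
        # naive label on an aware index: assume it is given in the index tz
        return parsed.tz_localize(index_tz)
    # aware label on an aware index: convert the instant into the index tz
    return parsed.tz_convert(index_tz)

align_label_to_index_tz(pd.Timestamp('2016-01-01 10:00', tz='CET'), 'UTC')
# Timestamp('2016-01-01 09:00:00+0000', tz='UTC')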
I am really tired but tried to find a solution. If it works, great; if not, at least it shows the idea. I will come back to it as soon as possible. Right now I am having some issues with installing the development environment under Windows 10, and all the solutions I have found look quite time-consuming. Sorry if I spam you with non-working code.
pls rebase when you can
Hello @1kastner! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on November 26, 2017 at 22:11 UTC
I did a rebase and some more minor adjustments.
@jreback ping on update
The failing test is one I accidentally got from the master branch.
@jbrockmendel your discussion w.r.t. this issue is misplaced. We are talking about how to interpret partial strings when you already know the underlying details about the tz's in question. This is just user convenience and is completely orthogonal to other issues you have raised.
I understand the convenience issue. Consider the two types of slicing/indexing
@jreback Is this pull request ready to go?
@@ -236,7 +236,8 @@ def test_stringified_slice_with_tz(self):
         start = datetime.datetime.now()
         idx = DatetimeIndex(start=start, freq="1d", periods=10)
         df = DataFrame(lrange(10), index=idx)
-        df["2013-01-14 23:44:34.437768-05:00":]  # no exception here
+        with tm.assert_produces_warning(UserWarning):
+            df["2013-01-14 23:44:34.437768-05:00":]  # no exception here
I would remove the "# no exception here" comment and add a comment about the warning that is produced.
I am not sure whether that is a good idea. That comment is not mine and it is not related to my code. The proposed refactoring is beyond the scope of this pull request.
*EDIT*: I published this via my email client, and as @jbrockmendel pointed out, my comment was meant to be for your change request.
@1kastner not sure what you are referring to. I had only 1 more comment (small). @jbrockmendel's concerns will be addressed elsewhere.
If this is what you were referring to, then let's add a comparison for that test (to ensure it returns the correct thing). As for it being 'out of scope': since this code needed to be touched, it also needs updating. Just because there is a comment that is nonsensical doesn't mean we leave technical debt in place.
@jreback This test checks that no exception is thrown. Actually, it is better to remove this test completely because whatever it tests is covered by my new tests as well.
that would be fine
@@ -557,6 +557,50 @@ We are stopping on the included end-point as it is part of the index
   dft2 = dft2.swaplevel(0, 1).sort_index()
   dft2.loc[idx[:, '2013-01-05'], :]

.. versionadded:: 0.21.1
Add a sub-section label here (with a ref); call it something like "slicing with timezones".
        # GH 6785
        # timezone was ignored when string was provided as a label

        first_january = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
I would really like to parametrize these to avoid the code repetition. So I think you can do it with 2 test functions: one which slices and compares with an expected, and a 2nd function which checks for the warnings (you can actually do it with one if you add some more parameters). Something like @pytest.mark.parametrize("tz, start, end", ......)
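For illustration, a rough sketch of the kind of parametrized test being suggested; the timezone values, slice bounds, positions and test name are placeholders of mine, not code from the PR, and the second case relies on the offset-aware label handling this PR adds.

import pandas as pd
import pandas.util.testing as tm
import pytest

@pytest.mark.parametrize("tz, start, end", [
    ('UTC', '2016-01-01T01:00', '2016-01-01T02:03'),
    ('CET', '2016-01-01T02:00+01:00', '2016-01-01T03:03+01:00'),
])
def test_slice_with_timezoned_labels(tz, start, end):
    # the index always covers the same instants; only its timezone differs,
    # and the labels above are chosen so that they denote the same instants
    idx = pd.date_range('2016-01-01T00:00', periods=240,
                        freq='T', tz='UTC').tz_convert(tz)
    df = pd.DataFrame(range(len(idx)), index=idx)
    result = df[start:end]
    expected = df.iloc[60:124]
    tm.assert_frame_equal(result, expected)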
I think this contradicts the idea of having the df.iloc check suggested by @jorisvandenbossche, because that is rather specific. I would rather delete the non-naive UTC test because the CET test shows much more.
You should be able to write the different strings so that they give the same expected frame, I think.
I would leave this for 0.22.0 instead of 0.21.1 (since it raises some discussion, I think it is good to have it in master for some time longer).
I think I am in general fine with the changes in this PR (as it is aligning df[lb:up] with df[df.index >= lb & df.index <= up], which already worked with timezones), but I agree with @jbrockmendel that the discussion in #18435 is relevant. I think that we should try to have comparisons, scalar element accessing and partial slicing all work consistently (which is the case in this PR, so I am fine with going forward with this PR if it is for 0.22.0, and we can continue to discuss the broader aspect on the other issue).
Something else related to this:
- If we add a warning when doing partial string indexing with an aware string on a naive index, we should do the same for accessing a scalar with a string (eg df.loc["2016-01-01T00:00-02:00"] with your example in the docs)
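As a small sketch of the alignment described above, with made-up example data: comparison-based filtering with tz-aware bounds already works, and label-based slicing is meant to select the same rows. Timestamp bounds are used here as the already-supported case; under this PR the equivalent offset-carrying strings should behave the same way.

import pandas as pd

idx = pd.date_range('2016-01-01', periods=6, freq='H', tz='UTC')
df = pd.DataFrame({'x': range(6)}, index=idx)

lb = pd.Timestamp('2016-01-01 03:00', tz='CET')   # 02:00 UTC
ub = pd.Timestamp('2016-01-01 05:00', tz='CET')   # 04:00 UTC

by_slice = df[lb:ub]                                 # label-based slicing
by_mask = df[(df.index >= lb) & (df.index <= ub)]    # boolean comparison

assert by_slice.equals(by_mask)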
.. note::

   This both works with ``pd.Timestamp`` and strings
I think this is a bit confusing here. This section is about "partial datetime string indexing", so for me it is confusing to mention Timestamp.
Please talk to @jreback, who suggested to mention it. Actually it also works for datetime.datetime.
.. ipython:: python
   :okwarning:

   first_january_implicit_utc = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59',
Can you make this a much shorter index? (You only need the first 10 to show the actual behaviour.) I would also try to use a shorter variable name here (eg idx_naive).
It can be shortened, but I would keep it a bit longer than the first 10 because of the comparison at the end.
Which comparison?
I thought (without carefully checking) that maybe in the end I would just compare two empty dataframes which accidentally happen to be equal. To avoid such a false positive test, I thought having a slightly longer df could be helpful.
These are the docs, not tests. And you fully control what you do in the example, so you can just make it a bit longer than needed for the slicing to see the effect.
   df

   four_minute_slice = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"]
This is actually not an example of partial datetime string indexing. The dataframe index has a frequency of minutes, and you provide strings with minute resolution.
Yes, you are right. What is the consequence in your eyes? I just want the timezones to work; that is my only desire.
There is no consequence for the behaviour, so this PR will fix your use case. But for the example in the docs, we should make a clear one. So I would either make this actual partial slicing, or move this section somewhere else.
Then it is better to move it, because the timezones cannot always be parsed; e.g. for months UTC will still be assumed, as that goes through another path.
No, you can just edit the example a little bit. For example, keep the minute resolution and use strings with only hours (instead of the minutes now; that still provides the ability to specify a time zone), or change the resolution of the df to seconds and keep the strings as they are. Note you can use e.g. a 30s frequency so that selecting a few minutes does not result in too many rows.
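For illustration, a sketch of the reshaped docs example being suggested (a 30-second index sliced with minute-resolution strings, so the selection is genuinely partial); the variable names and values below are placeholders of mine.

import pandas as pd

# 30-second frequency, so a minute-resolution string is a partial label
idx = pd.date_range('2016-01-01 00:00', periods=20, freq='30S', tz='CET')
df = pd.DataFrame({'x': range(len(idx))}, index=idx)

# minute-resolution strings select whole 30-second buckets
df['2016-01-01 00:02':'2016-01-01 00:04']

# with the behaviour discussed in this PR, the strings may also carry an
# explicit offset, e.g.
# df['2016-01-01 00:02+01:00':'2016-01-01 00:04+01:00']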
            parsed = parsed.tz_convert(utc)
        else:
            if parsed.tz is None:  # treat like in same timezone
                parsed = parsed.tz_localize(self.tz)
This case already worked before AFAIK, do you know why this is needed? (although the code seems logical)
The code is not necessarily needed if the same thing is done somewhere else.
        result = df[
            "2016-01-01T00:00-02:00":"2016-01-01T02:03"
        ]
Can you put this all on a single line?
            pd.Timestamp("2016-01-01T02:03")
        ]

        tm.assert_frame_equal(result, expected)
Can you assert both results (with strings or with Timestamps) against an independently constructed one? (eg df.iloc[...])
I am not sure what you mean with df.iloc[...], but generally speaking, yes.
I mean to create 'expected' with something like df.iloc[120:124] (but then with the correct numbers).
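For illustration, a sketch of what is being asked for. The index construction is assumed from the docs example in this PR, the iloc positions match the labels below, and the string-based slice with an explicit offset relies on the behaviour this PR adds.

import pandas as pd
import pandas.util.testing as tm

idx = pd.date_range('2016-01-01T00:00', '2016-01-01T23:59', freq='T', tz='UTC')
df = pd.DataFrame(range(len(idx)), index=idx)

# independently constructed expectation: 02:00 UTC up to and including 02:03 UTC
expected = df.iloc[120:124]

result_str = df["2016-01-01T00:00-02:00":"2016-01-01T02:03"]
result_ts = df[pd.Timestamp("2016-01-01T00:00-02:00"):
               pd.Timestamp("2016-01-01T02:03", tz='UTC')]

tm.assert_frame_equal(result_str, expected)
tm.assert_frame_equal(result_ts, expected)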
@@ -637,3 +637,66 @@ def test_partial_set_empty_frame_empty_consistencies(self):
        df.loc[0, 'x'] = 1
        expected = DataFrame(dict(x=[1], y=[np.nan]))
        tm.assert_frame_equal(df, expected, check_dtype=False)

    def test_access_timezoned_datetimeindex_with_timezoned_label_utc(self):
I think this is the incorrect file for the tests, as the "partial" in test_partial.py is not referring to partial datetime string indexing, but to partial setting or something. There is an indexes/datetimes/test_partial_slicing.py that seems a better fit.
Possible; this was a suggestion of @jreback.
Yes, but I think it was a typo of @jreback and he forgot the _slicing part. You can look in the current file and see that there is no test related to datetime index slicing, so please make the change.
    def test_access_timezoned_datetimeindex_with_timezoned_label_utc(self):

        # GH 6785
I think this is an incorrect issue number
Oh, let me have a look at how that could happen.
@1kastner Can you also test with an implicit partial string slice? (I mean a single string that represents a slice, without actually slicing, so eg
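Presumably this refers to something like the following (my guess at the elided example): a single under-specified string used as a label, which implicitly selects a whole range of rows without writing an explicit slice.

import pandas as pd

idx = pd.date_range('2016-01-01', periods=48, freq='H', tz='UTC')
df = pd.DataFrame({'x': range(48)}, index=idx)

# a day-resolution string on an hourly index acts as an implicit slice:
# it selects all 24 rows of that day
df.loc['2016-01-01']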
The slicing code goes through a different branch for monotonic vs non-monotonic DatetimeIndexes. Is there a test for this that goes through that path?
@@ -1364,7 +1378,7 @@ def _parsed_string_to_bounds(self, reso, parsed):
             st = datetime(parsed.year, parsed.month, parsed.day,
                           parsed.hour, parsed.minute, parsed.second,
                           parsed.microsecond)
-            return (Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz))
+            return Timestamp(st, tz=self.tz), Timestamp(st, tz=self.tz)
Since you're already making edits here, there's a small bug-like issue that might be worth fixing: the day, hour, minute, and second cases don't have tz=self.tz passed to the upper half of the returned tuple.
Thanks for pointing that out! For now the timezone is more or less ignored (or at least not considered well enough) in the later stages, but maybe one day that will change.
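For illustration, a minimal sketch of the remark above about tz=self.tz (not the real pandas internals): both halves of the returned (lower, upper) bounds should carry the index timezone, shown here for a label parsed at day resolution.

from datetime import datetime
from pandas import Timestamp, Timedelta

def day_bounds(parsed, tz):
    # bounds of the day containing `parsed`, both localized to `tz`
    lower = Timestamp(datetime(parsed.year, parsed.month, parsed.day), tz=tz)
    upper = lower + Timedelta(days=1) - Timedelta(nanoseconds=1)
    return lower, upper

day_bounds(Timestamp('2013-01-01 10:30'), 'US/Eastern')
# (Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
#  Timestamp('2013-01-01 23:59:59.999999999-0500', tz='US/Eastern'))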
As I need to focus on my work for now, I will not continue for a while. I have set the option that allows maintainers to edit the pull request in case of urgent need for action.
Can you rebase? Let's see where we are on this.
I just checked and it means a lot of merging. I do not have the resources to deal with this subject, even though the problem itself still bugs me.
git diff upstream/master -u -- "*.py" | flake8 --diff