ENH: Ability to tz localize when index is implicility in tz #4706

rockg · 2013-08-29T12:40:13Z

Fix to issue #4230 which allows to localize an index which is
implicitly in a tz (e.g., reading from a file) by passing imply_dst to
tz_localize.

jtratner · 2013-08-30T21:08:21Z

pandas/tseries/tests/test_timezones.py

@@ -360,6 +360,22 @@ def test_with_tz_ambiguous_times(self):
        dr = date_range(datetime(2011, 3, 13), periods=48,
                        freq=datetools.Minute(30), tz=pytz.utc)

+    def test_with_tz_ambiguous_times_imply_dst(self):


It would be nice to change the name of this test to test_imply_dst and add 1-2 test cases for when imply_dst is set to True but it's not actually ambiguous. [Changing the name of the test just makes it easier to find the specific test case and understand what it's doing].

jtratner · 2013-08-30T21:15:11Z

@nehalecky - any comments on this?

jtratner · 2013-08-30T21:38:57Z

To me, imply_dst is confusing as a term - it's not clear to me this would do, particularly because it looks like the equivalent pytz keyword accepts either True or False. I think you might mean that the dst is implied from the data, in which case infer_dst would make more sense as the keyword argument.

It looks like pytz calls the marker for this is_dst:

The is_dst parameter is ignored for most timestamps. It is only used during DST transition ambiguous periods to resulve that ambiguity.

If you have some logic that somehow decides whether or not to infer dst (which I think is what you're doing), you could call it infer_dst and specify how you decide whether it should be is_dst=True or is_dst=False. It looks like you're inferring it based upon looking for ordering (i.e., occurring every hour or something) - if that's the case, it would be useful to add a minimal example to the docs that shows how pandas is or is not able to infer timezone. (i.e., something around here - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-zone-handling )

If imply_dst means that you do the equivalent of parsing every datetime as if you had passed is_dst=True to the pytz constructor, then it would make more sense to allow the value to be either None, True (or truthy) or False and then have it parse with the equivalent of is_dst=False, is_dst=True passed to the pytz constructor. Probably would also make sense to change the name to is_dst and use something like the quoted doc above to explain it.

rockg · 2013-08-31T00:55:23Z

On the exception I was following the code that was already there, however it appears from the pytz source that it just subclasses exception so I will enhance the message.

"infer" is the right verb here. I will make those changes. I was planning on adding the docs, both the api and tz handling sections. Also, it seems like the api docs are lacking mention of DatetimeIndex. Is that intentional or just waiting for somebody to add it?

On your comment above about adding a comment on the line, was there something in particular you want clarified? If I had to guess it would be the trans_idx line, but want to confirm.

jtratner · 2013-08-31T01:09:42Z

On the exception I was following the code that was already there, however it appears from the pytz source that it just subclasses exception so I will enhance the message.

Thanks - there are a bunch of Exception messages that could do with a rewording - it's much more helpful to have a useful message.

I was planning on adding the docs, both the api and tz handling sections. Also, it seems like the api docs are lacking mention of DatetimeIndex. Is that intentional or just waiting for somebody to add it?

I'm pretty sure it's not intentional. If you want to contribute docs for that - it would be great (though maybe you want to put that in a separate PR, just for clarity).

On your comment above about adding a comment on the line, was there something in particular you want clarified? If I had to guess it would be the trans_idx line, but want to confirm.

I just was following everything up to that point - it combines so many things that it's unclear to me. Your call if you want to add a comment explaining what it's doing or not [definitely would want a different set of eyes than mine to look at it then]. I mean this line: trans_idx = np.squeeze(np.nonzero(np.logical_and(both_nat, ~both_eq)))

jtratner · 2013-08-31T01:10:36Z

heh, I got the line wrong too - meant this one trans_grp = np.array_split(trans_idx, np.where(np.diff(trans_idx)!=1)[0]+1)

jtratner · 2013-08-31T01:12:31Z

Also - can you hook up Travis to this? https://github.com/pydata/pandas/blob/master/CONTRIBUTING.md#steps-to-enable-travis-ci

rockg · 2013-09-03T02:30:46Z

@jtratner - incorporated your comments. I wasn't sure how to add another pull request for the DatetimeIndex documentation (having an additional branch seemed excessive). Also, I couldn't think of more tests where the fall transition would be obvious without already having a tz-aware index, in which case it could never be localized. I'm open to suggestions for that. As far as I know I added travis, but perhaps I'm mistaken.

jtratner · 2013-09-03T02:32:13Z

@rockg you got Travis working :) and it definitely makes sense to add the docs with this PR. I'll take a look when I get a chance.

nehalecky · 2013-09-13T01:37:10Z

Hey @rockg + @jtratner

@rockg nice start to this! I had a .md doc going that was attempting to explain how we should be able to infer from a time series for the same reason... just no time to implement! I looked through your source and tests and had just a few thoughts / questions.

the `tz_localize` error

Attempts to localize time DatetimeIndex with df.index.tz_localize(tz) raise both NonExistentTimeError, and AmbiguousTimeError, (both sub classes of pytz.InvalidTimeError base class) and arise for different reasons.

Your PR seems like it only addresses the AmbiguousTimeError encountered during DST Fall transition, is that correct? I also have use cases in parsing time series from a local tz when I hit an NonExistentTimeError for local timestamps formats that can be encountered during a DST Spring transition. For example:
```
import pytz
tz = pytz.timezone('US/Eastern')
tz.localize(pd.datetime(2012, 3, 11, 2, 00), is_dst=None)
```
raises NonExistentTimeError. This is due to pyzt's terse interpretation of the Olsen timezone database, which states, explicitly, that the DST transition occurs at 02:00:00, (only allowing for timestamps advances like: 01:59:59 to 03:00:00). Thus, any formatting practice that instead performs an advance like 02:00:00 to 03:00:01, fails. This might seem like a unique case, but trust me, in my line of work, this type of local timestamp formatting can be found everywhere. :(
Does the PR handle timestamp frequencies less than hourly? In this case, the repeated timestamp count would be equal to the number of sub-hourly frequency (I saw a control flow for number of repeated hours?).

Hope that all makes sense! Let me know if you have any questions. Thanks for tackling this, it's going to be very useful!

rockg · 2013-09-13T02:22:03Z

You are correct that this only addresses the Ambiguous error that occurs in the fall transition. I understand your example (it's second-ending time rather than second-beginning, e.g., the time from 1:59:59 or 2:00:00 can really be represented as either 1:59:59 or 2:00:00 as it is really representing the span between). This is always a choice when determining how to represent times and it would be nice if pytz was more flexible. I don't have any easy thoughts here besides changing your times to be second-beginning which I'm sure you thought of.

Frequencies less than an hour are supported. The code just looks for a negative diff in times (i.e., a time repeated in the hour) and assumes that these are for the non-dst hour. If there is no such diff, then you still have ambiguous times. I think this logic will work for the vast majority of the cases. Can you point your reference to the number of repeated hours?

jreback · 2013-09-20T23:32:19Z

@rockg how's this coming ?

rockg · 2013-09-21T00:15:32Z

I believe it's all set. I incorporated @jtratner's comments and have been using it without issue.

jtratner · 2013-09-21T00:42:47Z

@nehalecky what do you think? you seem to have the best sense of this.

jreback · 2013-09-21T00:46:11Z

@rockg
can you add release note and a mention in v0.13.0 (maybe with a doc reference to the new doc section in timeseries.rst)?

and pls squash down to a smaller number of commits

rockg · 2013-09-21T00:55:34Z

Sure (and I'll also point to the new api documentation). Is there a help on how to combine the commits? Naively I could back out the current changes and recommit but that seems unnecessary and unwieldy.

jreback · 2013-09-21T00:58:37Z

@rockg

you can combine as you see fit....ideally just put the changes that go together, together. you can also put docs/tests/changes in separate. whatever seems logical.

just rebase

git rebase -i <prior to your first comit> then squash/fixup as needed

jtratner · 2013-09-21T01:04:30Z

@rockg to be clear, this means that you'll see something like this:

pick 3502b3b ENH: Add 'is_' method to Index for identity checks
pick b10b44f BUG: Fix copy s.t. it always copies index/columns.
pick e3b4b59 ENH: enable deep copy for sparse Series/Panel

# Rebase b037b0e..e3b4b59 onto b037b0e

change "pick" to "squash" and it'll be all good.

nehalecky · 2013-09-22T00:50:48Z

Hey all.

@rockg, thanks for your replies and again, nice job! While there is more that can be done with DST transition inference, I think that this is a great start and a very helpful feature! I look forward to its incorporation in pandas and hope to contribute to expanding its functionality in the future. :)

jtratner · 2013-09-22T01:12:19Z

after you rebase if @jreback and @nehalecky think it looks good, Ithink we can merge this.

rockg · 2013-09-23T03:38:32Z

@jtratner

Do I then "push" the rebased commits?

cpcloud · 2013-09-23T03:40:11Z

do

git push --force

rockg · 2013-09-23T12:42:30Z

Thanks @cpcloud

I think this is all in good order. There are two commits, one for the new DatetimeIndex api documentation and then another for the tz_localize change.

jreback · 2013-09-26T00:54:17Z

@rockg @cpcloud this looks ok to merge, yes?

cpcloud · 2013-09-26T01:26:08Z

doc/source/api.rst

+.. autosummary::
+   :toctree: generated/
+
+   DatetimeIndex.delete


can you take out the generic Index functions here and just leave the ones that are specific to DatetimeIndex? thanks.

The logical follow-on is that we should document Index as well.

rockg · 2013-09-27T01:53:34Z

@cpcloud @jreback

Updated commits based on comments and add Index documentation. I still have to do the vbench but it's difficult to construct a meaningful (one that has more than 5 points).

jreback · 2013-09-27T01:55:38Z

@rockg just run the current vbench to see if there is any degradation given the changes touch some general ts code (I don't think u will see anything though)

cpcloud · 2013-09-27T03:38:06Z

i meant a new vbench that specifically uses infer_dst bc of all the calls to the numpy Python API from Cython.

jreback · 2013-09-30T12:14:58Z

@rockg ping us when ready with the vbench..thxs

rockg · 2013-10-01T01:23:38Z

@jreback Think it's set. The documentation commit is unchanged.

jreback · 2013-10-01T17:04:43Z

@rockg can you rebase on master just to make sure.....we just merged in nanosecond support so making sure no issues

also can you post the vbench results?

rockg · 2013-10-01T18:08:09Z

@jreback Seems like rebasing is potentially destructive so want to make sure I'm doing it right. Would it be:

git checkout tz-localize-aware
git rebase master

And for vbench, do you want the entire test set pasted here? I ran some yesterday and noticed a couple ratios of 1.6 but it's not clear if that is a result of something unrelated to my changes or not.

jreback · 2013-10-01T18:23:43Z

git rebase -i upstream/master
git push yourfork this_branch_name -f

you may have merge conflicts which you need to resolve

you can just post anything > 1.5 or so (or if you think its somewhat related to your changes)

rockg · 2013-10-02T00:06:10Z

@jreback Need some help here:

git rebase -i upstream/master
fatal: Needed a single revision
invalid upstream upstream/master

Which sounds like an issue in that I have two commits vs one.

Nothing jumps out on the benchmark.

series_drop_duplicates_int                   |   1.3150 |   1.1647 |   1.1291 |
read_store                                   |   3.7667 |   3.3234 |   1.1334 |
stat_ops_level_frame_sum_multiple            |  17.6257 |  15.5220 |   1.1355 |
groupby_multi_different_functions            |  27.7166 |  23.6447 |   1.1722 |
reindex_fillna_pad_float32                   |   0.6173 |   0.5204 |   1.1863 |
reindex_multiindex                           |   2.2800 |   1.9213 |   1.1867 |
frame_repr_tall                              |   4.8803 |   4.1056 |   1.1887 |
frame_to_csv                                 | 293.2223 | 241.2977 |   1.2152 |

cpcloud · 2013-10-02T00:11:36Z

@rockg

# make sure u have a remote ref to upstream
git remote add upstream git://github.com/pydata/pandas.git
git fetch upstream
git rebase upstream/master  # put your changes on top of current master

# after that finishes
git rebase -i upstream/master  # manually squash/fixup/reword etc.

# force push to your remote branch (the PR)
git push --force

jreback · 2013-10-02T00:15:38Z

@rockg those benchmarks look fine....as soon as you rebase we can merge

cpcloud · 2013-10-02T00:15:54Z

you'll need to resolve the merge conflicts in doc/source/release.rst and doc/source/v0.13.0.txt after the first (automatic) rebase. then you can squash ur documentation commit and then force push and you should be good 2 merge.

Start documentation for DatetimeIndex and Index.

rockg · 2013-10-02T00:46:34Z

@cpcloud Thanks, I'm almost there. I believe I resolve the conflicts but in the GitHub Mac tool it shows as deleting a bunch of lines, but they are still in the file. Proceed?

Fix to issue pandas-dev#4230 which allows to localize an index which is implicitly in a tz (e.g., reading from a file) by passing infer_dst to tz_localize.

rockg · 2013-10-02T00:56:17Z

@cpcloud @jreback Thank you for the help. Think it's all good.

ENH: Ability to tz localize when index is implicility in tz

jtratner reviewed Aug 30, 2013
View reviewed changes

rockg mentioned this pull request Sep 12, 2013

API: add is_beg_month/quarter/year, is_end_month/quarter/year accessors (#4565) #4823

Merged

ghost assigned jtratner Sep 22, 2013

cpcloud reviewed Sep 26, 2013
View reviewed changes

jreback mentioned this pull request Sep 28, 2013

DOC: DataFrame() not included in API reference #4790

Closed

Add DatetimeIndex and Index documentation.

354f10a

Start documentation for DatetimeIndex and Index.

ENH: Ability to tz localize when index is implicility in tz

e5ea6c8

Fix to issue pandas-dev#4230 which allows to localize an index which is implicitly in a tz (e.g., reading from a file) by passing infer_dst to tz_localize.

jreback added a commit that referenced this pull request Oct 2, 2013

Merge pull request #4706 from rockg/tz-localize-aware

702e3bb

ENH: Ability to tz localize when index is implicility in tz

jreback merged commit 702e3bb into pandas-dev:master Oct 2, 2013

jreback mentioned this pull request Oct 2, 2013

ENH: Create ability to localize tz with data already in that tz (fall transition) #4230

Closed

rockg mentioned this pull request Jan 1, 2014

please add quick documentation on pandas timestamp values of month, day of week #5813

Closed

ENH: Ability to tz localize when index is implicility in tz #4706

ENH: Ability to tz localize when index is implicility in tz #4706

Conversation

rockg commented Aug 29, 2013

jtratner Aug 30, 2013

Choose a reason for hiding this comment

jtratner commented Aug 30, 2013

jtratner commented Aug 30, 2013

rockg commented Aug 31, 2013

jtratner commented Aug 31, 2013

jtratner commented Aug 31, 2013

jtratner commented Aug 31, 2013

rockg commented Sep 3, 2013

jtratner commented Sep 3, 2013

nehalecky commented Sep 13, 2013

the tz_localize error

rockg commented Sep 13, 2013

jreback commented Sep 20, 2013

rockg commented Sep 21, 2013

jtratner commented Sep 21, 2013

jreback commented Sep 21, 2013

rockg commented Sep 21, 2013

jreback commented Sep 21, 2013

jtratner commented Sep 21, 2013

nehalecky commented Sep 22, 2013

jtratner commented Sep 22, 2013

rockg commented Sep 23, 2013

cpcloud commented Sep 23, 2013

rockg commented Sep 23, 2013

jreback commented Sep 26, 2013

cpcloud Sep 26, 2013

Choose a reason for hiding this comment

rockg Sep 26, 2013

Choose a reason for hiding this comment

rockg commented Sep 27, 2013

jreback commented Sep 27, 2013

cpcloud commented Sep 27, 2013

jreback commented Sep 30, 2013

rockg commented Oct 1, 2013

jreback commented Oct 1, 2013

rockg commented Oct 1, 2013

jreback commented Oct 1, 2013

rockg commented Oct 2, 2013

cpcloud commented Oct 2, 2013

jreback commented Oct 2, 2013

cpcloud commented Oct 2, 2013

rockg commented Oct 2, 2013

rockg commented Oct 2, 2013

the `tz_localize` error