Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Ability to tz localize when index is implicility in tz #4706

Merged
merged 2 commits into from
Oct 2, 2013

Conversation

rockg
Copy link
Contributor

@rockg rockg commented Aug 29, 2013

Fix to issue #4230 which allows to localize an index which is
implicitly in a tz (e.g., reading from a file) by passing imply_dst to
tz_localize.

@@ -360,6 +360,22 @@ def test_with_tz_ambiguous_times(self):
dr = date_range(datetime(2011, 3, 13), periods=48,
freq=datetools.Minute(30), tz=pytz.utc)

def test_with_tz_ambiguous_times_imply_dst(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to change the name of this test to test_imply_dst and add 1-2 test cases for when imply_dst is set to True but it's not actually ambiguous. [Changing the name of the test just makes it easier to find the specific test case and understand what it's doing].

@jtratner
Copy link
Contributor

@nehalecky - any comments on this?

@jtratner
Copy link
Contributor

To me, imply_dst is confusing as a term - it's not clear to me this would do, particularly because it looks like the equivalent pytz keyword accepts either True or False. I think you might mean that the dst is implied from the data, in which case infer_dst would make more sense as the keyword argument.

It looks like pytz calls the marker for this is_dst:

The is_dst parameter is ignored for most timestamps. It is only used during DST transition ambiguous periods to resulve that ambiguity.

If you have some logic that somehow decides whether or not to infer dst (which I think is what you're doing), you could call it infer_dst and specify how you decide whether it should be is_dst=True or is_dst=False. It looks like you're inferring it based upon looking for ordering (i.e., occurring every hour or something) - if that's the case, it would be useful to add a minimal example to the docs that shows how pandas is or is not able to infer timezone. (i.e., something around here - http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-zone-handling )

If imply_dst means that you do the equivalent of parsing every datetime as if you had passed is_dst=True to the pytz constructor, then it would make more sense to allow the value to be either None, True (or truthy) or False and then have it parse with the equivalent of is_dst=False, is_dst=True passed to the pytz constructor. Probably would also make sense to change the name to is_dst and use something like the quoted doc above to explain it.

@rockg
Copy link
Contributor Author

rockg commented Aug 31, 2013

On the exception I was following the code that was already there, however it appears from the pytz source that it just subclasses exception so I will enhance the message.

"infer" is the right verb here. I will make those changes. I was planning on adding the docs, both the api and tz handling sections. Also, it seems like the api docs are lacking mention of DatetimeIndex. Is that intentional or just waiting for somebody to add it?

On your comment above about adding a comment on the line, was there something in particular you want clarified? If I had to guess it would be the trans_idx line, but want to confirm.

@jtratner
Copy link
Contributor

On the exception I was following the code that was already there, however it appears from the pytz source that it just subclasses exception so I will enhance the message.

Thanks - there are a bunch of Exception messages that could do with a rewording - it's much more helpful to have a useful message.

I was planning on adding the docs, both the api and tz handling sections. Also, it seems like the api docs are lacking mention of DatetimeIndex. Is that intentional or just waiting for somebody to add it?

I'm pretty sure it's not intentional. If you want to contribute docs for that - it would be great (though maybe you want to put that in a separate PR, just for clarity).

On your comment above about adding a comment on the line, was there something in particular you want clarified? If I had to guess it would be the trans_idx line, but want to confirm.

I just was following everything up to that point - it combines so many things that it's unclear to me. Your call if you want to add a comment explaining what it's doing or not [definitely would want a different set of eyes than mine to look at it then]. I mean this line: trans_idx = np.squeeze(np.nonzero(np.logical_and(both_nat, ~both_eq)))

@jtratner
Copy link
Contributor

heh, I got the line wrong too - meant this one trans_grp = np.array_split(trans_idx, np.where(np.diff(trans_idx)!=1)[0]+1)

@jtratner
Copy link
Contributor

@rockg
Copy link
Contributor Author

rockg commented Sep 3, 2013

@jtratner - incorporated your comments. I wasn't sure how to add another pull request for the DatetimeIndex documentation (having an additional branch seemed excessive). Also, I couldn't think of more tests where the fall transition would be obvious without already having a tz-aware index, in which case it could never be localized. I'm open to suggestions for that. As far as I know I added travis, but perhaps I'm mistaken.

@jtratner
Copy link
Contributor

jtratner commented Sep 3, 2013

@rockg you got Travis working :) and it definitely makes sense to add the docs with this PR. I'll take a look when I get a chance.

@nehalecky
Copy link
Contributor

Hey @rockg + @jtratner

@rockg nice start to this! I had a .md doc going that was attempting to explain how we should be able to infer from a time series for the same reason... just no time to implement! I looked through your source and tests and had just a few thoughts / questions.

the tz_localize error

Attempts to localize time DatetimeIndex with df.index.tz_localize(tz) raise both NonExistentTimeError, and AmbiguousTimeError, (both sub classes of pytz.InvalidTimeError base class) and arise for different reasons.

  • Your PR seems like it only addresses the AmbiguousTimeError encountered during DST Fall transition, is that correct? I also have use cases in parsing time series from a local tz when I hit an NonExistentTimeError for local timestamps formats that can be encountered during a DST Spring transition. For example:

    import pytz
    tz = pytz.timezone('US/Eastern')
    tz.localize(pd.datetime(2012, 3, 11, 2, 00), is_dst=None)

    raises NonExistentTimeError. This is due to pyzt's terse interpretation of the Olsen timezone database, which states, explicitly, that the DST transition occurs at 02:00:00, (only allowing for timestamps advances like: 01:59:59 to 03:00:00). Thus, any formatting practice that instead performs an advance like 02:00:00 to 03:00:01, fails. This might seem like a unique case, but trust me, in my line of work, this type of local timestamp formatting can be found everywhere. :(

  • Does the PR handle timestamp frequencies less than hourly? In this case, the repeated timestamp count would be equal to the number of sub-hourly frequency (I saw a control flow for number of repeated hours?).

Hope that all makes sense! Let me know if you have any questions. Thanks for tackling this, it's going to be very useful!

@rockg
Copy link
Contributor Author

rockg commented Sep 13, 2013

You are correct that this only addresses the Ambiguous error that occurs in the fall transition. I understand your example (it's second-ending time rather than second-beginning, e.g., the time from 1:59:59 or 2:00:00 can really be represented as either 1:59:59 or 2:00:00 as it is really representing the span between). This is always a choice when determining how to represent times and it would be nice if pytz was more flexible. I don't have any easy thoughts here besides changing your times to be second-beginning which I'm sure you thought of.

Frequencies less than an hour are supported. The code just looks for a negative diff in times (i.e., a time repeated in the hour) and assumes that these are for the non-dst hour. If there is no such diff, then you still have ambiguous times. I think this logic will work for the vast majority of the cases. Can you point your reference to the number of repeated hours?

@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

@rockg how's this coming ?

@rockg
Copy link
Contributor Author

rockg commented Sep 21, 2013

I believe it's all set. I incorporated @jtratner's comments and have been using it without issue.

@jtratner
Copy link
Contributor

@nehalecky what do you think? you seem to have the best sense of this.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

@rockg
can you add release note and a mention in v0.13.0 (maybe with a doc reference to the new doc section in timeseries.rst)?

and pls squash down to a smaller number of commits

@rockg
Copy link
Contributor Author

rockg commented Sep 21, 2013

Sure (and I'll also point to the new api documentation). Is there a help on how to combine the commits? Naively I could back out the current changes and recommit but that seems unnecessary and unwieldy.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

@rockg

you can combine as you see fit....ideally just put the changes that go together, together. you can also put docs/tests/changes in separate. whatever seems logical.

just rebase

git rebase -i <prior to your first comit> then squash/fixup as needed

@jtratner
Copy link
Contributor

@rockg to be clear, this means that you'll see something like this:

pick 3502b3b ENH: Add 'is_' method to Index for identity checks
pick b10b44f BUG: Fix copy s.t. it always copies index/columns.
pick e3b4b59 ENH: enable deep copy for sparse Series/Panel

# Rebase b037b0e..e3b4b59 onto b037b0e

change "pick" to "squash" and it'll be all good.

@nehalecky
Copy link
Contributor

Hey all.

@rockg, thanks for your replies and again, nice job! While there is more that can be done with DST transition inference, I think that this is a great start and a very helpful feature! I look forward to its incorporation in pandas and hope to contribute to expanding its functionality in the future. :)

@ghost ghost assigned jtratner Sep 22, 2013
@jtratner
Copy link
Contributor

after you rebase if @jreback and @nehalecky think it looks good, Ithink we can merge this.

@rockg
Copy link
Contributor Author

rockg commented Sep 23, 2013

@jtratner

Do I then "push" the rebased commits?

@cpcloud
Copy link
Member

cpcloud commented Sep 23, 2013

do

git push --force

@rockg
Copy link
Contributor Author

rockg commented Sep 23, 2013

Thanks @cpcloud

I think this is all in good order. There are two commits, one for the new DatetimeIndex api documentation and then another for the tz_localize change.

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

@rockg @cpcloud this looks ok to merge, yes?

.. autosummary::
:toctree: generated/

DatetimeIndex.delete
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you take out the generic Index functions here and just leave the ones that are specific to DatetimeIndex? thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logical follow-on is that we should document Index as well.

@rockg
Copy link
Contributor Author

rockg commented Sep 27, 2013

@cpcloud @jreback

Updated commits based on comments and add Index documentation. I still have to do the vbench but it's difficult to construct a meaningful (one that has more than 5 points).

@jreback
Copy link
Contributor

jreback commented Sep 27, 2013

@rockg just run the current vbench to see if there is any degradation given the changes touch some general ts code (I don't think u will see anything though)

@cpcloud
Copy link
Member

cpcloud commented Sep 27, 2013

i meant a new vbench that specifically uses infer_dst bc of all the calls to the numpy Python API from Cython.

@jreback
Copy link
Contributor

jreback commented Sep 30, 2013

@rockg ping us when ready with the vbench..thxs

@rockg
Copy link
Contributor Author

rockg commented Oct 1, 2013

@jreback Think it's set. The documentation commit is unchanged.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2013

@rockg can you rebase on master just to make sure.....we just merged in nanosecond support so making sure no issues

also can you post the vbench results?

@rockg
Copy link
Contributor Author

rockg commented Oct 1, 2013

@jreback Seems like rebasing is potentially destructive so want to make sure I'm doing it right. Would it be:

  1. git checkout tz-localize-aware
  2. git rebase master

And for vbench, do you want the entire test set pasted here? I ran some yesterday and noticed a couple ratios of 1.6 but it's not clear if that is a result of something unrelated to my changes or not.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2013

git rebase -i upstream/master
git push yourfork this_branch_name -f

you may have merge conflicts which you need to resolve

you can just post anything > 1.5 or so (or if you think its somewhat related to your changes)

@rockg
Copy link
Contributor Author

rockg commented Oct 2, 2013

@jreback Need some help here:

git rebase -i upstream/master
fatal: Needed a single revision
invalid upstream upstream/master

Which sounds like an issue in that I have two commits vs one.

Nothing jumps out on the benchmark.

series_drop_duplicates_int                   |   1.3150 |   1.1647 |   1.1291 |
read_store                                   |   3.7667 |   3.3234 |   1.1334 |
stat_ops_level_frame_sum_multiple            |  17.6257 |  15.5220 |   1.1355 |
groupby_multi_different_functions            |  27.7166 |  23.6447 |   1.1722 |
reindex_fillna_pad_float32                   |   0.6173 |   0.5204 |   1.1863 |
reindex_multiindex                           |   2.2800 |   1.9213 |   1.1867 |
frame_repr_tall                              |   4.8803 |   4.1056 |   1.1887 |
frame_to_csv                                 | 293.2223 | 241.2977 |   1.2152 |

@cpcloud
Copy link
Member

cpcloud commented Oct 2, 2013

@rockg

# make sure u have a remote ref to upstream
git remote add upstream git://github.com/pydata/pandas.git
git fetch upstream
git rebase upstream/master  # put your changes on top of current master

# after that finishes
git rebase -i upstream/master  # manually squash/fixup/reword etc.

# force push to your remote branch (the PR)
git push --force

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

@rockg those benchmarks look fine....as soon as you rebase we can merge

@cpcloud
Copy link
Member

cpcloud commented Oct 2, 2013

you'll need to resolve the merge conflicts in doc/source/release.rst and doc/source/v0.13.0.txt after the first (automatic) rebase. then you can squash ur documentation commit and then force push and you should be good 2 merge.

Start documentation for DatetimeIndex and Index.
@rockg
Copy link
Contributor Author

rockg commented Oct 2, 2013

@cpcloud Thanks, I'm almost there. I believe I resolve the conflicts but in the GitHub Mac tool it shows as deleting a bunch of lines, but they are still in the file. Proceed?

gitmadness

Fix to issue pandas-dev#4230 which allows to localize an index which is
implicitly in a tz (e.g., reading from a file) by passing infer_dst to
tz_localize.
@rockg
Copy link
Contributor Author

rockg commented Oct 2, 2013

@cpcloud @jreback Thank you for the help. Think it's all good.

jreback added a commit that referenced this pull request Oct 2, 2013
ENH: Ability to tz localize when index is implicility in tz
@jreback jreback merged commit 702e3bb into pandas-dev:master Oct 2, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants