Problem with datetime and HDFStore #809

Closed
mattias-lundell opened this issue Feb 21, 2012 · 8 comments
Labels
Datetime Datetime data dtype Enhancement

@mattias-lundell

When storing a DataFrame using HDFStore the datetime information is altered. My guess is that there is some problem with daylight saving and time zones when the DataFrame is loaded from the h5 file. An example:

In [1]: from pandas import *

In [2]: df = DataFrame([0,1], [datetime(2011, 3, 27, 2, 2, 2),datetime(2011, 3, 27, 3, 2, 2)])

In [3]: s = HDFStore("test.h5")

In [4]: s["test"] = df

In [5]: df
Out[5]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1

In [6]: s["test"]
Out[6]: 
                     0
2011-03-27 03:02:02  0
2011-03-27 03:02:02  1
@adamklein
Contributor

I get something else. What else is weird is DST is 3/13 in 2011, not 3/27.

What pandas.__version__ do you have, and which OS and Python? If Linux, what are your LC_* variables? (i.e., run set | grep "LC" in bash)

## -- End pasted text --

In [2]: df
Out[2]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1

In [3]: s["test"]
Out[3]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1

@mattias-lundell
Author

Aha, I thought that it was the same date across all countries but that was not the case. According to http://www.timeanddate.com/time/dst/2011.html there are several different dates.

Running locale says "en_US.utf8", but I am located in Sweden, where DST started on the 27th (probably something is broken in my locale). By the way, what is the correct behavior? Maybe I am just handling the data wrong.

I'm running Ubuntu 11.04, Python 2.7.1 and pandas-0.7.0rc1-py2.7-linux-x86_64.egg.

@mattias-lundell
Author

My use case is that I read data stored in CSV files and load it into DataFrames. So far, so good. The problem occurs when persisting a DataFrame in an h5 file: when I load the data back from the h5 file, I receive a DataFrame whose index contains duplicate entries.

@adamklein
Contributor

You are right about DST being different where you are :)

Since there is no timezone information attached, the principle of least surprise would suggest it should return exactly what you stored.

However, looking into pandas/io/pytables.py, going into storage, it does:

time.mktime(v.timetuple())
Docstring: Convert a time tuple in local time to seconds since the Epoch.

And coming out,

[datetime.fromtimestamp(v) for v in data]
Docstring:  timestamp[, tz] -> tz's local time from POSIX timestamp.

OK, so the problem is this: in your timezone, 2:02 on 3/27 is actually a non-existent time (the clock jumps from 2:00 straight to 3:00), so 2:02 gets mapped to 3:02. How your locale knows you are in Sweden, and how your POSIX API takes advantage of this, I have no idea. Can you prefilter the data before storing?

But even stranger, I cannot reproduce this on 3/13 in my own timezone (EST5EDT), where I should see exactly the same behavior.
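The store/load round trip described above can be sketched outside pandas with the same two standard-library calls. This is a minimal illustration, not the pytables.py code itself: the helper name roundtrip_local is hypothetical, and forcing the zone through the TZ environment variable assumes a Unix-like system with the "Europe/Stockholm" zone installed. What a non-existent local time maps to is decided by the platform's C library, so no particular output is claimed here.

```python
import os
import time
from datetime import datetime

# Assumption: a Unix-like system where time.tzset() exists and the
# "Europe/Stockholm" zone is installed; otherwise the local zone is left as-is.
if hasattr(time, "tzset"):
    os.environ["TZ"] = "Europe/Stockholm"
    time.tzset()


def roundtrip_local(dt):
    """Naive datetime -> epoch seconds via local time -> naive datetime."""
    seconds = time.mktime(dt.timetuple())   # interprets dt as local time
    return datetime.fromtimestamp(seconds)  # converts back to local time


# The two index values from the report; the first falls in the DST gap
# for Sweden, so it may not come back unchanged.
for dt in (datetime(2011, 3, 27, 2, 2, 2), datetime(2011, 3, 27, 3, 2, 2)):
    print(dt, "->", roundtrip_local(dt))
```

Times outside a DST transition should survive this round trip unchanged; only values inside the skipped hour (or the repeated hour in autumn) are at risk.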

@adamklein
Contributor

I assume one would want the option to save data in non-standard time (i.e., no daylight saving time even during the daylight saving period). I'll see whether this is easy.

@mattias-lundell
Author

Thank you for looking into this.

I would definitely benefit from having the option of storing and reading data from time zones other than my local one.

@adamklein
Contributor

I think the best way to deal with this for now is to set the timezone information on your original dates. I.e., if x is a datetime, x = x.replace(tzinfo=pytz.UTC). When it comes out the other side, it should conform to your local time properly. We should have improved time zone handling in 0.8 along with the datetime64 type.
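As a rough illustration of why tagging the dates with UTC removes the ambiguity, the sketch below converts through UTC instead of local time. It uses the standard library's timezone.utc as a stand-in for pytz.UTC, and the helper names to_epoch_utc and from_epoch_utc are hypothetical; it shows the idea only and says nothing about what HDFStore does internally.

```python
import calendar
from datetime import datetime, timezone


def to_epoch_utc(dt):
    """Treat a naive datetime as UTC and return epoch seconds."""
    aware = dt.replace(tzinfo=timezone.utc)  # stand-in for pytz.UTC
    return calendar.timegm(aware.utctimetuple())


def from_epoch_utc(seconds):
    """Epoch seconds -> timezone-aware datetime in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


index = [datetime(2011, 3, 27, 2, 2, 2), datetime(2011, 3, 27, 3, 2, 2)]
restored = [from_epoch_utc(to_epoch_utc(dt)) for dt in index]

for original, back in zip(index, restored):
    print(original, "->", back)
```

UTC has no daylight saving transitions, so every wall-clock value maps to exactly one instant and the two index entries stay distinct regardless of the machine's local zone.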

@wesm
Member

wesm commented May 13, 2012

Timestamp data is all represented internally as UTC (even though it may appear to be in one time zone vs. another) and should not have any locale issues in pandas 0.8.0. See #1232 re storing time zones in HDFStore, which will be done soon.

@wesm wesm closed this as completed May 13, 2012