[Python] date64 arrays do not round-trip through pandas conversion #38050
Comments
Interesting. This is because the … After round-tripping, the data has become more "correct":
But as long as we allow storing milliseconds that are not a multiple of a single day, we should also ignore those sub-day milliseconds in operations like equality. For example, … Our format spec says:
So the question is whether we should always truncate the values when creating, or rather deal with sub-day milliseconds later on.
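To make the "truncate when creating" option concrete, here is a minimal sketch in plain Python. It does not call PyArrow; the helper name `truncate_to_day` and the constant `MS_PER_DAY` are made up for illustration, and it assumes date64 values are milliseconds since the UNIX epoch. Whether a real constructor should floor (as here) or round toward zero for pre-epoch values is left open.

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day


def truncate_to_day(ms: int) -> int:
    """Floor a millisecond count to the start of its day."""
    # Floor division rounds toward negative infinity, so values before
    # the epoch also land on the start of their day.
    return (ms // MS_PER_DAY) * MS_PER_DAY


# Hypothetical raw input: some values carry sub-day milliseconds.
raw = [0, 1, MS_PER_DAY + 5, -1]
truncated = [truncate_to_day(v) for v in raw]
```

After this step every stored value is a multiple of `MS_PER_DAY`, so later comparisons would not need to special-case sub-day remainders.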
I see! Thanks for the thorough explanation. My opinion is that logical data types like date64 should be a semantic layer on top of the physical data. I think that PyArrow should accept the possibility that the physical data doesn't conform to its semantic expectations, so it should be able to work with data with sub-day milliseconds, especially if they come from some foreign, non-pyarrow source. I think that means that equality should be changed, like you say, since that's a semantic statement. But always truncating the physical data seems too extreme - I'd prefer that PyArrow preserve whatever it was given. Maybe constructors from "raw" sources (Python lists, maybe pandas Series) should truncate, though? Anyway - I think I agree that the compute logic should change. It seems likely that many compute operations would need to change, though. For example, all the hash operations - would we need to always truncate before any compute operator is applied?
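The alternative described above (keep the physical data, change the semantics) could look roughly like this in plain Python. This is only a sketch of the idea, not how any Arrow kernel is implemented; `day_index` and `date64_equal` are hypothetical names, and values are assumed to be milliseconds since the UNIX epoch.

```python
MS_PER_DAY = 86_400_000  # milliseconds in one day


def day_index(ms: int) -> int:
    """Map a millisecond value to the day it falls in."""
    return ms // MS_PER_DAY


def date64_equal(a: int, b: int) -> bool:
    """Compare two values as dates, ignoring sub-day milliseconds."""
    return day_index(a) == day_index(b)


# The stored values are left untouched; only the comparison key changes.
stored = [MS_PER_DAY, MS_PER_DAY + 123, 2 * MS_PER_DAY]

# A hash-based operation (here: unique) would key on the day index too.
unique_days = {day_index(v) for v in stored}
```

The cost is visible even in this toy version: every comparison and every hash has to go through `day_index`, which is the per-kernel overhead raised in the question above.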
Yeah, and for that reason, it might make more sense to always truncate when constructing from external sources (even for NumPy arrays), so that within Arrow we can assume the value is always a multiple of a full day and don't have to check for this in every kernel.
Especially for constructors from Python objects, which wouldn't be zero-copy anyway (like your initial example of
Describe the bug, including details regarding any error messages, version, and platform.
In PyArrow, Date64Array values do not maintain precision when being loaded from pandas by `pa.array`.
For example, let's make a date64 array value, and convert it to a pandas Series, taking care to avoid using datetime objects:
If one prints `pc.subtract(date64_roundtripped, date64_array)`, you can see that they are different:
Note that this does not occur for date32:
It appears to me that `date64_pd` is just fine. It prints as this:
One hint at what's going on is to use `pa.Array.from_pandas`. That actually returns a `TimestampArray`:
The issue might be that conversion from TimestampArray to Date64 array drops precision, maybe.
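To illustrate the kind of precision loss suspected here, the following plain-Python sketch passes a millisecond value through a calendar date and back. It is an assumption about the mechanism, not a trace of what PyArrow actually does internally; `roundtrip_via_date` and `roundtrip_date32` are hypothetical helpers.

```python
from datetime import datetime, timedelta

MS_PER_DAY = 86_400_000  # milliseconds in one day
EPOCH = datetime(1970, 1, 1)


def roundtrip_via_date(ms: int) -> int:
    """Convert milliseconds to a calendar date and back to milliseconds."""
    # .date() discards the time of day, so sub-day milliseconds are lost.
    d = (EPOCH + timedelta(milliseconds=ms)).date()
    return (d - EPOCH.date()).days * MS_PER_DAY


def roundtrip_date32(days: int) -> int:
    """The same trip for a day count, which has no sub-day part to lose."""
    d = EPOCH.date() + timedelta(days=days)
    return (d - EPOCH.date()).days


original = MS_PER_DAY + 1  # one day plus one millisecond
roundtripped = roundtrip_via_date(original)
lost = original - roundtripped  # the sub-day remainder that was dropped
```

A day-count representation survives the same trip unchanged, which would be consistent with date32 not showing the problem.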
Component(s)
Python