Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: round-tripping non-nano datetime64s with to_json/read_json #55827

Open
jbrockmendel opened this issue Nov 4, 2023 · 9 comments
Open
Labels
Bug IO JSON read_json, to_json, json_normalize Non-Nano datetime64/timedelta64 with non-nanosecond resolution

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Nov 4, 2023

import pandas as pd
import pandas._testing as tm
from io import StringIO

df = pd.DataFrame(
    {
        "A": pd.to_datetime(["2013-01-01", "2013-01-02"]).as_unit("s"),
        "B": [3.5, 3.5],
    }
)

written = df.to_json(orient="split")

>>> written
'{"A":{"0":1356,"1":1357},"B":{"0":3.5,"1":3.5}}'

result = pd.read_json(StringIO(written), orient="split", convert_dates=["A"])

>>> result
      A    B
0  1356  3.5
1  1357  3.5

tm.assert_frame_equal(result, df)   # <- fails

The example here is based on test_frame_non_unique_columns, altered by 1) making the columns into ["A", "B"] and 2) changing the dtype for the first column from M8[ns] to M8[s].

This goes through a check in _try_convert_to_date:

        # ignore numbers that are out of range
        if issubclass(new_data.dtype.type, np.number):
            in_range = (
                isna(new_data._values)
                | (new_data > self.min_stamp)
                | (new_data._values == iNaT)
            )
            if not in_range.all():
                return data, False

when the json is produced from M8[s] (or M8[ms]) data, these values are all under self.min_stamp, so this check causes us to short-circuit and not go through the pd.to_datetime conversion that comes just after (which itself looks sketchy but that can wait for another day).

cc @WillAyd my best guess is that there is nothing we can do at the reading stage and we should convert non-nano to nano at the writing stage, or maybe just warn users that they are doing something that doesn't round-trip?

Surfaced while implementing #55564 (which will cause users to get non-nano in many cases where they currently get nano).

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 4, 2023
@WillAyd
Copy link
Member

WillAyd commented Nov 4, 2023

Hmm shouldn't to_json be writing this as a date to begin with? I would be hesitant to make any guarantees on lossless-ness if we are just writing out the integer in to_json

@jbrockmendel
Copy link
Member Author

shouldn't to_json be writing this as a date to begin with?

no idea about the "should", but it isn't. still integers if we don't do the .as_unit (though surprisingly, not quite just multiply-by-10**9)

@WillAyd
Copy link
Member

WillAyd commented Nov 4, 2023

Gotcha. Well I think that's a bug we need to investigate more with to_json, though ultimately even when we write dates I think we'd have the question as to what precision to read in since that is not going to be embedded in the JSON.

My initial thought is that we should just stick with ns as the default unless a user cares to specify. The alternative I think is to find the best precision to fit a date as it is read in, but I'd be happy to defer that until someone really requests it

@rhshadrach rhshadrach added Non-Nano datetime64/timedelta64 with non-nanosecond resolution IO JSON read_json, to_json, json_normalize labels Nov 5, 2023
@WillAyd
Copy link
Member

WillAyd commented Dec 1, 2023

Coming back here after @jbrockmendel comment #55901 (comment) - thanks for bringing this full circle

OK my current point of view with more context is when we write integral values with to_json we should stick with the "historic precedent" of nanosecond.

Ideally a user writes ISO strings, but when not I think there are too many different ways to interpret this data on a roundtrip, none seemingly better or worse than the other. So in that case would just like to stick with how this has implicitly "worked" for quite some time

@lithomas1
Copy link
Member

Does #53757 help in any way?

(I think I implemented this for to_json but not read_json).

@WillAyd
Copy link
Member

WillAyd commented Dec 3, 2023

I don't think there is a way to add this to read_json without the presence of additional metadata

@lithomas1
Copy link
Member

Ah, what about the ISO string case?

We should be able to infer then, I think.

@WillAyd
Copy link
Member

WillAyd commented Dec 3, 2023

I was only thinking about the integer case outlined in the OP, but that's a good point on the string roundtripping. Looks like ISO 8601 allows decimal fractions that can be used to determine the unit

@jbrockmendel
Copy link
Member Author

Have punted on this by xfailing the relevant test in #55901.

@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jan 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize Non-Nano datetime64/timedelta64 with non-nanosecond resolution
Projects
None yet
Development

No branches or pull requests

4 participants