
BUG: to_json not serializing non-nanosecond numpy dt64 correctly #53757

Merged: 13 commits into pandas-dev:main, Jul 25, 2023
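For context, the root cause behind this PR can be sketched with numpy alone: a non-nanosecond datetime64 stores its i8 epoch value in the dtype's own unit, so serializing that value as if it were nanoseconds misreports the timestamp by the unit factor.

```python
import numpy as np

# The stored i8 value is in the dtype's own unit; treating it as
# nanoseconds is off by the unit factor, which the serializer must
# account for when the resolution is not ns
ts = np.datetime64("2023-06-21", "s")
i8_seconds = ts.astype("int64")
i8_nanos = ts.astype("datetime64[ns]").astype("int64")
print(i8_seconds, i8_nanos)  # same instant, a factor of 10**9 apart
```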

Conversation

lithomas1
Member

@lithomas1 lithomas1 added the IO JSON read_json, to_json, json_normalize label Jun 21, 2023
castfunc(dataptr, &i8date, 1, NULL, NULL);
// copied from
// https://github.com/numpy/numpy/blob/c8fe278a754a271af57eaf6c7ffb2382e5a954f9/numpy/core/src/multiarray/datetime.c#L692-L701
dateUnit = ((PyArray_DatetimeDTypeMetaData *)dtype->c_metadata)->meta.base;
Member Author

This is probably the sketchiest part of this PR.

We currently do it exactly the way numpy does internally, but I'm not sure whether this is part of the public numpy C API.

FWIW, PyArray_DatetimeDTypeMetaData appears in ndarraytypes.h.
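For reference, numpy does expose this same (unit, count) metadata at the Python level via `np.datetime_data`, which reads the dtype metadata the C code is reaching into here; a quick sketch:

```python
import numpy as np

# np.datetime_data exposes the (unit, count) pair stored in the
# datetime64 dtype's metadata (what the C code reads via c_metadata)
dt = np.dtype("datetime64[ms]")
unit, count = np.datetime_data(dt)
print(unit, count)  # ms 1
```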

Member

It would be better if we stuck to a method exposed in the datetime header. Does get_datetime_metadata_from_dtype not help here?

Member Author

@lithomas1 lithomas1 Jun 21, 2023

It would be nice to use that, but I'm not sure that's public either (it's defined in multiarray/_datetime.h). Like PyArray_DatetimeDTypeMetaData, it also seems to be undocumented. Maybe one of the numpy maintainers (@seberg?) can clarify?

EDIT: Never mind, I didn't see that we already have this internally. Sorry for the ping.

@lithomas1 lithomas1 marked this pull request as ready for review June 21, 2023 03:33
@lithomas1 lithomas1 requested a review from WillAyd as a code owner June 21, 2023 03:33
Member

@WillAyd WillAyd left a comment

Very nice


@jbrockmendel
Member

@lithomas1 is there a viable way to do this using values_for_json, i.e. keep this complexity out of the C code?

@lithomas1
Member Author

@lithomas1 is there a viable way to do this using values_for_json, i.e. keep this complexity out of the C code?

Long term, we could move to using to_native_types like CSV does and feed strings directly to the JSON serializer (for datetime/EA types).

I don't think this PR adds too much complexity (the core changes are just ~3 lines), so maybe we can merge this for now and then try to pull the logic for anything other than native types (e.g. ints, floats) out of the JSON C code?
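The "feed strings to the serializer" idea can be sketched with the stdlib alone (to_native_types itself is a pandas internal; this only illustrates the shape of the approach):

```python
import json
from datetime import datetime

# Sketch: render datetimelike values to ISO strings up front, so the
# JSON encoder only ever sees plain strings and needs no datetime
# logic of its own
values = [datetime(2023, 6, 21, 3, 33), None]
rendered = [v.isoformat() if v is not None else None for v in values]
print(json.dumps(rendered))  # ["2023-06-21T03:33:00", null]
```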

@WillAyd
Member

WillAyd commented Jun 21, 2023

The challenge with values_for_json is also that to_json has keyword arguments that control the JSON output, e.g. date_format and date_unit. Those are both handled in the serializer currently; values_for_json would have to handle those arguments generically across EAs.
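Both keywords are part of the public to_json API; a sketch of how they steer the output (requires pandas, and the exact epoch values assume the default UTC interpretation):

```python
import json
import pandas as pd

s = pd.Series(pd.to_datetime(["2013-01-01"]))
epoch_ms = json.loads(s.to_json())["0"]              # default: epoch milliseconds
epoch_s = json.loads(s.to_json(date_unit="s"))["0"]  # date_unit switches resolution
iso = json.loads(s.to_json(date_format="iso"))["0"]  # date_format switches style
print(epoch_ms, epoch_s, iso)
```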

@jbrockmendel
Member

values_for_json would have to handle those arguments generically across EAs

That sounds like a better long-term solution than doing more here (assuming perf is OK). I'm happy to defer to you two on this PR in the interim.

@@ -1293,7 +1301,8 @@ char **NpyArr_encodeLabels(PyArrayObject *labels, PyObjectEncoder *enc,
}

int is_datetimelike = 0;
npy_int64 nanosecVal;
npy_int64 i8date;
NPY_DATETIMEUNIT dateUnit = NPY_FR_ns;
Member

Is the idea here that this should default to NS for things that are stored within object arrays?

Member Author

It's for the other stuff like timedeltas and dates; we use nanosecond resolution there.

I didn't try to fix those cases as well.

Member

What do you think it would take to handle those consistently? I don't know how commonly this happens, but I guess it's strange to only do this for numpy-typed arrays and not for object arrays that may contain datetimes.

Member

Knee-jerk: I think that for object dtype we should just call str(x) on everything and call it a day.

Member

The object stuff definitely adds a lot of complexity that may not be worth it, but just passing everything through as a str would be a breaking change. In particular, that would not work for nested containers within an object array.
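The nested-container problem is easy to see with the stdlib alone: str(x) turns a container into a single JSON string rather than a nested structure, silently changing the output shape.

```python
import json

# str() on a container produces one JSON string, not a nested
# structure, so a blanket str(x) fallback changes the output shape
nested = [[1, 2], {"a": 3}]
as_str = json.dumps([str(x) for x in nested])
proper = json.dumps(nested)
print(as_str)   # ["[1, 2]", "{'a': 3}"]
print(proper)   # [[1, 2], {"a": 3}]
```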

Member Author

I think it might already work for object arrays.

I haven't checked yet; I'm not on the right branch. Will update later in the day.

Member Author

Well, it turns out that this didn't work already. Sorry for the confusion.

I've fixed this in the latest commit.

PyErr_SetString(PyExc_TypeError, "Expected date object");
if (!PyDate_Check(obj) && !PyDateTime_Check(obj)) {
PyErr_SetString(PyExc_TypeError, "Expected date or datetime object");
((JSONObjectEncoder *)tc->encoder)->errorMsg = "";
Member Author

Off-topic, but we should be setting the errorMsg whenever we return NULL in a callback.

Otherwise, there is a possibility of a segfault, since ujson will still think it's in a valid state and try to continue encoding, instead of instantly returning (allowing the Python exception to propagate up).

if (((PyDatetimeScalarObject *)obj)->obval == get_nat()) {
tc->type = JT_NULL;
return;
}
PyArray_Descr *dtype = PyArray_DescrFromScalar(obj);
if (dtype->type_num == NPY_OBJECT) {
Member Author

Wondering if this can be removed. This shouldn't ever fail, but if it does, we'll end up with a nasty segfault.

Member

https://numpy.org/doc/stable/reference/c-api/array.html#c.PyArray_DescrFromScalar

Reading through that makes me think we are supposed to check that we are dealing with an array scalar before even calling this function.

As for the resulting comparison to NPY_OBJECT, I think it would be better to explicitly use the PyTypeNum_ISDATETIME macro and error when that is not true; that would be a more explicit safeguard against troublesome refactors.
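At the Python level, the analogous guard looks roughly like this (PyTypeNum_ISDATETIME is the C-side macro, which also covers timedelta64; np.issubdtype is the nearest Python-level equivalent):

```python
import numpy as np

# Confirm the descriptor obtained from the scalar really is a
# datetime dtype before trusting its metadata
scalar = np.datetime64("2023-06-21", "ms")
dtype = scalar.dtype
print(np.issubdtype(dtype, np.datetime64))             # True
print(np.issubdtype(np.dtype(object), np.datetime64))  # False
```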

Member Author

Yeah, this whole block is in a PyArray_IsScalar if block.

Member

@WillAyd WillAyd left a comment

LGTM. Minor comments, but I'm OK with merging whenever you feel it's complete.

if (((PyDatetimeScalarObject *)obj)->obval == get_nat()) {
tc->type = JT_NULL;
return;
}
PyArray_Descr *dtype = PyArray_DescrFromScalar(obj);
if (dtype->type_num == NPY_OBJECT) {
PyErr_Format(PyExc_ValueError, "Could not get resolution of datetime");
Member

Any reason we aren't returning here? It feels a bit off to let execution continue after setting the error.

Member Author

My bad, updated.

@lithomas1 lithomas1 merged commit ee5cf2c into pandas-dev:main Jul 25, 2023
@lithomas1 lithomas1 deleted the json-non-nano branch July 25, 2023 22:07
@lithomas1
Member Author

Going to put this in, then.

Thanks for the reviews, all.

@mroeschke mroeschke added this to the 2.1 milestone Jul 25, 2023
jamie-harness pushed a commit to jamie-harness/pandas1 that referenced this pull request Jul 26, 2023
…das-dev#53757)

* BUG: to_json not serializing non-nanosecond numpy dt64 correctly

* fix tests

* change extraction mech

* fix object array case

* pre-commit

* address comments
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: to_json returns incorrect timestamps if DatetimeIndex precision is not ns
4 participants