Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_json failing on PyPy #40525

Merged
merged 9 commits into from
Mar 23, 2021
Merged

Conversation

mzeitlin11
Copy link
Member

Not sure how best to test this - confirmed this now works when run on PyPy, but since we don't test against PyPy the examples in the issue (or any other simple case) which fail on PyPy will still pass on master.

Benchmarks look similar, might be minor performance improvement for the single block case:

before           after         ratio
     [b524462e]       [b1f4bcc8]
     <master>         <bug/is_mixed_type>
          194±5ms          190±2ms     0.98  io.json.ToJSONISO.time_iso_format('columns')
          208±5ms          198±6ms     0.95  io.json.ToJSONISO.time_iso_format('index')
          186±4ms          173±3ms     0.93  io.json.ToJSONISO.time_iso_format('records')
         193±10ms          187±3ms     0.97  io.json.ToJSONISO.time_iso_format('split')
          164±4ms          165±3ms     1.00  io.json.ToJSONISO.time_iso_format('values')
          115±4ms          114±3ms     0.99  io.json.ToJSONLines.time_delta_int_tstamp_lines
          142±4ms          132±6ms     0.93  io.json.ToJSONLines.time_float_int_lines
          140±4ms          142±4ms     1.02  io.json.ToJSONLines.time_float_int_str_lines
          152±2ms          153±3ms     1.01  io.json.ToJSONLines.time_float_longint_str_lines
          110±3ms          107±3ms     0.97  io.json.ToJSONLines.time_floats_with_dt_index_lines
          112±3ms          104±3ms     0.93  io.json.ToJSONLines.time_floats_with_int_idex_lines
          54.3M            55M         1.01  io.json.ToJSONMem.peakmem_float
          55.4M               55.2M    1.00  io.json.ToJSONMem.peakmem_int
          88.6±2ms         78.8±2ms    0.89  io.json.ToJSON.time_to_json('index', 'df')
          70.7±2ms         61.7±2ms    0.87  io.json.ToJSON.time_to_json('values', 'df')
          94.9±3ms         81.4±1ms    0.86  io.json.ToJSON.time_to_json('index', 'df_date_idx')
          76.3±0.9ms       65.1±1ms    0.85  io.json.ToJSON.time_to_json('records', 'df_date_idx')

@mzeitlin11 mzeitlin11 added IO JSON read_json, to_json, json_normalize Regression Functionality that used to work in a prior pandas version labels Mar 20, 2021
@jbrockmendel
Copy link
Member

Looks harmless cc @WillAyd

@jreback jreback added this to the 1.2.4 milestone Mar 22, 2021
@jreback
Copy link
Contributor

jreback commented Mar 22, 2021

looks fine and backport ok.

@@ -144,6 +144,7 @@ typedef struct __PyObjectEncoder {
enum PANDAS_FORMAT { SPLIT, RECORDS, INDEX, COLUMNS, VALUES };

int PdBlock_iterNext(JSOBJ, JSONTypeContext *);
static Py_ssize_t get_attr_length(PyObject *obj, char *attr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than forward declaring this can you just swap the position the of get_attr_length and is_simple_frame functions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make that change

return 0;
}
int ret = (get_attr_length(mgr, "blocks") <= 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I'm not really sure about this change. The problem with how this code is currently set up is that when is_simple_frame is True then we will just serialize whatever .values is for that DataFrame. However, the rules for extension arrays are a bit different.

Do we have test coverage for DataFrames that only contain 1 block of an extension array? If not I'd be worried this could change things that we aren't testing currently

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation here was to replace the is_mixed_type property the same way it was handled in #40525 (so theoretically this should revert behavior to pre #40525).

Will look into your question and see if that case is covered

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see any coverage, have added some simple tests for the one ExtensionBlock case (which match 1.1.5 behavior). Let me know if this is what you had in mind!

@WillAyd
Copy link
Member

WillAyd commented Mar 23, 2021

Thanks for the updates. Actually reading through the history a little more I understand the length check that you've added - I didn't realize that was the case before in internals

The test isn't really what I was looking for but let's just go ahead and remove that to get this in. For context, the extension arrays I were worried about are likely commented here:

// The special cases to worry about are dt64tz and category[dt64tz].

We've had a rather strange history with .values and what actually gets read in the JSON code. At this point it appears pretty much everything uses that though. If you are interested in contributing more could definitely use help cleaning up the comments and code around that...

But yea for now let's just remove the test and merge this in

@mzeitlin11
Copy link
Member Author

Removed the test. Thanks for pointing that out, I see what you mean now

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm when green

@jreback jreback merged commit 1dab473 into pandas-dev:master Mar 23, 2021
@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

thanks @mzeitlin11

@jreback
Copy link
Contributor

jreback commented Mar 23, 2021

@meeseeksdev backport 1.2.x

@mzeitlin11 mzeitlin11 deleted the bug/is_mixed_type branch March 23, 2021 15:19
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Mar 23, 2021
simonjayhawkins pushed a commit that referenced this pull request Mar 25, 2021
Co-authored-by: Matthew Zeitlin <37011898+mzeitlin11@users.noreply.github.com>
vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: AttributeError: 'BlockManager' object has no attribute 'is_mixed_type' when trying to jsonify dataframe.
4 participants