
Removed block context from objToJSON #27166

Closed
WillAyd wants to merge 38 commits

Conversation

@WillAyd (Member) commented Jul 1, 2019

Getting one failure locally which I'll check later, but pushing to see how things look on other platforms

@WillAyd (Member Author) commented Jul 1, 2019

One failing test I see locally has to do with datetime precision (the output doesn't go down to ns); still debugging

@gfyoung added the Internals (Related to non-user accessible pandas implementation) and IO JSON (read_json, to_json, json_normalize) labels on Jul 1, 2019
@jbrockmendel (Member)

If it’s rounding to 6 digits that suggests it is casting to pydatetime somewhere along the way
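
For illustration, Python's datetime only carries microsecond precision, so a detour through pydatetime drops the nanoseconds; roughly:

In [1]: ts = pd.Timestamp('2013-01-01 20:43:42.123456789')

In [2]: ts.to_pydatetime()
Out[2]: datetime.datetime(2013, 1, 1, 20, 43, 42, 123456)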

@WillAyd (Member Author) commented Jul 2, 2019

Yeah, so the fix here isn't too far off; the problem is more or less that the prior block context was responsible for letting the parser know the dtype of objects as they were being converted to JSON. After its removal there is some code that infers this, but it does so in one pass for the entire 2D frame, which basically means the data gets read in as object dtype, and the default parser precision of ms is what gets written out.

If you write out only one datetime column this works fine, and it probably would for a homogeneous frame of datetimes, just not when mixed with other types. The solution isn't far off - just need a better way of setting the dtype within the context that an object is getting written out. Stay tuned...

@jreback (Contributor) commented Jul 2, 2019

@WillAyd this converts to object before writing things out? tell me this is not so.

@WillAyd (Member Author) commented Jul 2, 2019

@jreback not quite. Probably best explained via code.

This actually works on this branch:

In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     pd.Timestamp('20130101 20:43:42.123456789')]])

In [8]: df.to_json(date_unit='ns')
Out[8]: '{"0":{"0":1357073022123456789},"1":{"0":1357073022123456789}}'

This doesn't:

In [9]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     'astring']])

In [10]: df.to_json(date_unit='ns')
Out[10]: '{"0":{"0":1357073022123456000},"1":{"0":"astring"}}'

So the problem is that during conversion to JSON there is a call to .to_numpy on the 2D object, which sets the dtype of the "context" as objects are being written out. Previously this was done at the block level, so it would correctly indicate that you were dealing with datetimes:

In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     pd.Timestamp('20130101 20:43:42.123456789')]])
In [14]: df.to_numpy()
Out[14]:
array([['2013-01-01T20:43:42.123456789', '2013-01-01T20:43:42.123456789']],
      dtype='datetime64[ns]')

Whereas the same call on the mixed-dtype frame obviously returns object dtype.

So there are just some extra tweaks that need to be made, probably setting the dtype for the writing context per column rather than on a 2D object prior to writing out the individual arrays.
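
For illustration, a per-column call keeps the datetime dtype that the 2D call loses (a rough sketch of the idea):

In [15]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'), 'astring']])

In [16]: df.to_numpy().dtype
Out[16]: dtype('O')

In [17]: df[0].to_numpy().dtype
Out[17]: dtype('<M8[ns]')

In [18]: df[1].to_numpy().dtype
Out[18]: dtype('O')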

@jreback (Contributor) commented Jul 2, 2019

@WillAyd you answered my question.

> So the problem is that during conversion to JSON there is a call to .to_numpy on the 2D object, which sets the dtype of the "context" as objects are being written out

why are we operating with a 2-D object at all here? This needs to be a column-by-column process (if you don't use blocks); you should never have a 2-D object.

@WillAyd (Member Author) commented Jul 2, 2019

Yep, just an artifact of the old way of doing it I suppose (the diff here is practically all pure removal of the block manager).

So I will have to add a few lines of code here to do this correctly column by column. Hoping to get to that over the next day or so, but even with that added in this should net a significant reduction in complexity!

@jreback (Contributor) commented Jul 2, 2019

oh no question it reduces complexity
even at the cost of some perf (for wide frames)
but have to be careful with calling .to_numpy(); though i guess for datetime with tz and other EAs that’s unavoidable / fine

just want to do it column by column
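
For illustration of the tz / EA case, a tz-aware column comes back from .to_numpy() as an object array of Timestamps, so some materialization there is unavoidable; roughly:

In [1]: s = pd.Series(pd.date_range('2013-01-01', periods=2, tz='US/Eastern'))

In [2]: s.to_numpy()
Out[2]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern')], dtype=object)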

@WillAyd (Member Author) commented Jul 17, 2019

Still WIP and the latest push will cause segfaults, but I think I had a breakthrough on how to implement this better.

Right now there's a mix of block and DataFrame usage. What I'm aiming to do is just have well-defined serialization rules for DataFrame, which would eliminate the need for blocks altogether. This was previously done for the SPLIT orient and I have it done for COLUMNS (which is the default). Just need to get the RECORDS and VALUES orients working (table is actually RECORDS behind the scenes).
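
For reference, a small illustration of the shapes the orients under discussion produce:

In [3]: df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

In [4]: df.to_json(orient="columns")
Out[4]: '{"a":{"x":1,"y":2}}'

In [5]: df.to_json(orient="split")
Out[5]: '{"columns":["a"],"index":["x","y"],"data":[[1],[2]]}'

In [6]: df.to_json(orient="records")
Out[6]: '[{"a":1},{"a":2}]'

In [7]: df.to_json(orient="values")
Out[7]: '[[1],[2]]'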

Also added some documentation to the C code in case anyone else finds it useful.

@WillAyd (Member Author) commented Jul 19, 2019

OK all green here so I think the general approach is reviewable. There's still a lot more that can be done, but I'm trying to balance review complexity versus change.

Right now there are some regressions introduced that I'm taking a look at. I'm not surprised by the regressions with the index orient because I'm calling iterrows from the Python level. I may change this to transpose first and see how the benchmarks look.

       before           after         ratio
     [9bab81e0]       [f60a1398]
     <stash~1^2>       <json-remove-blocks>
+         214±8ms       9.64±0.02s    44.99  io.json.ToJSON.time_float_int_str_lines('index')
+         169±1ms       5.71±0.04s    33.71  io.json.ToJSON.time_float_int_lines('index')
+         168±4ms       5.64±0.02s    33.52  io.json.ToJSON.time_floats_with_dt_index_lines('columns')
+         175±8ms       5.63±0.01s    32.16  io.json.ToJSON.time_floats_with_dt_index_lines('index')
+         172±7ms       5.22±0.04s    30.36  io.json.ToJSON.time_floats_with_int_idex_lines('index')
+        98.9±3ms        2.36±0.1s    23.89  io.json.ToJSON.time_delta_int_tstamp('split')
+         136±5ms          220±3ms     1.61  io.json.ToJSON.time_float_int_str('columns')
+         139±5ms         219±20ms     1.58  io.json.ToJSON.time_float_int('columns')
+         131±4ms          193±5ms     1.47  io.json.ToJSON.time_delta_int_tstamp('columns')
+         107±1ms          120±6ms     1.12  io.json.ToJSON.time_float_int_str('split')
-         135±6ms          106±9ms     0.79  io.json.ToJSON.time_floats_with_dt_index('split')
-         125±3ms         95.5±3ms     0.77  io.json.ToJSON.time_floats_with_int_index('split')

int index = GET_TC(tc)->index - 1;

// Also TODO: return a value here rather than having NpyArr_getLabel modify output buf
NpyArr_getLabel(obj, tc, outLen, index, GET_TC(tc)->columnLabels);
@WillAyd (Member Author) commented Jul 19, 2019

It took me a while to grok the use of this function, and I find it rather unreadable, but I'm leaving it for a subsequent PR. With ujson key/value pairs you can override the FOO_iterGetName method to provide the key for a particular object, which would be the idiomatic approach.

I guess as a result of blocks not mapping cleanly to key/value pairs, the historical approach was to encode the labels separately and use this dedicated method to write directly to the output buffer rather than returning a char * for ujson to encode into the buffer automatically.

It would take a rather large effort, so I'm leaving it to a separate PR, but I would like to use ujson iteration instead of manually encoding all values up front.

@jbrockmendel (Member)

> OK all green here so I think the general approach is reviewable. There's still a lot more that can be done, but I'm trying to balance review complexity versus change.

Nice! I'm looking forward to seeing this go

@jreback (Contributor) left a comment

ok, I didn't really see where you are calling iterrows, but you can simply call itertuples if needed for much better perf. what are the current benchmarks?
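
For reference, a minimal sketch of the difference being suggested (not the code in this PR): itertuples yields lightweight tuples instead of constructing a Series per row, which is why it is typically much faster for this kind of loop.

import pandas as pd

df = pd.DataFrame({"a": range(3), "b": ["x", "y", "z"]})

# iterrows: builds an (index, Series) pair for every row
for idx, row in df.iterrows():
    print(idx, row["a"], row["b"])

# itertuples: yields namedtuples (plain tuples with name=None), no per-row Series
for row in df.itertuples(name=None):
    print(row)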

def setup(self):
    N = 10 ** 5
    ncols = 5
    index = date_range("20000101", periods=N, freq="H")
@jreback (Contributor):

can you use params here instead? (you can leave setup like this), but the methods then get the param names
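
A rough sketch of what the params suggestion could look like in asv terms (class and method names here are illustrative, not taken from the actual benchmark file):

import numpy as np
from pandas import DataFrame, date_range


class ToJSON:
    # each orient becomes a parameter instead of a separate method
    params = ["split", "columns", "index", "records", "values"]
    param_names = ["orient"]

    def setup(self, orient):
        N = 10 ** 5
        index = date_range("20000101", periods=N, freq="H")
        self.df = DataFrame(np.random.randn(N, 5), index=index)

    def time_to_json(self, orient):
        self.df.to_json(orient=orient)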

GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "columns");
} else if (index == 1) {
@jreback (Contributor):

what are these integers referring to?

@WillAyd (Member Author) commented Jul 20, 2019

The SPLIT format has custom iteration to write out (in order):

  • 0: the index
  • 1: the columns
  • 2: the data

This was already in place, so here I just wrapped it in a conditional statement to have the other orients pass through this context as well. Should certainly refactor to align iteration with the other orients (I think that's better as a follow-up, but will see).

@jreback (Contributor):

k, yeah, odd that these are used here (but sure, could have been in the original)

@WillAyd (Member Author) commented Jul 20, 2019

Audited the benchmarks and noticed they didn't have complete coverage. Added better coverage in the last PR.

So just to keep things current: the latest code I have introduces serious regressions. Those are listed below:

       before           after         ratio
     [28317f5a]       [8b226a69]
     <master>         <json-remove-blocks>
+        80.9±2ms       11.6±0.03s   143.57  io.json.ToJSON.time_delta_int_tstamp('values')
+         106±8ms       11.7±0.08s   110.04  io.json.ToJSON.time_delta_int_tstamp('records')
+         128±5ms       11.6±0.01s    90.91  io.json.ToJSON.time_delta_int_tstamp('index')
+        137±10ms       9.18±0.04s    67.05  io.json.ToJSON.time_float_int_str('index')
+        77.5±3ms       4.98±0.01s    64.22  io.json.ToJSON.time_floats_with_dt_index('values')
+        87.9±3ms       5.23±0.01s    59.55  io.json.ToJSON.time_floats_with_int_index('records')
+        88.3±5ms       5.12±0.01s    57.95  io.json.ToJSON.time_float_int('values')
+      99.4±0.6ms       5.30±0.01s    53.33  io.json.ToJSON.time_floats_with_int_index('index')
+         114±9ms       5.49±0.02s    48.05  io.json.ToJSON.time_float_int('records')
+         127±6ms       5.89±0.01s    46.53  io.json.ToJSONLines.time_floats_with_int_index_lines
+        90.9±3ms       2.40±0.03s    26.37  io.json.ToJSON.time_delta_int_tstamp('split')
+         108±1ms          164±2ms     1.52  io.json.ToJSON.time_delta_int_tstamp('columns')
+        118±10ms          164±1ms     1.39  io.json.ToJSON.time_float_int_str('columns')
+        91.4±1ms          121±1ms     1.32  io.json.ToJSON.time_floats_with_int_index('columns')
+         135±2ms          172±3ms     1.27  io.json.ToJSON.time_float_int('columns')
+            137M             160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('datetime')
+            137M             160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('int')
+            174M             198M     1.14  io.json.ReadJSONLines.peakmem_read_json_lines('datetime')
+            174M             198M     1.13  io.json.ReadJSONLines.peakmem_read_json_lines('int')

Most of the bottlenecks here are with the values / records / index orients, which perhaps not surprisingly all iterate by rows.
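
For reference, these orients lay the output out row by row, e.g. (illustrative):

In [8]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [9]: df.to_json(orient="index")
Out[9]: '{"0":{"a":1,"b":"x"},"1":{"a":2,"b":"y"}}'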

This obviously isn't production ready, but I think this basis is a lot simpler, so I will keep updating as I find optimization opportunities to get us back to where we were.

@WillAyd (Member Author) commented Jul 26, 2019

This is getting rather unmanageable to do all at once; splitting it up into smaller PRs.

Successfully merging this pull request may close these issues.

INT: the json C code should not deal with blocks