
Removed block context from objToJSON #27166

Closed
WillAyd wants to merge 38 commits

Conversation

@WillAyd (Member) commented Jul 1, 2019

Getting one failure locally which I'll check later, but pushing to see how things look on other platforms

@WillAyd (Member Author) commented Jul 1, 2019

One failing test I see locally has to do with datetime precision (the output doesn't go down to ns); still debugging

@gfyoung added the Internals (Related to non-user accessible pandas implementation) and IO JSON (read_json, to_json, json_normalize) labels on Jul 1, 2019
@jbrockmendel (Member)

If it’s rounding to 6 digits that suggests it is casting to pydatetime somewhere along the way
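
For illustration, Python's datetime only carries microsecond precision, so a detour through pydatetime drops the nanoseconds; roughly:

In [1]: ts = pd.Timestamp('2013-01-01 20:43:42.123456789')

In [2]: ts.to_pydatetime()
Out[2]: datetime.datetime(2013, 1, 1, 20, 43, 42, 123456)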

@WillAyd (Member Author) commented Jul 2, 2019

Yeah, so the fix here isn't too far off; the problem is more or less that the prior block context was responsible for letting the parser know the dtype of objects as they were being converted to JSON. After its removal there is some code that infers this, but it does so in one pass for the entire 2D frame, which basically means the data gets read in as object dtype, and the default parser precision of ms is what gets written out.

If you write out only one datetime column this works fine, and it probably would for a homogeneous frame of datetimes, just not when mixed with other types. The solution isn't far off - just need a better way of setting the dtype within the context that an object is getting written out. Stay tuned...

@jreback (Contributor) commented Jul 2, 2019

@WillAyd this converts to object before writing things out? tell me this is not so.

@WillAyd (Member Author) commented Jul 2, 2019

@jreback not quite. Probably best explained via code.

This actually works on this branch:

In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     pd.Timestamp('20130101 20:43:42.123456789')]])

In [8]: df.to_json(date_unit='ns')
Out[8]: '{"0":{"0":1357073022123456789},"1":{"0":1357073022123456789}}'

This doesn't:

In [9]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     'astring']])

In [10]: df.to_json(date_unit='ns')
Out[10]: '{"0":{"0":1357073022123456000},"1":{"0":"astring"}}'

So the problem is that during conversion to JSON there is a call to .to_numpy on the 2D object, which sets the dtype of the "context" as objects are being written out. Previously this was done at the block level, so it would correctly indicate that you were dealing with datetimes:

In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:     pd.Timestamp('20130101 20:43:42.123456789')]])
In [14]: df.to_numpy()
Out[14]:
array([['2013-01-01T20:43:42.123456789', '2013-01-01T20:43:42.123456789']],
      dtype='datetime64[ns]')

Whereas the same call on the mixed-dtype frame obviously returns object dtype.

So there are just some extra tweaks that need to be made, probably setting the dtype for the writing context per column rather than on a 2D object prior to writing out the individual arrays.
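
For illustration, a per-column call keeps the datetime dtype that the 2D call loses (a rough sketch of the idea):

In [15]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'), 'astring']])

In [16]: df.to_numpy().dtype
Out[16]: dtype('O')

In [17]: df[0].to_numpy().dtype
Out[17]: dtype('<M8[ns]')

In [18]: df[1].to_numpy().dtype
Out[18]: dtype('O')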

@jreback (Contributor) commented Jul 2, 2019

@WillAyd you answered my question.

> So the problem is that during conversion to JSON there is a call to .to_numpy on the 2D object, which sets the dtype of the "context" as objects are being written out

why are we operating with a 2-D object at all here? This needs to be a column-by-column process (if you don't use blocks); you should never have a 2-D object.

@WillAyd (Member Author) commented Jul 2, 2019

Yep, just an artifact of the old way of doing it I suppose (the diff here is practically all pure removal of the block manager).

So I will have to add a few lines of code here to do this correctly column by column. Hoping to get to that over the next day or so, but even with that added in this should net a significant reduction in complexity!

@jreback (Contributor) commented Jul 2, 2019

oh no question it reduces complexity
even at the cost of some perf (for wide frames)
but have to be careful with calling .to_numpy(); though i guess for datetime with tz and other EAs that’s unavoidable / fine

just want to do it column by column
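
For illustration of the tz / EA case, a tz-aware column comes back from .to_numpy() as an object array of Timestamps, so some materialization there is unavoidable; roughly:

In [1]: s = pd.Series(pd.date_range('2013-01-01', periods=2, tz='US/Eastern'))

In [2]: s.to_numpy()
Out[2]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern'),
       Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern')], dtype=object)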

@WillAyd (Member Author) commented Jul 17, 2019

Still WIP and the latest push will cause segfaults, but I think I had a breakthrough on how to implement this better.

Right now there's a mix of block and DataFrame usage. What I'm aiming to do is just have well-defined serialization rules for DataFrame, which would eliminate the need for blocks altogether. This was previously done for the SPLIT orient and I have it done for COLUMNS (which is the default). Just need to get the RECORDS and VALUES orients working (table is actually RECORDS behind the scenes).
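
For reference, a small illustration of the shapes the orients under discussion produce:

In [3]: df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

In [4]: df.to_json(orient="columns")
Out[4]: '{"a":{"x":1,"y":2}}'

In [5]: df.to_json(orient="split")
Out[5]: '{"columns":["a"],"index":["x","y"],"data":[[1],[2]]}'

In [6]: df.to_json(orient="records")
Out[6]: '[{"a":1},{"a":2}]'

In [7]: df.to_json(orient="values")
Out[7]: '[[1],[2]]'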

Also added some documentation to the C code in case anyone else finds it useful.

@WillAyd (Member Author) commented Jul 19, 2019

OK all green here so I think the general approach is reviewable. There's still a lot more that can be done, but I'm trying to balance review complexity versus change.

Right now there are some regressions introduced that I'm taking a look at. I'm not surprised by the regressions with the index orient because I'm calling iterrows from the Python level. I may change this to transpose first and see how the benchmarks look.

       before           after         ratio
     [9bab81e0]       [f60a1398]
     <stash~1^2>       <json-remove-blocks>
+         214±8ms       9.64±0.02s    44.99  io.json.ToJSON.time_float_int_str_lines('index')
+         169±1ms       5.71±0.04s    33.71  io.json.ToJSON.time_float_int_lines('index')
+         168±4ms       5.64±0.02s    33.52  io.json.ToJSON.time_floats_with_dt_index_lines('columns')
+         175±8ms       5.63±0.01s    32.16  io.json.ToJSON.time_floats_with_dt_index_lines('index')
+         172±7ms       5.22±0.04s    30.36  io.json.ToJSON.time_floats_with_int_idex_lines('index')
+        98.9±3ms        2.36±0.1s    23.89  io.json.ToJSON.time_delta_int_tstamp('split')
+         136±5ms          220±3ms     1.61  io.json.ToJSON.time_float_int_str('columns')
+         139±5ms         219±20ms     1.58  io.json.ToJSON.time_float_int('columns')
+         131±4ms          193±5ms     1.47  io.json.ToJSON.time_delta_int_tstamp('columns')
+         107±1ms          120±6ms     1.12  io.json.ToJSON.time_float_int_str('split')
-         135±6ms          106±9ms     0.79  io.json.ToJSON.time_floats_with_dt_index('split')
-         125±3ms         95.5±3ms     0.77  io.json.ToJSON.time_floats_with_int_index('split')

int index = GET_TC(tc)->index - 1;

// Also TODO: return a value here rather than having NpyArr_getLabel modify output buf
NpyArr_getLabel(obj, tc, outLen, index, GET_TC(tc)->columnLabels);
@WillAyd (Member Author) commented Jul 19, 2019

It took me a while to grok the use of this function, and I find it rather unreadable, but I'm leaving it for a subsequent PR. With ujson key/value pairs you can override the FOO_iterGetName method to provide the key for a particular object, which would be the idiomatic approach.

I guess as a result of blocks not mapping cleanly to key/value pairs, the historical approach was to encode the labels separately and use this dedicated method to write directly to the output buffer rather than returning a char * for ujson to encode into the buffer automatically.

It would take a rather large effort, so I'm leaving it to a separate PR, but I would like to use ujson iteration instead of manually encoding all values up front.

@jbrockmendel (Member)

> OK all green here so I think the general approach is reviewable. There's still a lot more that can be done, but I'm trying to balance review complexity versus change.

Nice! I'm looking forward to seeing this go

@jreback (Contributor) left a comment

ok, I didn't really see where you are calling iterrows, but you can simply call itertuples if needed for much better perf. what are the current benchmarks?
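
For reference, a minimal sketch of the difference being suggested (not the code in this PR): itertuples yields lightweight tuples instead of constructing a Series per row, which is why it is typically much faster for this kind of loop.

import pandas as pd

df = pd.DataFrame({"a": range(3), "b": ["x", "y", "z"]})

# iterrows: builds an (index, Series) pair for every row
for idx, row in df.iterrows():
    print(idx, row["a"], row["b"])

# itertuples: yields namedtuples (plain tuples with name=None), no per-row Series
for row in df.itertuples(name=None):
    print(row)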

def setup(self):
    N = 10 ** 5
    ncols = 5
    index = date_range("20000101", periods=N, freq="H")
@jreback (Contributor):

can you use params here instead? (you can leave setup like this), but the methods then get the param names
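
A rough sketch of what the params suggestion could look like in asv terms (class and method names here are illustrative, not taken from the actual benchmark file):

import numpy as np
from pandas import DataFrame, date_range


class ToJSON:
    # each orient becomes a parameter instead of a separate method
    params = ["split", "columns", "index", "records", "values"]
    param_names = ["orient"]

    def setup(self, orient):
        N = 10 ** 5
        index = date_range("20000101", periods=N, freq="H")
        self.df = DataFrame(np.random.randn(N, 5), index=index)

    def time_to_json(self, orient):
        self.df.to_json(orient=orient)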

GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "columns");
} else if (index == 1) {
@jreback (Contributor):

what are these integers referring to?

@WillAyd (Member Author) commented Jul 20, 2019

The SPLIT format has custom iteration to write out (in order):

  • 0: the index
  • 1: the columns
  • 2: the data

This was already in place, so here I just wrapped it in a conditional statement to have the other orients pass through this context as well. Should certainly refactor to align iteration with the other orients (I think that's better as a follow-up, but will see).

@jreback (Contributor):

k, yeah, odd that these are used here (but sure, could have been in the original)

@WillAyd (Member Author) commented Jul 20, 2019

Audited the benchmarks and noticed they didn't have complete coverage. Added better coverage in the last PR.

So just to keep things current: the latest code I have introduces serious regressions. Those are listed below:

       before           after         ratio
     [28317f5a]       [8b226a69]
     <master>         <json-remove-blocks>
+        80.9±2ms       11.6±0.03s   143.57  io.json.ToJSON.time_delta_int_tstamp('values')
+         106±8ms       11.7±0.08s   110.04  io.json.ToJSON.time_delta_int_tstamp('records')
+         128±5ms       11.6±0.01s    90.91  io.json.ToJSON.time_delta_int_tstamp('index')
+        137±10ms       9.18±0.04s    67.05  io.json.ToJSON.time_float_int_str('index')
+        77.5±3ms       4.98±0.01s    64.22  io.json.ToJSON.time_floats_with_dt_index('values')
+        87.9±3ms       5.23±0.01s    59.55  io.json.ToJSON.time_floats_with_int_index('records')
+        88.3±5ms       5.12±0.01s    57.95  io.json.ToJSON.time_float_int('values')
+      99.4±0.6ms       5.30±0.01s    53.33  io.json.ToJSON.time_floats_with_int_index('index')
+         114±9ms       5.49±0.02s    48.05  io.json.ToJSON.time_float_int('records')
+         127±6ms       5.89±0.01s    46.53  io.json.ToJSONLines.time_floats_with_int_index_lines
+        90.9±3ms       2.40±0.03s    26.37  io.json.ToJSON.time_delta_int_tstamp('split')
+         108±1ms          164±2ms     1.52  io.json.ToJSON.time_delta_int_tstamp('columns')
+        118±10ms          164±1ms     1.39  io.json.ToJSON.time_float_int_str('columns')
+        91.4±1ms          121±1ms     1.32  io.json.ToJSON.time_floats_with_int_index('columns')
+         135±2ms          172±3ms     1.27  io.json.ToJSON.time_float_int('columns')
+            137M             160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('datetime')
+            137M             160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('int')
+            174M             198M     1.14  io.json.ReadJSONLines.peakmem_read_json_lines('datetime')
+            174M             198M     1.13  io.json.ReadJSONLines.peakmem_read_json_lines('int')

Most of the bottlenecks here are with the values / records / index orients, which perhaps not surprisingly all iterate by rows.
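
For reference, these orients lay the output out row by row, e.g. (illustrative):

In [8]: df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

In [9]: df.to_json(orient="index")
Out[9]: '{"0":{"a":1,"b":"x"},"1":{"a":2,"b":"y"}}'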

This obviously isn't production ready, but I think this basis is a lot simpler, so I will keep updating as I find optimization opportunities to get us back to where we were.

@WillAyd (Member Author) commented Jul 26, 2019

This is getting rather unmanageable to do all at once; splitting it up into smaller PRs.

Successfully merging this pull request may close these issues.

INT: the json C code should not deal with blocks