
ENH: Add JSON export option for DataFrame #631 #1226

Closed
wants to merge 114 commits into from

Conversation

Komnomnomnom
Contributor

No description provided.

Bundle custom ujson lib for DataFrame and Series JSON export & import.
@takluyver
Contributor

I don't think we should be bundling a JSON encoder. There has been a json module in the Python standard library since 2.6, and it's simple enough to install other implementations if the user needs, e.g., more speed. Let's just have a little shim module that tries to import JSON APIs in order of preference.
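Something along these lines, just as a rough sketch (the module name and the preference order here are only an example):

# json_compat.py -- illustrative sketch of a shim that picks the fastest
# available JSON implementation and falls back to the stdlib module.
try:
    import ujson as json            # fast C implementation, if the user has it
except ImportError:
    try:
        import simplejson as json   # common C-accelerated alternative
    except ImportError:
        import json                 # stdlib, always available since 2.6

loads = json.loads
dumps = json.dumps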

@Komnomnomnom
Contributor Author

@takluyver there's a bit of a discussion already at #631, not sure if you're aware of it. I should have added more info in the description though, sorry. The main motivation for including this fork of ujson in pandas is that it works with pandas datatypes at a very low level (it is pure C), so it wouldn't be of any benefit to non-pandas users. If a user wants to use their own favourite JSON decoder they would obviously still be free to do so.

However, I'll admit that high-performance JSON serialisation is probably a minor requirement for most people's needs, so I'm happy either way.

@takluyver
Contributor

Thanks, I wasn't aware of that. I'm still not wild about the approach - it seems like it will make for a heavier library and a bigger codebase to maintain. But Wes seems to be happy with the idea, so you don't have to worry about my objections ;-)

A couple of practical questions:

Your README has a lot of benchmarks, but I haven't taken the time to work out what they all mean. Can you summarise: what sort of improvement do we see from forking ujson, versus the best we could do with a stock build?

What sort of workloads do we envisage - is the bottleneck when you have one huge dataframe, or thousands of smaller ones?

Assuming ujson is still actively developed, how important and how easy will it be to get updates from upstream in the future?

@Komnomnomnom
Contributor Author

When working with numpy types:

  • encoding: no real advantage, sometimes even a disadvantage, since the numpy-to-list conversion is already very efficient.
  • decoding: about 1.5x to 2x the speed (when working with numeric types).

DataFrames:

  • encoding: depending on the desired format and the nature (shape, size) of the input, a speedup of about 2x to 10x, although there are cases where it's about 20x (e.g. 200x4 zeros).
  • decoding: again depending on the encoded format, a speedup of about 2x to 3x is typical, but it can be up to 20x.
  • for time series data, encoding and decoding are usually better than, or on a par with, encoding and decoding the corresponding basic Python type (i.e. a dict). For time series data with datetime indices I'm seeing about a 7x speedup for encoding DataFrames and about 3x for decoding. In the best case, where a transpose would otherwise be necessary, the speedup is about 15x to 20x.

And this is on top of ujson already being one of the speediest JSON libraries.
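If you want a rough feel for the difference yourself, a comparison along these lines works (just an illustrative sketch: the import path for the bundled encoder, the frame size and the iteration count are arbitrary here; the README covers the full set of benchmarks):

import json
import timeit

import numpy as np
from pandas import DataFrame
import pandas.json as ujson  # bundled fork; the exact import path may differ in this branch

# an arbitrary all-float frame, so the stdlib encoder can handle the values too
df = DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])

# bundled C encoder working on the numpy data directly
fast = timeit.timeit(lambda: ujson.encode(df), number=100)

# stdlib json, which needs the DataFrame converted to plain Python objects first
slow = timeit.timeit(lambda: json.dumps(df.to_dict()), number=100)

print('bundled ujson: %.4fs   stdlib json: %.4fs' % (fast, slow))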

My specific use case is the need to share lots of DataFrames, with a mix of sizes, between Python processes (and other languages). JSON was the natural choice for us because of its portability, and we wanted to get the best performance out of it.

ujson is a relatively small and stable library. There have only been some minor patches in the last few months and the author seems pretty open to pull requests etc. I'll be merging any applicable upstream changes into my fork and I'd be happy to do the same for pandas if it ends up being integrated. I'm pretty familiar with the ujson code now (it's really only four files) and I'd likewise be happy to deal with any bugs / enhancements coming from pandas usage too. It's worth noting that the library is split into two parts: the language-agnostic JSON encoder / decoder and the Python bindings. I managed to keep the bulk of my changes limited to the Python bindings, and even then they are new functions / new code rather than changes to existing functions. My point being that upstream changes should be easy enough to merge.

@takluyver
Contributor

Thanks, that all sounds pretty reasonable, and I'm satisfied that this is worth doing.

@wesm
Member

wesm commented May 12, 2012

This is really excellent work, thanks so much for doing this. Yeah, I was initially a bit hesitant to bundle ujson, but given that more and more people want to do JS<->pandas integration, getting the best possible encoding/decoding performance and being able to access the NumPy arrays directly in the C encoder makes a lot of sense. We'll have to periodically pull in upstream changes from ujson, I guess.

@vgoklani

Just curious, how would this handle nested JSON? e.g.

j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

df = pandas.DataFrame(j)

df = ?

@Komnomnomnom
Contributor Author

From a performance standpoint, not very well I'm afraid: the numpy-with-labels handling bombs out if it detects more than two levels of nesting. It could probably be tweaked to deal with this better, but when decoding complex types (i.e. objects and strings) a Python list is needed as an intermediary anyway, so I'm not sure there'd be any advantage.

The good news is the methods in DataFrame and Series fall back to standard decoding if the numpy version fails, so it should still work as expected, albeit without the performance improvements.
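Roughly, the fallback amounts to something like this (an illustrative sketch only, not the actual implementation; fast_decode and plain_decode stand in for the numpy-aware and standard decode paths):

def decode_with_fallback(json_str, fast_decode, plain_decode):
    # fast path: decode straight into numpy arrays with labels;
    # this raises if the input is nested more than two levels deep
    try:
        return fast_decode(json_str)
    except ValueError:
        # fallback: standard decode to ordinary Python dicts / lists
        return plain_decode(json_str)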

Just tested it out to make sure:

In [1]: from pandas import DataFrame
In [2]: j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

In [3]: df = DataFrame(j)

In [4]: df
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [5]: df['person']['occupation']
Out[5]: 
{'accomplishments': ['Brownian motion',
  'Special Relativity',
  'General Relativity'],
 'institution': 'Princeton University',
 'job_title': 'Theoretical Physicist'}

In [6]: df.to_json()
Out[6]: '{"person":{"first_name":"Albert","last_name":"Einstein","occupation":{"accomplishments":["Brownian motion","Special Relativity","General Relativity"],"institution":"Princeton University","job_title":"Theoretical Physicist"}}}'

In [7]: json = df.to_json()

In [8]: DataFrame.from_json(json)
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [9]: DataFrame.from_json(json)['person']['occupation']
Out[9]: 
{u'accomplishments': [u'Brownian motion',
  u'Special Relativity',
  u'General Relativity'],
 u'institution': u'Princeton University',
 u'job_title': u'Theoretical Physicist'}

Edit: I should have mentioned the comments above are related to decoding only. Encoding does not suffer the same issues and the performance improvements still apply.

@wesm
Member

wesm commented May 19, 2012

Hey @Komnomnomnom, I started to see if I could merge this and I'm getting a segfault on my system (Python 2.7.2, NumPy 1.6.1, 64-bit Ubuntu).

The object returned by series.to_json(orient='columns') in _check_orient(series, "columns", dtype=dtype) from test_series.py, line 341, appears to be NULL (the gdb backtrace showed the segfault in from_json, but the data returned by to_json is malformed):

test_from_json_to_json (__main__.TestSeries) ... > /home/wesm/code/pandas/pandas/tests/test_series.py(324)_check_orient()
-> foo
(Pdb) type(series.to_json(orient=orient))
Segmentation fault

I can probably track down the problem, but since you wrote the C code I figure you'd be better placed to fix it if you can reproduce the error.

@Komnomnomnom
Contributor Author

Hi Wes, I just tried with my local clone of my fork and had no segmentation fault (all tests passed when I made my commit / pull request). I'll merge in the latest from pandas master and see what happens.

For the record I'm using Python 2.7.2 and numpy 1.6.1 on 64-bit OSX.

@wesm
Member

wesm commented May 19, 2012

I put in print statements

  printf("%s\n", ret);

  printf("length: %d\n", strlen(ret));

and here's the output

{"2F4SMHsw4I":-1.4303216796,"nMi4KBCmg7":-1.32552412,"Molf5Ue3kF":-1.2705465829,"9kkHHlfXPA":-0.8877964843,"6E3ma1UHv7":-0.850191537,"2F5JdoFIqQ":-0.8013936673,"VzJclGGLsr":-0.7985248155,"cI4bkkV9MH":-0.7000873004,"TxS6mJ8UuP":-0.6864885751,"2jGSZe0rmF":-0.6708315768,"oHooxHeHqu":-0.6482430589,"HuqOm1mf57":-0.624890804,"bEWcPipOk9":-0.5669391204,"zpy7FQCGgp":-0.3383151716,"nYIL8VPVT3":-0.2663003599,"x0YmXOvJ49":-0.1767082308,"bJm3Pbjx14":-0.1510545428,"E51nrgW9Yt":0.0101299091,"QycwIANnTx":0.1575097137,"8wVdQ8RIdQ":0.2073634038,"90c5KPKyeS":0.2539122603,"eERFnAAd8k":0.3728367,"tZLEG6seKV":0.4332938883,"ehdTUcPK7A":0.457039038,"biYpVDeFiz":0.5021518808,"JlVXVA62Zz":0.5918523437,"2UTfjHGMEy":0.6413052158,"5VOyIV1TYs":0.6828158342,"WyNfVlEOK3":1.1809723971,"YrW1NS7fCX":1.3862224711}
length: 790

pandas/src/ujson/python/objToJSON.c: MARK(1490)
Segmentation fault

Somehow the result of PyString_FromString is malformed; it seems like maybe ret is not null-terminated? I suspect this is a red herring, though.

@wesm
Member

wesm commented May 19, 2012

It looks like something is getting corrupted:

14:09 ~/code/pandas  (json-export)$ python pandas/tests/test_ujson.py 
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
testArrayNumpyExcept (__main__.NumpyJSONTests) ... ok
testArrayNumpyLabelled (__main__.NumpyJSONTests) ... ok
testArrays (__main__.NumpyJSONTests) ... ok
testBool (__main__.NumpyJSONTests) ... ok
testBoolArray (__main__.NumpyJSONTests) ... ok
testFloat (__main__.NumpyJSONTests) ... ok
testFloatArray (__main__.NumpyJSONTests) ... ok
testFloatMax (__main__.NumpyJSONTests) ... ok
testInt (__main__.NumpyJSONTests) ... ok
testIntArray (__main__.NumpyJSONTests) ... ok
testIntMax (__main__.NumpyJSONTests) ... ok
testDataFrame (__main__.PandasJSONTests) ... > /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) u
> /home/wesm/epd/lib/python2.7/unittest/case.py(327)run()
-> testMethod()
(Pdb) d
> /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) l
938     class PandasJSONTests(TestCase):
939     
940         def testDataFrame(self):
941             df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z'])
942     
943  ->         foo
944             # column indexed
945             outp = DataFrame(ujson.decode(ujson.encode(df)))
946             self.assertTrue((df == outp).values.all())
947             assert_array_equal(df.columns, outp.columns)
948             assert_array_equal(df.index, outp.index)
(Pdb) ujson.encode(df)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'
(Pdb) print df
Segmentation fault

@wesm
Member

wesm commented May 19, 2012

It looks like the culprit must be NpyArr_encodeLabels. I'm not enough of a C guru to see what might be going wrong: everything works here except encoding Series/DataFrame, and there is plenty of twiddling of bytes inside. Let me know if you manage to figure it out =/

@Komnomnomnom
Contributor Author

Hmm, I've merged in the latest from pandas master; I'm seeing some failed tests but still no segmentation faults, no corruption, and those print statements work fine. I'm going to try in an Ubuntu VM and see if I can get to the bottom of it.

@Komnomnomnom
Contributor Author

Ugh, I did not know merging into my fork would flood this pull request. It might be best to delete my current fork and submit a new pull request once this issue is sorted.

The good news is that after a bit of setup I was able to reproduce the memory corruption you are seeing in my Ubuntu VM. It appears to happen even when NpyArr_encodeLabels is not involved. There is also some weirdness with timestamp conversion, but I think that is a separate issue.

@Komnomnomnom
Contributor Author

I believe I've found the problem. The reference count of the object being encoded was mistakenly being decremented twice. I presume it was just chance that the memory layout or garbage collection schedule on my laptop meant the object wasn't being deleted.

There are a few more things I've noticed (like build clean deleting the C files, and datetime conversion now not working) which I'll fix before submitting a new pull request. I'll close this one for now and create a feature branch on a new fork to avoid this mess happening again.

@wesm
Member

wesm commented May 20, 2012

That will teach you not to develop in master ;) BTW, you don't need to re-fork: you can git reset --hard upstream/master and force-push that to GitHub. Just make sure you make a branch of your current master with the JSON work.

@Komnomnomnom
Contributor Author

Oops, too late, I re-forked a few minutes ago... hope this doesn't cause further problems :-/

BTW, if you want to test the fix on your machine, the offending line was line 278 in NpyArr_iterEnd:
cb7c6ae#L6R279
(That line should be removed.)

Also, I'm still noticing some timestamp weirdness. I'm guessing there were recent changes in master regarding datetime64? Is this work still ongoing?

@wesm
Member

wesm commented May 20, 2012

Yes, the work is still ongoing. Are the test failures in JSON encoding/decoding or elsewhere (the pydata/master test suite passes cleanly for me)? I should be able to fix them myself.

@jreback
Contributor

jreback commented Jun 11, 2013

Implemented via #3804.
