
ENH: Add JSON export option for DataFrame #631 #1226

Closed
wants to merge 114 commits into from

Conversation

Komnomnomnom
Contributor

No description provided.

Bundle custom ujson lib for DataFrame and Series JSON export & import.
@takluyver
Contributor

I don't think we should be bundling a JSON encoder. There has been a json module in the Python standard library since 2.6, and it's simple enough to install other implementations if the user needs, e.g., more speed. Let's just have a little shim module that tries to import JSON APIs in order of preference.
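Something along these lines, just as a rough sketch (the module name and the preference order here are only an example):

# json_compat.py -- illustrative sketch of a shim that picks the fastest
# available JSON implementation and falls back to the stdlib module.
try:
    import ujson as json            # fast C implementation, if the user has it
except ImportError:
    try:
        import simplejson as json   # common C-accelerated alternative
    except ImportError:
        import json                 # stdlib, always available since 2.6

loads = json.loads
dumps = json.dumps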

@Komnomnomnom
Contributor Author

@takluyver there's a bit of a discussion already at #631, not sure if you're aware of it. I should have added more info in the description though, sorry. The main motivation for including this fork of ujson in pandas is that it works with pandas datatypes at a very low level (it is pure C), so it wouldn't be of any benefit to non-pandas users. If a user wants to use their own favourite JSON decoder they would obviously still be free to do so.

However, I'll admit that high-performance JSON serialisation is probably a minor requirement for most people's needs, so I'm happy either way.

@takluyver
Contributor

Thanks, I wasn't aware of that. I'm still not wild about the approach - it seems like it will make for a heavier library and a bigger codebase to maintain. But Wes seems to be happy with the idea, so you don't have to worry about my objections ;-)

A couple of practical questions:

Your README has a lot of benchmarks, but I haven't taken the time to work out what they all mean. Can you summarise: what sort of improvement do we see from forking ujson, versus the best we could do with a stock build?

What sort of workloads do we envisage - is the bottleneck when you have one huge dataframe, or thousands of smaller ones?

Assuming ujson is still actively developed, how important and how easy will it be to get updates from upstream in the future?

@Komnomnomnom
Contributor Author

When working with numpy types:

  • encoding: no real advantage, sometimes even a disadvantage, since the numpy-to-list conversion is already very efficient.
  • decoding: about 1.5x to 2x the speed (when working with numeric types).

DataFrames:

  • encoding: depending on the desired format and the nature (shape, size) of the input, a speedup of about 2x to 10x, although there are cases where it's about 20x (e.g. 200x4 zeros).
  • decoding: again depending on the encoded format, a speedup of about 2x to 3x is typical, but it can be up to 20x.
  • for time series data, encoding and decoding are usually better than, or on a par with, encoding and decoding the corresponding basic Python type (i.e. a dict). For time series data with datetime indices I'm seeing about a 7x speedup for encoding DataFrames and about 3x for decoding. In the best case, where a transpose would otherwise be necessary, the speedup is about 15x to 20x.

And this is on top of ujson already being one of the speediest JSON libraries.
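If you want a rough feel for the difference yourself, a comparison along these lines works (just an illustrative sketch: the import path for the bundled encoder, the frame size and the iteration count are arbitrary here; the README covers the full set of benchmarks):

import json
import timeit

import numpy as np
from pandas import DataFrame
import pandas.json as ujson  # bundled fork; the exact import path may differ in this branch

# an arbitrary all-float frame, so the stdlib encoder can handle the values too
df = DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])

# bundled C encoder working on the numpy data directly
fast = timeit.timeit(lambda: ujson.encode(df), number=100)

# stdlib json, which needs the DataFrame converted to plain Python objects first
slow = timeit.timeit(lambda: json.dumps(df.to_dict()), number=100)

print('bundled ujson: %.4fs   stdlib json: %.4fs' % (fast, slow))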

My specific use case is the need to share lots of DataFrames, with a mix of sizes, between Python processes (and other languages). JSON was the natural choice for us because of its portability, and we wanted to get the best performance out of it.

ujson is a relatively small and stable library. There have only been some minor patches in the last few months and the author seems pretty open to pull requests etc. I'll be merging any applicable upstream changes into my fork and I'd be happy to do the same for pandas if it ends up being integrated. I'm pretty familiar with the ujson code now (it's really only four files) and I'd likewise be happy to deal with any bugs / enhancements coming from pandas usage too. It's worth noting that the library is split into two parts: the language-agnostic JSON encoder / decoder and the Python bindings. I managed to keep the bulk of my changes limited to the Python bindings, and even then they are new functions / new code rather than changes to existing functions. My point being that upstream changes should be easy enough to merge.

@takluyver
Contributor

Thanks, that all sounds pretty reasonable, and I'm satisfied that this is worth doing.

@wesm
Member

wesm commented May 12, 2012

This is really excellent work, thanks so much for doing this. Yeah, I was initially a bit hesitant to bundle ujson, but given that more and more people want to do JS<->pandas integration, getting the best possible encoding/decoding performance and being able to access the NumPy arrays directly in the C encoder makes a lot of sense. We'll have to periodically pull in upstream changes from ujson, I guess.

@vgoklani

Just curious, how would this handle nested JSON? e.g.

j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

df = pandas.DataFrame(j)

df = ?

@Komnomnomnom
Contributor Author

From a performance standpoint, not very well I'm afraid: the numpy-with-labels handling bombs out if it detects more than two levels of nesting. It could probably be tweaked to deal with this better, but when decoding complex types (i.e. objects and strings) a Python list is needed as an intermediary anyway, so I'm not sure there'd be any advantage.

The good news is the methods in DataFrame and Series fall back to standard decoding if the numpy version fails, so it should still work as expected, albeit without the performance improvements.
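Roughly, the fallback amounts to something like this (an illustrative sketch only, not the actual implementation; fast_decode and plain_decode stand in for the numpy-aware and standard decode paths):

def decode_with_fallback(json_str, fast_decode, plain_decode):
    # fast path: decode straight into numpy arrays with labels;
    # this raises if the input is nested more than two levels deep
    try:
        return fast_decode(json_str)
    except ValueError:
        # fallback: standard decode to ordinary Python dicts / lists
        return plain_decode(json_str)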

Just tested it out to make sure:

In [1]: from pandas import DataFrame
In [2]: j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}

In [3]: df = DataFrame(j)

In [4]: df
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [5]: df['person']['occupation']
Out[5]: 
{'accomplishments': ['Brownian motion',
  'Special Relativity',
  'General Relativity'],
 'institution': 'Princeton University',
 'job_title': 'Theoretical Physicist'}

In [6]: df.to_json()
Out[6]: '{"person":{"first_name":"Albert","last_name":"Einstein","occupation":{"accomplishments":["Brownian motion","Special Relativity","General Relativity"],"institution":"Princeton University","job_title":"Theoretical Physicist"}}}'

In [7]: json = df.to_json()

In [8]: DataFrame.from_json(json)
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person    3  non-null values
dtypes: object(1)

In [9]: DataFrame.from_json(json)['person']['occupation']
Out[9]: 
{u'accomplishments': [u'Brownian motion',
  u'Special Relativity',
  u'General Relativity'],
 u'institution': u'Princeton University',
 u'job_title': u'Theoretical Physicist'}

Edit: I should have mentioned the comments above are related to decoding only. Encoding does not suffer the same issues and the performance improvements still apply.

@wesm
Member

wesm commented May 19, 2012

Hey @Komnomnomnom, I started to see if I could merge this and I'm getting a segfault on my system (Python 2.7.2, NumPy 1.6.1, 64-bit Ubuntu).

The object returned by series.to_json(orient='columns') in _check_orient(series, "columns", dtype=dtype) from test_series.py, line 341, appears to be NULL (the gdb backtrace showed the segfault in from_json, but the data returned by to_json is malformed):

test_from_json_to_json (__main__.TestSeries) ... > /home/wesm/code/pandas/pandas/tests/test_series.py(324)_check_orient()
-> foo
(Pdb) type(series.to_json(orient=orient))
Segmentation fault

I can probably track down the problem, but since you wrote the C code I figure you'd be better placed to fix it if you can reproduce the error.

@Komnomnomnom
Contributor Author

Hi Wes, I just tried with my local clone of my fork and had no segmentation fault (all tests passed when I made my commit / pull request). I'll merge in the latest from pandas master and see what happens.

For the record I'm using Python 2.7.2 and numpy 1.6.1 on 64-bit OSX.

@wesm
Member

wesm commented May 19, 2012

I put in print statements

  printf("%s\n", ret);

  printf("length: %d\n", strlen(ret));

and here's the output

{"2F4SMHsw4I":-1.4303216796,"nMi4KBCmg7":-1.32552412,"Molf5Ue3kF":-1.2705465829,"9kkHHlfXPA":-0.8877964843,"6E3ma1UHv7":-0.850191537,"2F5JdoFIqQ":-0.8013936673,"VzJclGGLsr":-0.7985248155,"cI4bkkV9MH":-0.7000873004,"TxS6mJ8UuP":-0.6864885751,"2jGSZe0rmF":-0.6708315768,"oHooxHeHqu":-0.6482430589,"HuqOm1mf57":-0.624890804,"bEWcPipOk9":-0.5669391204,"zpy7FQCGgp":-0.3383151716,"nYIL8VPVT3":-0.2663003599,"x0YmXOvJ49":-0.1767082308,"bJm3Pbjx14":-0.1510545428,"E51nrgW9Yt":0.0101299091,"QycwIANnTx":0.1575097137,"8wVdQ8RIdQ":0.2073634038,"90c5KPKyeS":0.2539122603,"eERFnAAd8k":0.3728367,"tZLEG6seKV":0.4332938883,"ehdTUcPK7A":0.457039038,"biYpVDeFiz":0.5021518808,"JlVXVA62Zz":0.5918523437,"2UTfjHGMEy":0.6413052158,"5VOyIV1TYs":0.6828158342,"WyNfVlEOK3":1.1809723971,"YrW1NS7fCX":1.3862224711}
length: 790

pandas/src/ujson/python/objToJSON.c: MARK(1490)
Segmentation fault

Somehow the result of PyString_FromString is malformed; it seems like maybe ret is not null-terminated? I suspect this is a red herring, though.

@wesm
Member

wesm commented May 19, 2012

It looks like something is getting corrupted:

14:09 ~/code/pandas  (json-export)$ python pandas/tests/test_ujson.py 
nose.config: INFO: Ignoring files matching ['^\\.', '^_', '^setup\\.py$']
testArrayNumpyExcept (__main__.NumpyJSONTests) ... ok
testArrayNumpyLabelled (__main__.NumpyJSONTests) ... ok
testArrays (__main__.NumpyJSONTests) ... ok
testBool (__main__.NumpyJSONTests) ... ok
testBoolArray (__main__.NumpyJSONTests) ... ok
testFloat (__main__.NumpyJSONTests) ... ok
testFloatArray (__main__.NumpyJSONTests) ... ok
testFloatMax (__main__.NumpyJSONTests) ... ok
testInt (__main__.NumpyJSONTests) ... ok
testIntArray (__main__.NumpyJSONTests) ... ok
testIntMax (__main__.NumpyJSONTests) ... ok
testDataFrame (__main__.PandasJSONTests) ... > /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) u
> /home/wesm/epd/lib/python2.7/unittest/case.py(327)run()
-> testMethod()
(Pdb) d
> /home/wesm/code/pandas/pandas/tests/test_ujson.py(943)testDataFrame()
-> foo
(Pdb) l
938     class PandasJSONTests(TestCase):
939     
940         def testDataFrame(self):
941             df = DataFrame([[1,2,3], [4,5,6]], index=['a', 'b'], columns=['x', 'y', 'z'])
942     
943  ->         foo
944             # column indexed
945             outp = DataFrame(ujson.decode(ujson.encode(df)))
946             self.assertTrue((df == outp).values.all())
947             assert_array_equal(df.columns, outp.columns)
948             assert_array_equal(df.index, outp.index)
(Pdb) ujson.encode(df)
'{"x":{"a":1,"b":4},"y":{"a":2,"b":5},"z":{"a":3,"b":6}}'
(Pdb) print df
Segmentation fault

@wesm
Member

wesm commented May 19, 2012

It looks like the culprit must be NpyArr_encodeLabels. I'm not enough of a C guru to see what might be going wrong: everything works here except encoding Series/DataFrame, and there is plenty of twiddling of bytes inside. Let me know if you manage to figure it out =/

@Komnomnomnom
Contributor Author

Hmm, I've merged in the latest from pandas master; I'm seeing some failed tests but still no segmentation faults, no corruption, and those print statements work fine. I'm going to try in an Ubuntu VM and see if I can get to the bottom of it.

@Komnomnomnom
Contributor Author

Ugh, I did not know merging into my fork would flood this pull request. It might be best to delete my current fork and submit a new pull request once this issue is sorted.

The good news is that after a bit of setup I was able to reproduce the memory corruption you are seeing in my Ubuntu VM. It appears to happen even when NpyArr_encodeLabels is not involved. There is also some weirdness with timestamp conversion, but I think that is a separate issue.

@Komnomnomnom
Contributor Author

I believe I've found the problem. The reference count of the object being encoded was mistakenly being decremented twice. I presume it was just chance that the memory layout or garbage collection schedule on my laptop meant the object wasn't being deleted.

There are a few more things I've noticed (like build clean deleting the C files, and datetime conversion now not working) which I'll fix before submitting a new pull request. I'll close this one for now and create a feature branch on a new fork to avoid this mess happening again.

@wesm
Member

wesm commented May 20, 2012

That will teach you not to develop in master ;) BTW, you don't need to re-fork: you can git reset --hard upstream/master and force-push that to GitHub. Just make sure you make a branch of your current master with the JSON work.

@Komnomnomnom
Contributor Author

Oops, too late, I re-forked a few minutes ago... hope this doesn't cause further problems :-/

BTW, if you want to test the fix on your machine, the offending line was line 278 in NpyArr_iterEnd:
cb7c6ae#L6R279
(That line should be removed.)

Also, I'm still noticing some timestamp weirdness. I'm guessing there were recent changes in master regarding datetime64? Is this work still ongoing?

@wesm
Member

wesm commented May 20, 2012

Yes, the work is still ongoing. Are the test failures in JSON encoding/decoding or elsewhere (the pydata/master test suite passes cleanly for me)? I should be able to fix them myself.

@jreback
Contributor

jreback commented Jun 11, 2013

Implemented via #3804.
