ENH: Add JSON export option for DataFrame #631 #1226
Conversation
Bundle custom ujson lib for DataFrame and Series JSON export & import.
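For context, a minimal round-trip sketch of the API this patch adds (DataFrame.to_json and DataFrame.from_json, with an equivalent pair on Series, as exercised later in this thread); the example data is made up:

from pandas import DataFrame

df = DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
json_str = df.to_json()              # encode via the bundled ujson fork
df2 = DataFrame.from_json(json_str)  # decode a JSON string into a DataFrame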
I don't think we should be bundling a JSON encoder. There's been a json module in the standard library since Python 2.6, and it's simple enough to install other implementations if the user needs e.g. more speed. Let's just have a little shim module that tries to import JSON APIs in order of preference.
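A minimal sketch of the shim module described here, assuming the usual candidates of the time (ujson, simplejson, then the stdlib json):

try:
    import ujson as json           # C implementation, fastest when installed
except ImportError:
    try:
        import simplejson as json  # commonly faster than the stdlib back then
    except ImportError:
        import json                # stdlib, available since Python 2.6

dumps, loads = json.dumps, json.loads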
@takluyver there's a bit of a discussion already at #631, not sure if you're aware of it. I should have added more info in the description though, sorry. The main motivation for including this fork of ujson in pandas is that it works with pandas datatypes at a very low level (it is pure C), so it wouldn't be of any benefit to non-pandas users. If a user wants to use their own favourite JSON decoder they would obviously still be free to do so. However, I'll admit that high-performance JSON serialisation is probably a minor requirement for most people's needs, so I'm happy either way.
Thanks, I wasn't aware of that. I'm still not wild about the approach - it seems like it will make for a heavier library and a bigger codebase to maintain. But Wes seems to be happy with the idea, so you don't have to worry about my objections ;-) A couple of practical questions:

- Your README has a lot of benchmarks, but I haven't taken the time to work out what they all mean. Can you summarise what sort of improvement we see from forking ujson, versus the best we could do with a stock build?
- What sort of workloads do we envisage - is the bottleneck one huge DataFrame, or thousands of smaller ones?
- Assuming ujson is still actively developed, how important and how easy will it be to get updates from upstream in the future?
When working with numpy types:
DataFrames:
And this is on top of ujson already being one of the speediest JSON libraries. My specific use case is the need to share lots of DataFrames, with a mix of sizes, between Python processes (and other languages). JSON was the natural choice for us because of portability, and we wanted to get the best performance out of it.

ujson is a relatively small and stable library. There have only been some minor patches in the last few months and the author seems pretty open to pull requests etc. I'll be merging any applicable upstream changes into my fork, and I'd be happy to do the same for pandas if it ends up being integrated. I'm pretty familiar with the ujson code now (it's really only four files) and I'd likewise be happy to deal with any bugs / enhancements coming from pandas usage too.

It's worth noting that the library is split into two parts: the language-agnostic JSON encoder / decoder, and the Python bindings. I managed to keep the bulk of my changes limited to the Python bindings, and even then they are new functions / new code rather than changes to existing functions. My point being, upstream changes should be easy enough to merge.
Thanks, that all sounds pretty reasonable, and I'm satisfied that this is worth doing.
This is really excellent work, thanks so much for doing this. Yeah, I was initially a bit hesitant to bundle ujson, but given that more and more people want to do JS<->pandas integration, getting the best possible encoding/decoding performance and being able to access the NumPy arrays directly in the C encoder makes a lot of sense. We'll have to periodically pull in upstream changes from ujson, I guess.
just curious, how would this handle nested JSON? i.e.

j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments': ['Brownian motion', 'Special Relativity', 'General Relativity']}}}
df = pandas.DataFrame(j)
df = ?
From a performance standpoint not very well I'm afraid: the numpy-with-labels handling bombs out if it detects more than two levels of nesting. It could probably be tweaked to deal with this better, but when decoding complex types (i.e. objects and strings) a Python list is needed as an intermediary anyway, so I'm not sure there'd be any advantage. The good news is that the methods in DataFrame and Series fall back to standard decoding if the numpy version fails, so it should still work as expected, albeit without the performance improvements. Just tested it out to make sure:

In [1]: from pandas import DataFrame
In [2]: j = {'person' : {'first_name' : 'Albert', 'last_name' : 'Einstein', 'occupation': {'job_title': 'Theoretical Physicist', 'institution' : 'Princeton University', 'accomplishments':['Brownian motion', 'Special Relativity', 'General Relativity']}}}
In [3]: df = DataFrame(j)
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person 3 non-null values
dtypes: object(1)
In [5]: df['person']['occupation']
Out[5]:
{'accomplishments': ['Brownian motion',
'Special Relativity',
'General Relativity'],
'institution': 'Princeton University',
'job_title': 'Theoretical Physicist'}
In [6]: df.to_json()
Out[6]: '{"person":{"first_name":"Albert","last_name":"Einstein","occupation":{"accomplishments":["Brownian motion","Special Relativity","General Relativity"],"institution":"Princeton University","job_title":"Theoretical Physicist"}}}'
In [7]: json = df.to_json()
In [8]: DataFrame.from_json(json)
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, first_name to occupation
Data columns:
person 3 non-null values
dtypes: object(1)
In [9]: DataFrame.from_json(json)['person']['occupation']
Out[9]:
{u'accomplishments': [u'Brownian motion',
u'Special Relativity',
u'General Relativity'],
u'institution': u'Princeton University',
u'job_title': u'Theoretical Physicist'}

Edit: I should have mentioned the comments above are related to decoding only. Encoding does not suffer the same issues and the performance improvements still apply.
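A rough sketch of the fallback behaviour described above (not the actual pandas source; decode_numpy stands in for the numpy-aware C fast path, which raises when it hits deeply nested input):

import json

def from_json_with_fallback(json_str, decode_numpy):
    # Try the fast numpy-aware decoder first.
    try:
        return decode_numpy(json_str)
    except ValueError:
        # More than two levels of nesting (or other complex structure):
        # fall back to standard decoding via plain Python objects.
        return json.loads(json_str)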
Hey @Komnomnomnom I started to see if I could merge this and am getting a segfault on my system (Python 2.7.2, NumPy 1.6.1, 64-bit Ubuntu). The object returned by series.to_json(orient='columns') in

I can probably track down the problem, but since you wrote the C code I figure you'd be better placed to fix it if you can reproduce the error.
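A guess at a minimal reproduction, based only on the call mentioned above (the data is made up):

import numpy as np
from pandas import Series

s = Series(np.random.randn(10))
out = s.to_json(orient='columns')  # the call that segfaulted on 64-bit Ubuntu
print(out)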
Hi Wes, I just tried with my local clone of my fork and had no segmentation fault (all tests passed when I made my commit / pull request). I'll merge in the latest from pandas master and see what happens. For the record I'm using Python 2.7.2, numpy 1.6.1 on 64-bit OSX.
I put in print statements
and here's the output
Somehow the result of
It looks like something is getting corrupted:
It looks like the culprit must be
Hmm, I've merged in the latest from pandas master; I'm seeing some failed tests but still no segmentation faults, no corruption, and those print statements work fine. I'm going to try in an Ubuntu VM and see if I can get to the bottom of it.
…algos extension
Pandas was using some of the enums and structures exposed by its headers. By creating its own local copies of these, it is possible to allow the numpy ABI to be improved while in its experimental state.
As mentioned in #1098.
Ugh, I did not know merging into my fork would flood this pull request. It might be best to delete my current fork and submit a new pull request once this issue is sorted. The good news is that after a bit of setup I was able to reproduce the memory corruption you are seeing in my Ubuntu VM. It appears to happen even when
I believe I've found the problem: the reference count of the object being encoded was mistakenly being decremented twice. I presume it was just chance that the memory layout or garbage collection schedule on my laptop meant the object wasn't being deleted. There are a few more things I've noticed (like
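One way to observe a double-decrement bug like this from Python, using sys.getrefcount (a sketch, not taken from the thread):

import sys
from pandas import DataFrame

df = DataFrame({'a': [1, 2, 3]})
before = sys.getrefcount(df)
df.to_json()
after = sys.getrefcount(df)
# With the extra decref, `after` comes out below `before`; once the count
# reaches zero the object is freed while still live, hence the memory
# corruption and platform-dependent segfaults.
assert after == before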
That will teach you not to develop in master ;) BTW, you don't need to re-fork; you can
Oops, too late, I re-forked a few minutes ago... hope this doesn't cause further problems :-/ BTW if you want to test the fix on your machine, the offending line was 278 in

Also I'm still noticing some timestamp weirdness; I'm guessing there were changes recently in master regarding datetime64? Is this work still ongoing?
Yes, the work is still ongoing. Are the test failures in JSON encoding/decoding or elsewhere (the pydata/master test suite passes cleanly for me)? I should be able to fix them myself.
Implemented via #3804.