ENH: JSON #3876
Conversation
Does this fix this example? Do you mind adding as a test case:
@wesm so the way I view the
passes
Here's the direct returns
@jreback with

In [4]: pd.read_json('[{"a": 1, "b": 2}, {"b": 2, "a": 1}]', numpy=True, orient='records')
Out[4]:
   a  b
0  1  2
1  2  1

The labelled option is only used in conjunction with the numpy option and denotes that a pandas object is encoded to JSON objects, e.g. encoded with orients like index, columns or records. When decoding labelled input it only constructs the index and column arrays once, in this case from the first JSON object, and assumes the rest of the entries will appear in the same order. This was done for convenience / performance and should work fine if an object is encoded / decoded using pandas methods, as order should be conserved. It obviously falls apart when consuming JSON from other sources though (the JSON spec makes no guarantee about order AFAIK). So the numpy-enhanced decoder will need to be altered to support unordered input, or do you think it is enough to document that it expects ordered input and encourage users to use

BTW on a semi-related note I've just noticed that the doc string info for

'''
orient : {'split', 'records', 'index', 'columns', 'values'},
default 'columns'
The format of the JSON string
split : dict like
{index -> [index], columns -> [columns], data -> [values]}
records : list like [{column -> value}, ... , {column -> value}]
index : dict like {index -> {column -> value}}
columns : dict like {column -> {index -> value}}
values : just the values array
''' (taken from here)
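For reference, the default (non-numpy) decoder aligns records by key name, so the unordered payload above round-trips correctly without the labelled-input assumption. A minimal sketch (wrapping the literal in `StringIO`, which recent pandas expects for string input):

```python
from io import StringIO
import pandas as pd

# Same payload as above: the second record lists its keys in a different
# order. The default decoder matches values by column name, so both rows
# come out as a=1, b=2.
df = pd.read_json(StringIO('[{"a": 1, "b": 2}, {"b": 2, "a": 1}]'),
                  orient='records')
```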
2nd issue first: the doc string appears for me in IPython (pd.read_json?).

1st issue: I understand why you made the numpy option the way you did. I think it is fine, but it assumes an ordering in the input stream (which I agree is maybe not so prevalent).

I am not sure how JSON in the real world is usually presented, anyone? @hayd
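To illustrate the orients quoted in the docstring, here is a small round-trip through orient='split' (the DataFrame contents and labels are invented for the example):

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

# 'split' encodes columns, index and data as separate arrays, roughly:
# {"columns":["a"],"index":["x","y"],"data":[[1],[2]]}
s = df.to_json(orient="split")
df2 = pd.read_json(StringIO(s), orient="split")
```

Because the labels travel separately from the data, 'split' avoids repeating column names for every row, unlike 'records'.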
…ther than trying not-numpy for dtypes)
ENH: changed dtype argument to accept a dict for a per-column dtype conversion, or turn off conversion (default is True)
ENH: changed parse_dates to convert_dates, now defaulting to True
BUG: not processing correctly some parsable JSON
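The per-column dtype conversion described in the commit message can be exercised like this (the column names are invented for the example):

```python
from io import StringIO
import pandas as pd

# dtype as a dict applies a specific dtype per column; here "a" arrives
# as a JSON string but is coerced to int64, while "b" is left to the
# default conversion.
df = pd.read_json(StringIO('[{"a": "1", "b": 2.5}]'),
                  orient="records", dtype={"a": "int64"})
```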
@Komnomnomnom is it (performance-wise) expensive to verify that the ordering is consistent when running with
@jreback I expect that there is a lot of funky JSON out there (and it's unordered in the spec), so in my opinion unordered should be assumed (default)...
Yep... the latest commit now makes that the default (and I don't think it's that much slower in any event; it happens to be slightly faster here).
@jreback the docstring appears but the information for the orient parameter only applies to

Here's the docstring I see:

I guess the information presented for orient needs to be split into DataFrame and Series details; the rest of the parameters are the same for both I think. If

Regarding supporting unordered input and the performance impact: I guess once the index / columns (labels) are parsed from the first JSON object into numpy arrays, it would then need to perform a lookup on those arrays to get the correct position when parsing subsequent JSON objects. Another parameter to turn this behaviour on/off, probably?

The numpy version is only valuable if you're decoding numeric dtypes (i.e. no strings apart from the columns and index labels). If there are any strings in the values it will fall back to the default decoder anyway. I don't know what JSON you tested with, but I'm guessing that's probably why it ends up being slower for you (it tries numpy first and then falls back to the standard decoder).
That big-city JSON I found is a terrible example (sorry) - it's really nested. +1 on not falling back from numpy=True.
Also always passing in

Seeing as the dtype param is now used for coercing values after decoding, perhaps the numpy param could be altered to accept a dtype as well as a boolean, and this would be passed to the decoder for its own use? Or am I just making things too confusing?
…ordered JSON; eliminated fallback parsing with numpy=True. This will raise ValueError if it fails to parse (a known case is strings in the frame data).
Ok, default is changed now to

@Komnomnomnom fixed up the docstrings & docs to reflect Series/DataFrame differences with orientation. As far as passing

just a thought
That's great, thanks @jreback :-) As for the dtype inference, you could be right; I initially wrote it with just basic types in mind and deliberately fell back for anything else (hence the ValueError). I can't remember exactly, but I think there were some complications with building numpy arrays in C and variable-length elements (strings, objects etc.).
Ok... let's everyone play around some more with this... but +1 to @Komnomnomnom for making it easy!
Revised argument structure for read_json to control dtype conversions, which are all on by default:

- convert_axes: if you for some reason want to turn off dtype conversion on the axes (only really necessary if you have string-like numbers)
- dtype: now accepts a dict of name -> dtype for specific conversions, or True to try to coerce all
- convert_dates: default True (in conjunction with keep_default_dates, determines which columns to attempt date conversion)

DOC updates for all.
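A sketch of the convert_dates / keep_default_dates behaviour summarised above (the column name "created_at" is invented for the example; "_at" suffixes are among the default date-like names):

```python
from io import StringIO
import pandas as pd

raw = '[{"created_at": 1371513600000, "x": 1}]'

# convert_dates=False: the epoch-millisecond value stays numeric.
df_raw = pd.read_json(StringIO(raw), orient="records", convert_dates=False)

# Default (convert_dates=True with keep_default_dates=True): columns whose
# names look date-like, e.g. ending in "_at", become datetime64.
df_dates = pd.read_json(StringIO(raw), orient="records")
```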