
ENH: JSON #3876

Merged · merged 4 commits into pandas-dev:master on Jun 13, 2013

Conversation

@jreback (Contributor) commented Jun 13, 2013

Revised the argument structure for read_json to control dtype conversions, which are all on by default:

  • convert_axes : turn this off if, for some reason, you want to skip dtype conversion on the axes (only really necessary if you have string-like numbers)
  • dtype : now accepts a dict of name -> dtype for specific conversions, or True to try to coerce all
  • convert_dates : default True (in conjunction with keep_default_dates, determines which columns to attempt date conversion on)

DOC updates for all.
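A minimal sketch of the revised arguments in use (the sample JSON strings and chosen dtypes here are illustrative, not from the PR):

import pandas as pd

# dtype as a dict: coerce column 'a' explicitly, leave 'b' to inference
pd.read_json('[{"a": "1", "b": 2}, {"a": "2", "b": 4}]', dtype={'a': 'int64'})

# convert_axes=False: keep string-like axis labels as strings
# instead of coercing them to ints
pd.read_json('{"0": {"a": 1}, "1": {"a": 2}}', convert_axes=False)

# convert_dates=True (the default) attempts date conversion on
# date-like columns; keep_default_dates controls which are attempted
pd.read_json('[{"modified": "2013-06-13", "a": 1}]', convert_dates=True)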

@hayd (Contributor) commented Jun 13, 2013

Does this fix the example below? Do you mind adding it as a test case:

In [5]: pd.read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]')
Out[5]:
   0  1
a  1  2
b  2  1
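Something like this, perhaps (the test name and expected frame are mine; tm.assert_frame_equal is pandas' test helper):

import pandas as pd
import pandas.util.testing as tm

def test_unordered_record_keys():
    # keys appear in a different order in the second record
    result = pd.read_json('[{"a": 1, "b": 2}, {"b": 2, "a": 1}]')
    expected = pd.DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
    tm.assert_frame_equal(result, expected)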

@jreback (Contributor, Author) commented Jun 13, 2013

@wesm
cc @hayd
cc @Komnomnomnom

So the way I view the numpy flag, it assumes an ordering; or is this only in combination with labelled=True?

(Pdb) read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=False)
   a  b
0  1  2
1  1  2

passing numpy=True also passes labelled=True:

(Pdb) read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True)
   0  1
a  1  2
b  2  1

Here are the direct returns:

(Pdb) pd.json.loads('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=False)
[{u'a': 1, u'b': 2}, {u'a': 1, u'b': 2}]
(Pdb) pd.json.loads('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True,labelled=True)
(array([[1, 2],
       [2, 1]]), None, array([u'a', u'b'], 
      dtype='<U1'))

@Komnomnomnom (Contributor) commented Jun 13, 2013

@jreback with numpy=True the parser tries to decode directly to numpy arrays and needs the orient parameter to be correct; otherwise it will probably fail or, as in this case, give you transposed output.

In [4]: pd.read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True, orient='records')
Out[4]: 
   a  b
0  1  2
1  2  1

The labelled option is only used in conjunction with the numpy option and denotes that a pandas object is encoded to JSON objects e.g. encoded with orients like index, columns or records. When decoding labelled input it only constructs the index and column arrays once, in this case from the first JSON object, and assumes the rest of the entries will appear in the same order. This was done for convenience / performance and should work fine if an object is encoded / decoded using pandas methods, as order should be conserved. It obviously falls apart when consuming JSON from other sources though (the JSON spec makes no guarantee about order AFAIK).

So the numpy-enhanced decoder will need to be altered to support unordered input, or do you think it is enough to document that it expects ordered input and encourage users to use numpy=False if this is not the case? I'm happy to make the changes if support for unordered input is desired in the numpy version, but it'll be a few days.

BTW on a semi-related note I've just noticed that the doc string info for DataFrame encoding got lost along the way somewhere. read_json's info on orient only applies to Series. For DataFrame there's a bit more to it:

'''
orient : {'split', 'records', 'index', 'columns', 'values'},
             default 'columns'
        The format of the JSON string
        split : dict like
            {index -> [index], columns -> [columns], data -> [values]}
        records : list like [{column -> value}, ... , {column -> value}]
        index : dict like {index -> {column -> value}}
        columns : dict like {column -> {index -> value}}
        values : just the values array
'''

(taken from here)
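For a concrete picture, here is roughly what each orient produces for a small frame (outputs written from memory, so treat them as approximate):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

df.to_json(orient='split')    # {"columns":["a"],"index":["x","y"],"data":[[1],[2]]}
df.to_json(orient='records')  # [{"a":1},{"a":2}]
df.to_json(orient='index')    # {"x":{"a":1},"y":{"a":2}}
df.to_json(orient='columns')  # {"a":{"x":1,"y":2}}  (the default)
df.to_json(orient='values')   # [[1],[2]]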

@jreback (Contributor, Author) commented Jun 13, 2013

@Komnomnomnom

Taking the 2nd issue first: the docstring appears for me in IPython (pd.read_json?). Where are you looking?

pd.read_json is now a top-level function, rather than a method on a specific return object (this makes it consistent with the other io routines), so the docstring is not there on an object (nor can you call DataFrame.read_json).

On the 1st issue:

I understand why you made the numpy option work the way you did. I think it is fine, but it assumes an ordering in the input stream (which, I agree, is maybe not so prevalent).

numpy=True is the default option now; maybe I should flip them, so that if you are sure your input is ordered you can pass numpy=True.

I am not sure how JSON in the real world is usually presented, anyone? @hayd

Commits pushed:

  • ENH: changed dtype argument to accept a dict for a per-column dtype conversion, or turn off conversion (default is True)
  • ENH: changed parse_dates to convert_dates, now defaulting to True
  • BUG: not processing correctly some parsable JSON

@jreback (Contributor, Author) commented Jun 13, 2013

@Komnomnomnom
I changed the default to not do numpy parsing first; works fine.

Is it expensive (performance-wise) to verify that the ordering is consistent when running with numpy=True?

@hayd (Contributor) commented Jun 13, 2013

@jreback I expect that there is a lot of funky json out there (and it's unordered in the spec), so in my opinion unordered should be assumed (default)...

@jreback (Contributor, Author) commented Jun 13, 2013

yep....latest commit now makes that the default (and I don't think it's that much slower in any event)

happens to be slightly faster here

In [1]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 5.88 s, sys: 0.38 s, total: 6.26 s
Wall time: 6.28 s

In [2]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read(),numpy=True)
CPU times: user 6.33 s, sys: 0.18 s, total: 6.51 s
Wall time: 6.53 s

@Komnomnomnom (Contributor) commented Jun 13, 2013

@jreback the docstring appears but the information for the orient parameter only applies to Series. orient has some different values and the JSON format is slightly different for DataFrame.

Here's the docstring I see

Convert JSON string to pandas object

Parameters
----------
filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be
    a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host
    is expected. For instance, a local file could be
    file ://localhost/path/to/table.json
orient : {'split', 'records', 'index'}, default 'index'
    The format of the JSON string
    split : dict like
        {index -> [index], name -> name, data -> [values]}
    records : list like [value, ... , value]
    index : dict like {index -> value}
typ : type of object to recover (series or frame), default 'frame'
dtype : dtype of the resulting object
numpy: direct decoding to numpy arrays. default True but falls back
    to standard decoding if a problem occurs.
parse_dates : a list of columns to parse for dates; If True, then try to parse datelike columns
    default is False
keep_default_dates : boolean, default True. If parsing dates,
    then parse the default datelike columns

Returns
-------
result : Series or DataFrame

I guess the information presented for orient needs to be split into DataFrame and Series details; the rest of the parameters are the same for both, I think.

If numpy=True is not the default then I don't think it should fall back to the standard decoder, if it fails then the caller should handle it, as they've specifically requested direct-to-numpy behaviour and should know if it failed. What do you think?

Regarding supporting unordered input and the performance impact, I guess once the index / columns (labels) are parsed from the first JSON object into numpy arrays it would then need to perform a lookup on those arrays to get the correct position when parsing subsequent JSON objects. Another parameter to turn this behaviour on/off probably?
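In pure Python the lookup would amount to something like this (an illustrative sketch of the idea, not the actual C decoder; names are mine):

import numpy as np

def decode_unordered(records, columns):
    # build the label -> position map once, from the first object's keys
    pos = {label: i for i, label in enumerate(columns)}
    out = np.empty((len(records), len(columns)), dtype=object)
    for i, rec in enumerate(records):
        for key, value in rec.items():
            # look up each key's position rather than assuming order
            out[i, pos[key]] = value
    return out

decode_unordered([{'a': 1, 'b': 2}, {'b': 2, 'a': 1}], ['a', 'b'])
# -> array([[1, 2],
#           [1, 2]], dtype=object)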

The numpy version is only valuable if you're decoding numeric dtypes (i.e. no strings apart from the columns and index labels). If there are any strings in the values it will fall back to the default decoder anyway. I don't know what json you tested with but I'm guessing that's probably why it ends up being slower for you (it tries numpy first and then falls back to the standard decoder).

@hayd (Contributor) commented Jun 13, 2013

That big city json I found is a terrible example (sorry) - it's really nested.

+1 on not falling back from numpy=True.

@Komnomnomnom (Contributor) commented Jun 13, 2013

Also, always passing dtype=None to the decoder when using numpy=True may cause some issues. If no dtype is provided, the decoder performs a pretty basic attempt to sniff the dtype for the numpy arrays it is about to fill and may get it wrong, i.e. int instead of, say, a desired float dtype.

Seeing as the dtype param is now used for coercing values after decoding perhaps the numpy param could be altered to accept a dtype as well as boolean, and this would be passed to the decoder for its own use? Or am I just making things too confusing?
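If I follow, the usage would look something like this (hypothetical API, not implemented):

# hypothetical: a dtype here would be handed to the decoder so it fills
# float arrays directly instead of sniffing (and possibly guessing int)
pd.read_json('[{"a": 1, "b": 2}]', numpy='float64')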

Commit pushed: …ordered JSON

      eliminated fallback parsing with numpy=True; this will raise ValueError if it fails to parse (a known case is strings in the frame data)
@jreback (Contributor, Author) commented Jun 13, 2013

OK, the default is now changed to numpy=False, and there is no fallback parsing when True.
The only case that raised an error was all (or some) string values in the frame.

@Komnomnomnom fixed up the docstrings & docs to reflect Series/DataFrame differences with orientation

As far as passing dtype=None, I found no issues with this. The only future request, and I'm not sure how useful it would be, is a way to turn off ALL dtype inference (and just return object); this is the theory behind passing dtype=False, but you still parse floats/ints. I don't think this is that important, as the inference is good on your end, and I have options for most of the potentially problematic inference once you are done parsing anyhow.

just a thought

In [1]: pd.read_json(DataFrame('foo',index=range(3),columns=list('AB')).to_json())
Out[1]: 
     A    B
0  foo  foo
1  foo  foo
2  foo  foo

In [2]: pd.read_json(DataFrame('foo',index=range(3),columns=list('AB')).to_json(),numpy=True)
ValueError: Cannot decode multidimensional arrays with variable length elements to numpy

@Komnomnomnom (Contributor) commented Jun 13, 2013

That's great thanks @jreback :-)

As for the dtype inference you could be right, I initially wrote it with just basic types in mind and deliberately fell back for anything else (hence the ValueError). I can't remember exactly but I think there were some complications with building numpy arrays in C and variable length elements (strings, objects etc).

@jreback merged commit 3d98544 into pandas-dev:master on Jun 13, 2013
@jreback (Contributor, Author) commented Jun 13, 2013

ok....let's all play around with this some more...but +1 to @Komnomnomnom for making it easy!
