
ENH: JSON #3876

Merged · merged 4 commits into pandas-dev:master on Jun 13, 2013

Conversation

@jreback (Contributor) commented Jun 13, 2013

Revised the argument structure for read_json to control dtype conversions, which are all on by default:

  • convert_axes : turn this off if, for some reason, you want to skip dtype conversion on the axes (only really necessary if you have string-like numbers)
  • dtype : now accepts a dict of name -> dtype for specific conversions, or True to try to coerce all
  • convert_dates : default True (in conjunction with keep_default_dates, determines which columns to attempt date conversion on)

DOC updates for all.
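A minimal sketch of the revised arguments in use (the sample JSON strings and chosen dtypes here are illustrative, not from the PR):

import pandas as pd

# dtype as a dict: coerce column 'a' explicitly, leave 'b' to inference
pd.read_json('[{"a": "1", "b": 2}, {"a": "2", "b": 4}]', dtype={'a': 'int64'})

# convert_axes=False: keep string-like axis labels as strings
# instead of coercing them to ints
pd.read_json('{"0": {"a": 1}, "1": {"a": 2}}', convert_axes=False)

# convert_dates=True (the default) attempts date conversion on
# date-like columns; keep_default_dates controls which are attempted
pd.read_json('[{"modified": "2013-06-13", "a": 1}]', convert_dates=True)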

@hayd (Contributor) commented Jun 13, 2013

Does this fix the example below? Do you mind adding it as a test case:

In [5]: pd.read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]')
Out[5]:
   0  1
a  1  2
b  2  1
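Something like this, perhaps (the test name and expected frame are mine; tm.assert_frame_equal is pandas' test helper):

import pandas as pd
import pandas.util.testing as tm

def test_unordered_record_keys():
    # keys appear in a different order in the second record
    result = pd.read_json('[{"a": 1, "b": 2}, {"b": 2, "a": 1}]')
    expected = pd.DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
    tm.assert_frame_equal(result, expected)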

@jreback (Contributor, Author) commented Jun 13, 2013

@wesm
cc @hayd
cc @Komnomnomnom

So the way I view the numpy flag, it assumes an ordering; or is this only in combination with labelled=True?

(Pdb) read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=False)
   a  b
0  1  2
1  1  2

passing numpy=True also passes labelled=True:

(Pdb) read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True)
   0  1
a  1  2
b  2  1

Here are the direct returns:

(Pdb) pd.json.loads('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=False)
[{u'a': 1, u'b': 2}, {u'a': 1, u'b': 2}]
(Pdb) pd.json.loads('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True,labelled=True)
(array([[1, 2],
       [2, 1]]), None, array([u'a', u'b'], 
      dtype='<U1'))

@Komnomnomnom (Contributor) commented Jun 13, 2013

@jreback with numpy=True the parser tries to decode directly to numpy arrays and needs the orient parameter to be correct; otherwise it will probably fail or, as in this case, give you transposed output.

In [4]: pd.read_json('[{"a": 1, "b": 2}, {"b":2, "a" :1}]',numpy=True, orient='records')
Out[4]: 
   a  b
0  1  2
1  2  1

The labelled option is only used in conjunction with the numpy option and denotes that a pandas object is encoded to JSON objects e.g. encoded with orients like index, columns or records. When decoding labelled input it only constructs the index and column arrays once, in this case from the first JSON object, and assumes the rest of the entries will appear in the same order. This was done for convenience / performance and should work fine if an object is encoded / decoded using pandas methods, as order should be conserved. It obviously falls apart when consuming JSON from other sources though (the JSON spec makes no guarantee about order AFAIK).

So the numpy-enhanced decoder will need to be altered to support unordered input, or do you think it is enough to document that it expects ordered input and encourage users to use numpy=False if this is not the case? I'm happy to make the changes if support for unordered input is desired in the numpy version, but it'll be a few days.

BTW on a semi-related note I've just noticed that the doc string info for DataFrame encoding got lost along the way somewhere. read_json's info on orient only applies to Series. For DataFrame there's a bit more to it:

'''
orient : {'split', 'records', 'index', 'columns', 'values'},
             default 'columns'
        The format of the JSON string
        split : dict like
            {index -> [index], columns -> [columns], data -> [values]}
        records : list like [{column -> value}, ... , {column -> value}]
        index : dict like {index -> {column -> value}}
        columns : dict like {column -> {index -> value}}
        values : just the values array
'''

(taken from here)
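For a concrete picture, here is roughly what each orient produces for a small frame (outputs written from memory, so treat them as approximate):

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])

df.to_json(orient='split')    # {"columns":["a"],"index":["x","y"],"data":[[1],[2]]}
df.to_json(orient='records')  # [{"a":1},{"a":2}]
df.to_json(orient='index')    # {"x":{"a":1},"y":{"a":2}}
df.to_json(orient='columns')  # {"a":{"x":1,"y":2}}  (the default)
df.to_json(orient='values')   # [[1],[2]]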

@jreback (Contributor, Author) commented Jun 13, 2013

@Komnomnomnom

Taking the 2nd issue first: the docstring appears for me in IPython (pd.read_json?). Where are you looking?

pd.read_json is now a top-level function, rather than a method on a specific return object (this makes it consistent with the other io routines), so the docstring is not there on an object (nor can you call DataFrame.read_json).

On the 1st issue:

I understand why you made the numpy option work the way you did. I think it is fine, but it assumes an ordering in the input stream (which, I agree, is maybe not so prevalent).

numpy=True is the default option now; maybe I should flip them, so that if you are sure your input is ordered you can pass numpy=True.

I am not sure how JSON in the real world is usually presented, anyone? @hayd

Commits pushed:

  • ENH: changed dtype argument to accept a dict for a per-column dtype conversion, or turn off conversion (default is True)
  • ENH: changed parse_dates to convert_dates, now defaulting to True
  • BUG: not processing correctly some parsable JSON

@jreback (Contributor, Author) commented Jun 13, 2013

@Komnomnomnom
I changed the default to not do numpy parsing first; works fine.

Is it expensive (performance-wise) to verify that the ordering is consistent when running with numpy=True?

@hayd (Contributor) commented Jun 13, 2013

@jreback I expect that there is a lot of funky json out there (and it's unordered in the spec), so in my opinion unordered should be assumed (default)...

@jreback (Contributor, Author) commented Jun 13, 2013

yep....latest commit now makes that the default (and I don't think it's that much slower in any event)

happens to be slightly faster here

In [1]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read())
CPU times: user 5.88 s, sys: 0.38 s, total: 6.26 s
Wall time: 6.28 s

In [2]: %time with open('citylots.json', 'r') as f: pd.read_json(f.read(),numpy=True)
CPU times: user 6.33 s, sys: 0.18 s, total: 6.51 s
Wall time: 6.53 s

@Komnomnomnom (Contributor) commented Jun 13, 2013

@jreback the docstring appears but the information for the orient parameter only applies to Series. orient has some different values and the JSON format is slightly different for DataFrame.

Here's the docstring I see

Convert JSON string to pandas object

Parameters
----------
filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be
    a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host
    is expected. For instance, a local file could be
    file ://localhost/path/to/table.json
orient : {'split', 'records', 'index'}, default 'index'
    The format of the JSON string
    split : dict like
        {index -> [index], name -> name, data -> [values]}
    records : list like [value, ... , value]
    index : dict like {index -> value}
typ : type of object to recover (series or frame), default 'frame'
dtype : dtype of the resulting object
numpy: direct decoding to numpy arrays. default True but falls back
    to standard decoding if a problem occurs.
parse_dates : a list of columns to parse for dates; If True, then try to parse datelike columns
    default is False
keep_default_dates : boolean, default True. If parsing dates,
    then parse the default datelike columns

Returns
-------
result : Series or DataFrame

I guess the information presented for orient needs to be split into DataFrame and Series details; the rest of the parameters are the same for both, I think.

If numpy=True is not the default then I don't think it should fall back to the standard decoder, if it fails then the caller should handle it, as they've specifically requested direct-to-numpy behaviour and should know if it failed. What do you think?

Regarding supporting unordered input and the performance impact, I guess once the index / columns (labels) are parsed from the first JSON object into numpy arrays it would then need to perform a lookup on those arrays to get the correct position when parsing subsequent JSON objects. Another parameter to turn this behaviour on/off probably?
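In pure Python the lookup would amount to something like this (an illustrative sketch of the idea, not the actual C decoder; names are mine):

import numpy as np

def decode_unordered(records, columns):
    # build the label -> position map once, from the first object's keys
    pos = {label: i for i, label in enumerate(columns)}
    out = np.empty((len(records), len(columns)), dtype=object)
    for i, rec in enumerate(records):
        for key, value in rec.items():
            # look up each key's position rather than assuming order
            out[i, pos[key]] = value
    return out

decode_unordered([{'a': 1, 'b': 2}, {'b': 2, 'a': 1}], ['a', 'b'])
# -> array([[1, 2],
#           [1, 2]], dtype=object)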

The numpy version is only valuable if you're decoding numeric dtypes (i.e. no strings apart from the columns and index labels). If there are any strings in the values it will fall back to the default decoder anyway. I don't know what json you tested with but I'm guessing that's probably why it ends up being slower for you (it tries numpy first and then falls back to the standard decoder).

@hayd (Contributor) commented Jun 13, 2013

That big city json I found is a terrible example (sorry) - it's really nested.

+1 on not falling back from numpy=True.

@Komnomnomnom (Contributor) commented Jun 13, 2013

Also, always passing dtype=None to the decoder when using numpy=True may cause some issues. If no dtype is provided, the decoder performs a pretty basic attempt to sniff the dtype for the numpy arrays it is about to fill and may get it wrong, i.e. int instead of, say, a desired float dtype.

Seeing as the dtype param is now used for coercing values after decoding perhaps the numpy param could be altered to accept a dtype as well as boolean, and this would be passed to the decoder for its own use? Or am I just making things too confusing?
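If I follow, the usage would look something like this (hypothetical API, not implemented):

# hypothetical: a dtype here would be handed to the decoder so it fills
# float arrays directly instead of sniffing (and possibly guessing int)
pd.read_json('[{"a": 1, "b": 2}]', numpy='float64')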

Commit pushed: …ordered JSON

      eliminated fallback parsing with numpy=True; this will raise ValueError if it fails to parse (a known case is strings in the frame data)
@jreback (Contributor, Author) commented Jun 13, 2013

OK, the default is now changed to numpy=False, and there is no fallback parsing when True.
The only case that raised an error was all (or some) string values in the frame.

@Komnomnomnom fixed up the docstrings & docs to reflect Series/DataFrame differences with orientation

As far as passing dtype=None, I found no issues with this. The only future request, and I'm not sure how useful it would be, is a way to turn off ALL dtype inference (and just return object); this is the theory behind passing dtype=False, but you still parse floats/ints. I don't think this is that important, as the inference is good on your end, and I have options for most of the potentially problematic inference once you are done parsing anyhow.

just a thought

In [1]: pd.read_json(DataFrame('foo',index=range(3),columns=list('AB')).to_json())
Out[1]: 
     A    B
0  foo  foo
1  foo  foo
2  foo  foo

In [2]: pd.read_json(DataFrame('foo',index=range(3),columns=list('AB')).to_json(),numpy=True)
ValueError: Cannot decode multidimensional arrays with variable length elements to numpy

@Komnomnomnom (Contributor) commented Jun 13, 2013

That's great thanks @jreback :-)

As for the dtype inference you could be right, I initially wrote it with just basic types in mind and deliberately fell back for anything else (hence the ValueError). I can't remember exactly but I think there were some complications with building numpy arrays in C and variable length elements (strings, objects etc).

@jreback merged commit 3d98544 into pandas-dev:master on Jun 13, 2013
@jreback (Contributor, Author) commented Jun 13, 2013

ok....let's all play around with this some more...but +1 to @Komnomnomnom for making it easy!
