json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

aerymilts · 2018-03-07T06:31:27Z

Code Sample, a copy-pastable example if possible

data_fail_to_normalize = \
        [{'info': None}, 
         
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

data_partial_fail = \
        [{'info': None, 
         'author_name': 
         {'first': 'Smith', 'last_name': 'Appleseed'}
        }, 
        
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

>>> import pandas as pd
>>> pd.io.json.json_normalize(data_fail_to_normalize)

Output 1

	author_name	info
0	nan	None
1	{'first': 'Jane', 'last_name': 'Doe'}	{'created_at': '11/08/1993', 'last_updated': '26/05/2012'}

>>> pd.io.json.json_normalize(data_partial_fail)

Output 2

	author_name.first	author_name.last_name	info	info.created_at	info.last_updated
0	Smith	Appleseed	nan	nan	nan
1	Jane	Doe	nan	11/08/1993	26/05/2012

Problem description

I expected that the json_normalize function takes into account the presence of NoneTypes in the dictionaries. This leads to 2 separate issues (If I should open this as 2 separate issues, let me know).

I have already written a fix that solves this issue - if anyone else can validate that this is not working as intended, I can set up a PR.

Output 1

Does not unnest json after encountering NoneType at first instance of subrecord, line 192 of pandas/io/json/normalize.py

Output 2

Keeps the None value when encountered, [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}], see nested_to_record function. Creates additional column of nans which would not otherwise occur if that particular key was removed.

Expected Output

Output 1

	author_name.first	author_name.last_name	info.created_at	info.last_updated
0	nan	nan	nan	nan
1	Jane	Doe	11/08/1993	26/05/2012

Output 2

	author_name.first	author_name.last_name	info.created_at	info.last_updated
0	Smith	Appleseed	nan	nan
1	Jane	Doe	11/08/1993	26/05/2012

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

gfyoung · 2018-03-08T17:53:46Z

I would personally prefer the second option. Feel free to post the PR!

aerymilts · 2018-03-18T07:17:45Z

@gfyoung Were you referring to opening this as two separate issues?

gfyoung · 2018-03-18T07:41:37Z

@aerymilts : I was referring to the second output being preferable to the first.

…0030) TST: additional coverage for the test cases from (pandas-dev#20030) DOC: added changes to whatsnew/v0.23.0.txt (pandas-dev#20030)

)

…ame_describe * upstream/master: (158 commits) Add link to "Craft Minimal Bug Report" blogpost (pandas-dev#20431) BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399) BUG: ExtensionArray.fillna for scalar values (pandas-dev#20412) DOC" update the Pandas core window rolling count docstring" (pandas-dev#20264) DOC: update the pandas.DataFrame.plot.hist docstring (pandas-dev#20155) DOC: Only use ~ in class links to hide prefixes. (pandas-dev#20402) Bug: Allow np.timedelta64 objects to index TimedeltaIndex (pandas-dev#20408) DOC: add disallowing of Series construction of len-1 list with index to whatsnew (pandas-dev#20392) MAINT: Remove weird pd file DOC: update the Index.isin docstring (pandas-dev#20249) BUG: Handle all-NA blocks in concat (pandas-dev#20382) DOC: update the pandas.core.resample.Resampler.fillna docstring (pandas-dev#20379) BUG: Don't raise exceptions splitting a blank string (pandas-dev#20067) DOC: update the pandas.DataFrame.cummax docstring (pandas-dev#20336) DOC: update the pandas.core.window.x.mean docstring (pandas-dev#20265) DOC: update the api.types.is_number docstring (pandas-dev#20196) Fix linter (pandas-dev#20389) DOC: Improved the docstring of pandas.Series.dt.to_pytimedelta (pandas-dev#20142) DOC: update the pandas.Series.dt.is_month_end docstring (pandas-dev#20181) DOC: update the window.Rolling.min docstring (pandas-dev#20263) ...

…0030) (pandas-dev#20399)

gfyoung added the IO JSON read_json, to_json, json_normalize label Mar 8, 2018

gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO JSON read_json, to_json, json_normalize and removed IO JSON read_json, to_json, json_normalize labels Mar 8, 2018

aerymilts mentioned this issue Mar 18, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (#20030) #20399

Merged

4 tasks

jreback added this to the 0.23.0 milestone Mar 20, 2018

aerymilts added a commit to aerymilts/pandas that referenced this issue Mar 20, 2018

TST: updated test purposes and how nan is declared (pandas-dev#20030)

76d2d91

jreback closed this as completed in #20399 Mar 20, 2018

jreback pushed a commit that referenced this issue Mar 20, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (#20030) (#20399

01882ba

)

dworvos pushed a commit to dworvos/pandas that referenced this issue Apr 2, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#2…

086c68d

…0030) (pandas-dev#20399)

This was referenced May 24, 2018

json_normalize gives KeyError in 0.23 #21158

Closed

BUG: Fix nested_to_record with None values in nested levels #21164

Merged

WillAyd mentioned this issue Jun 7, 2018

JSON nested_to_record Silently Drops Top-Level None Values #21356

Closed

wujiayikelly mentioned this issue Jul 9, 2018

json_normalize should supply empty columns if record_path are not present #21830

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

aerymilts commented Mar 7, 2018

INSTALLED VERSIONS

gfyoung commented Mar 8, 2018

aerymilts commented Mar 18, 2018

gfyoung commented Mar 18, 2018

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

Comments

aerymilts commented Mar 7, 2018

Code Sample, a copy-pastable example if possible

Output 1

Output 2

Problem description

Output 1

Output 2

Expected Output

Output 1

Output 2

Output of pd.show_versions()

INSTALLED VERSIONS

gfyoung commented Mar 8, 2018

aerymilts commented Mar 18, 2018

gfyoung commented Mar 18, 2018

Output of `pd.show_versions()`