Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

Closed
aerymilts opened this issue Mar 7, 2018 · 3 comments · Fixed by #20399
Closed
Labels
Bug IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@aerymilts
Copy link
Contributor

Code Sample, a copy-pastable example if possible

data_fail_to_normalize = \
        [{'info': None}, 
         
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

data_partial_fail = \
        [{'info': None, 
         'author_name': 
         {'first': 'Smith', 'last_name': 'Appleseed'}
        }, 
        
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

>>> import pandas as pd
>>> pd.io.json.json_normalize(data_fail_to_normalize)

Output 1

author_name info
0 nan None
1 {'first': 'Jane', 'last_name': 'Doe'} {'created_at': '11/08/1993', 'last_updated': '26/05/2012'}
>>> pd.io.json.json_normalize(data_partial_fail)

Output 2

author_name.first author_name.last_name info info.created_at info.last_updated
0 Smith Appleseed nan nan nan
1 Jane Doe nan 11/08/1993 26/05/2012

Problem description

I expected that the json_normalize function takes into account the presence of NoneTypes in the dictionaries. This leads to 2 separate issues (If I should open this as 2 separate issues, let me know).

I have already written a fix that solves this issue - if anyone else can validate that this is not working as intended, I can set up a PR.

Output 1

Does not unnest json after encountering NoneType at first instance of subrecord, line 192 of pandas/io/json/normalize.py

Output 2

Keeps the None value when encountered, [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}], see nested_to_record function. Creates additional column of nans which would not otherwise occur if that particular key was removed.

Expected Output

Output 1

author_name.first author_name.last_name info.created_at info.last_updated
0 nan nan nan nan
1 Jane Doe 11/08/1993 26/05/2012

Output 2

author_name.first author_name.last_name info.created_at info.last_updated
0 Smith Appleseed nan nan
1 Jane Doe 11/08/1993 26/05/2012

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the IO JSON read_json, to_json, json_normalize label Mar 8, 2018
@gfyoung
Copy link
Member

gfyoung commented Mar 8, 2018

I would personally prefer the second option. Feel free to post the PR!

@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO JSON read_json, to_json, json_normalize and removed IO JSON read_json, to_json, json_normalize labels Mar 8, 2018
@aerymilts
Copy link
Contributor Author

@gfyoung Were you referring to opening this as two separate issues?

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2018

@aerymilts : I was referring to the second output being preferable to the first.

aerymilts added a commit to aerymilts/pandas that referenced this issue Mar 18, 2018
…0030)

TST: additional coverage for the test cases from (pandas-dev#20030)

DOC: added changes to whatsnew/v0.23.0.txt (pandas-dev#20030)
@jreback jreback added this to the 0.23.0 milestone Mar 20, 2018
nehiljain added a commit to nehiljain/pandas that referenced this issue Mar 21, 2018
…ame_describe

* upstream/master: (158 commits)
  Add link to "Craft Minimal Bug Report" blogpost (pandas-dev#20431)
  BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399)
  BUG: ExtensionArray.fillna for scalar values (pandas-dev#20412)
  DOC" update the Pandas core window rolling count docstring" (pandas-dev#20264)
  DOC: update the pandas.DataFrame.plot.hist docstring (pandas-dev#20155)
  DOC: Only use ~ in class links to hide prefixes. (pandas-dev#20402)
  Bug: Allow np.timedelta64 objects to index TimedeltaIndex (pandas-dev#20408)
  DOC: add disallowing of Series construction of len-1 list with index to whatsnew (pandas-dev#20392)
  MAINT: Remove weird pd file
  DOC: update the Index.isin docstring (pandas-dev#20249)
  BUG: Handle all-NA blocks in concat (pandas-dev#20382)
  DOC: update the pandas.core.resample.Resampler.fillna docstring (pandas-dev#20379)
  BUG: Don't raise exceptions splitting a blank string (pandas-dev#20067)
  DOC: update the pandas.DataFrame.cummax docstring (pandas-dev#20336)
  DOC: update the pandas.core.window.x.mean docstring (pandas-dev#20265)
  DOC: update the api.types.is_number docstring (pandas-dev#20196)
  Fix linter (pandas-dev#20389)
  DOC: Improved the docstring of pandas.Series.dt.to_pytimedelta (pandas-dev#20142)
  DOC: update the pandas.Series.dt.is_month_end docstring (pandas-dev#20181)
  DOC: update the window.Rolling.min docstring (pandas-dev#20263)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants