
ENH: Optionally pass dtypes as a dict into json_normalize #33414

Open
jcezarms opened this issue Apr 9, 2020 · 3 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions) · Enhancement · IO JSON (read_json, to_json, json_normalize)

Comments


jcezarms commented Apr 9, 2020

Is your feature request related to a problem?

In some of my projects, the data I need to load as DataFrame comes in a json list.

We switched to json_normalize for its ease of use, but ran into memory/performance issues: by default this method gives us mostly object, int64 and float64 column dtypes - the most memory-demanding dtypes in their respective categories.

Describe the solution you'd like

Depends on #4464.
I envision passing dtype as a dict, so for a format like

json = [{
    "nest": {
        "int": 10,
        "float": 2.5,
        "str": "category_one"
    }
}]

Normalizing would look like

dtype = {'nest.int': np.int8, 'nest.float': np.float32, 'nest.str': 'category'}
df = pd.json_normalize(json, dtype=dtype)

Following the conclusion of #4464, the dtype arg would be passed to the DataFrame constructor within json_normalize.

API breaking implications

As of now, the code itself mentions a problem regarding metadata field typing.

I didn't dive in deep enough to determine how to deal with this. I see no breaking change coming from this feature, but if I understand correctly, json_normalize will always override metadata field types with object.

If so the condition could be changed to

if k in result and k not in dtype.keys(): 

Describe alternatives you've considered

I've made a separate module which runs json_normalize and then overrides the resulting DataFrame's dtypes dynamically through astype and apply.
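For reference, a minimal sketch of that kind of workaround (the helper name is hypothetical): normalize first, then cast whichever of the requested columns actually exist.

```python
import numpy as np
import pandas as pd


def normalize_with_dtypes(data, dtypes):
    """Hypothetical helper: json_normalize, then cast requested columns."""
    df = pd.json_normalize(data)
    # Only cast columns that are actually present after flattening.
    return df.astype({k: v for k, v in dtypes.items() if k in df.columns})


data = [{"nest": {"int": 10, "float": 2.5, "str": "category_one"}}]
df = normalize_with_dtypes(
    data,
    {"nest.int": np.int8, "nest.float": np.float32, "nest.str": "category"},
)
```

Note this still pays the cost of building the wide-dtype intermediate DataFrame before the cast, which is exactly what the proposed `dtype=` argument would avoid.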

@jcezarms jcezarms added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2020
mproszewska added a commit to mproszewska/pandas that referenced this issue Apr 24, 2020
@mproszewska (Contributor)

I attempted to write code for this and referenced this issue in the commit.
I changed how values are loaded into the dictionary, so that values in the dictionary already have the desired dtype where possible ("category" is kept as str). However, after passing this to the DataFrame constructor, the dtype is still inferred, and np.int8 becomes int64, etc. Calling astype after creating the DataFrame with the wrong types won't solve the memory-usage problem.
Maybe creating each Series separately would be a better approach? Or maybe someone else has a better idea?
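A rough sketch of that per-Series idea, assuming a flat list of records (the record and dtype names here are illustrative): each column is built as its own Series so the dtype is applied at construction time rather than inferred as int64/float64.

```python
import numpy as np
import pandas as pd

records = [{"a": 1, "b": 2.5}, {"a": 2, "b": 3.5}]
dtype = {"a": np.int8, "b": np.float32}

# Construct one Series per column with its requested dtype;
# dtype.get(key) is None for unlisted columns, falling back to inference.
columns = {
    key: pd.Series([r[key] for r in records], dtype=dtype.get(key))
    for key in records[0]
}
df = pd.DataFrame(columns)
```

Since the DataFrame is assembled from already-typed Series, no post-hoc astype pass (and no wide-dtype intermediate) is needed.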

@jbrockmendel jbrockmendel added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2020
@tyler555g

Is there a workaround for this? I would prefer to just have everything stay as string. This causes issues when trying to interpret data where leading zeroes are important.
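One possible workaround for the everything-as-string case: the numeric coercion happens when the JSON text is parsed, before json_normalize ever sees it, and json.loads accepts parse_int/parse_float hooks. A sketch (not an officially supported json_normalize path):

```python
import json

import pandas as pd

raw = '[{"id": 7, "code": "00123", "ratio": 0.25}]'

# Keep JSON numbers as strings at parse time, so pandas never sees
# ints/floats and every column stays object dtype holding strings.
data = json.loads(raw, parse_int=str, parse_float=str)
df = pd.json_normalize(data)
```

Values that are already JSON strings (like "00123") are untouched either way, so leading zeroes survive.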


claresloggett commented Aug 2, 2024

We've encountered this issue too: reading in JSON that contains integer values, with some missing, results in the ints being forced to floats (since there is no NaN for ints) and Pandas rendering them like 5.0 instead of 5.

Usually, when creating a dataframe, this can be prevented by setting dtype=object, but since json_normalize() doesn't pass dtype down, we can't currently prevent this behaviour when using json_normalize(). Converting the type afterwards isn't a fix for this problem.

Interestingly, in the very first example in the json_normalize() docs at https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html, this issue is visible - in the little example data with family names, data.id has been coerced to a float and the IDs are presented as 1.0, NaN, 2.0, which is likely not what would be wanted!
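As a partial mitigation for the missing-int case described above: the nullable Int64 extension dtype can hold missing values, so a post-hoc cast restores integer display (though, as noted, it doesn't avoid the intermediate float round-trip, which is lossy for very large integers). A sketch:

```python
import pandas as pd

data = [{"id": 1}, {"name": "x"}, {"id": 2}]
df = pd.json_normalize(data)
# At this point df["id"] is float64: 1.0, NaN, 2.0

# Cast to the nullable integer dtype; NaN becomes pd.NA and the
# values render as 1, <NA>, 2 instead of 1.0, NaN, 2.0.
df["id"] = df["id"].astype("Int64")
```

This is a display/semantics fix only; the underlying request remains to pass the dtype down so the float coercion never happens.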
