
ENH: Optionally pass dtypes as a dict into json_normalize #33414

Open
jcezarms opened this issue Apr 9, 2020 · 3 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions) · Enhancement · IO JSON (read_json, to_json, json_normalize)

Comments


jcezarms commented Apr 9, 2020

Is your feature request related to a problem?

In some of my projects, the data I need to load as DataFrame comes in a json list.

We switched to json_normalize for its ease of use, but ran into memory/performance issues: by default this method gives us mostly object, int64 and float64 column dtypes - the most memory-demanding dtypes in their respective categories.

Describe the solution you'd like

Depends on #4464.
I envision passing dtype as a dict, so for a format like

json = [{
    "nest": {
        "int": 10,
        "float": 2.5,
        "str": "category_one"
    }
}]

Normalizing would look like

dtype = {'nest.int': np.int8, 'nest.float': np.float32, 'nest.str': 'category'}
df = pd.json_normalize(json, dtype=dtype)

Following the conclusion of #4464, the dtype arg would be passed to the DataFrame constructor within json_normalize.

API breaking implications

As of now, the code itself mentions a problem regarding metadata field typing.

I didn't dive in deep enough to determine how to deal with this. I see no breaking change coming from this feature, but if I understand correctly, json_normalize will always override metadata field types with object.

If so the condition could be changed to

if k in result and k not in dtype.keys(): 

Describe alternatives you've considered

I've made a separate module which runs json_normalize and then overrides the resulting DataFrame's dtypes dynamically through astype and apply.
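For reference, a minimal sketch of that kind of workaround (the helper name is hypothetical): normalize first, then cast whichever of the requested columns actually exist.

```python
import numpy as np
import pandas as pd


def normalize_with_dtypes(data, dtypes):
    """Hypothetical helper: json_normalize, then cast requested columns."""
    df = pd.json_normalize(data)
    # Only cast columns that are actually present after flattening.
    return df.astype({k: v for k, v in dtypes.items() if k in df.columns})


data = [{"nest": {"int": 10, "float": 2.5, "str": "category_one"}}]
df = normalize_with_dtypes(
    data,
    {"nest.int": np.int8, "nest.float": np.float32, "nest.str": "category"},
)
```

Note this still pays the cost of building the wide-dtype intermediate DataFrame before the cast, which is exactly what the proposed `dtype=` argument would avoid.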

@jcezarms jcezarms added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 9, 2020
mproszewska added a commit to mproszewska/pandas that referenced this issue Apr 24, 2020
@mproszewska (Contributor)

I attempted to write code for this and referenced this issue in the commit.
I changed how values are loaded into the dictionary, so that values in the dictionary already have the desired dtype where possible ("category" is kept as str). However, after passing this to the DataFrame constructor, the dtype is still inferred, and np.int8 becomes int64, etc. Calling astype after creating the DataFrame with the wrong types won't solve the memory-usage problem.
Maybe creating each Series separately would be a better approach? Or maybe someone else has a better idea?
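A rough sketch of that per-Series idea, assuming a flat list of records (the record and dtype names here are illustrative): each column is built as its own Series so the dtype is applied at construction time rather than inferred as int64/float64.

```python
import numpy as np
import pandas as pd

records = [{"a": 1, "b": 2.5}, {"a": 2, "b": 3.5}]
dtype = {"a": np.int8, "b": np.float32}

# Construct one Series per column with its requested dtype;
# dtype.get(key) is None for unlisted columns, falling back to inference.
columns = {
    key: pd.Series([r[key] for r in records], dtype=dtype.get(key))
    for key in records[0]
}
df = pd.DataFrame(columns)
```

Since the DataFrame is assembled from already-typed Series, no post-hoc astype pass (and no wide-dtype intermediate) is needed.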

@jbrockmendel jbrockmendel added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2020
@tyler555g

Is there a workaround for this? I would prefer to just have everything stay as string. This causes issues when trying to interpret data where leading zeroes are important.
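One possible workaround for the everything-as-string case: the numeric coercion happens when the JSON text is parsed, before json_normalize ever sees it, and json.loads accepts parse_int/parse_float hooks. A sketch (not an officially supported json_normalize path):

```python
import json

import pandas as pd

raw = '[{"id": 7, "code": "00123", "ratio": 0.25}]'

# Keep JSON numbers as strings at parse time, so pandas never sees
# ints/floats and every column stays object dtype holding strings.
data = json.loads(raw, parse_int=str, parse_float=str)
df = pd.json_normalize(data)
```

Values that are already JSON strings (like "00123") are untouched either way, so leading zeroes survive.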


claresloggett commented Aug 2, 2024

We've encountered this issue too: reading in JSON that contains integer values, with some missing, results in the ints being forced to floats (since there is no NaN for ints) and Pandas rendering them like 5.0 instead of 5.

Usually, when creating a dataframe, this can be prevented by setting dtype=object, but since json_normalize() doesn't pass dtype down, we can't currently prevent this behaviour when using json_normalize(). Converting the type afterwards isn't a fix for this problem.

Interestingly, in the very first example in the json_normalize() docs at https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html, this issue is visible - in the little example data with family names, data.id has been coerced to a float and the IDs are presented as 1.0, NaN, 2.0, which is likely not what would be wanted!
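As a partial mitigation for the missing-int case described above: the nullable Int64 extension dtype can hold missing values, so a post-hoc cast restores integer display (though, as noted, it doesn't avoid the intermediate float round-trip, which is lossy for very large integers). A sketch:

```python
import pandas as pd

data = [{"id": 1}, {"name": "x"}, {"id": 2}]
df = pd.json_normalize(data)
# At this point df["id"] is float64: 1.0, NaN, 2.0

# Cast to the nullable integer dtype; NaN becomes pd.NA and the
# values render as 1, <NA>, 2 instead of 1.0, NaN, 2.0.
df["id"] = df["id"].astype("Int64")
```

This is a display/semantics fix only; the underlying request remains to pass the dtype down so the float coercion never happens.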
