Allow JSON encoder to handle ndarray #777

srowen · 2024-09-07T17:41:57Z

Description of changes:

The JSON encoder encodes a variety of data types via json.dumps, including primitives, lists, dicts, etc.
However, json.dumps doesn't work with ndarrays; it won't serialize them.
Conceptually, there isn't a strong reason to refuse to encode input as JSON just because it is an ndarray rather than a list, since even multi-dimensional ndarrays can be converted to supported Python lists with .tolist().

This change simply has the encoder call .tolist() on its arg if passed an ndarray, and the rest works as expected.
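As a rough illustration of the idea (not the actual code in streaming/base/format/mds/encodings.py), a minimal sketch might look like the following. The function names `encode_json` and `decode_json` are hypothetical stand-ins for the encoder's encode and decode steps.

```python
import json
from typing import Any

import numpy as np


def encode_json(obj: Any) -> bytes:
    """Hypothetical stand-in for the JSON encoder's encode step."""
    # An ndarray of any dimensionality becomes nested Python lists,
    # and its NumPy scalars become plain Python ints/floats.
    if isinstance(obj, np.ndarray):
        obj = obj.tolist()
    return json.dumps(obj).encode('utf-8')


def decode_json(data: bytes) -> Any:
    """Hypothetical stand-in for the decode step (plain json.loads)."""
    return json.loads(data.decode('utf-8'))


encoded = encode_json(np.array([[1, 2], [3, 4]]))
print(encoded)               # b'[[1, 2], [3, 4]]'
print(decode_json(encoded))  # [[1, 2], [3, 4]]
```

Note that in this sketch only a top-level ndarray is converted, and decoding returns Python lists rather than an ndarray, since JSON has no array type of its own.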

This is relevant in practice because array-type data serialized from Spark to pandas arrives as ndarrays (the conversion is based on Arrow), and the result can't be passed to dataframe_to_mds even if the output type is given as 'json'. It's possible to work around this by passing a transformed version of the pandas DF to this function, but it seemed simple and natural to support this directly.
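For reference, the workaround mentioned above could be sketched roughly as follows. The helper name `ndarray_columns_to_lists` and the sample frame are hypothetical; the frame only imitates what a Spark-to-pandas conversion of an array column might produce.

```python
import json

import numpy as np
import pandas as pd


def ndarray_columns_to_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ndarray cell values replaced by Python lists."""
    out = df.copy()
    for col in out.columns:
        # Only object-dtype columns can hold ndarray cells.
        if out[col].dtype == object:
            out[col] = out[col].map(
                lambda v: v.tolist() if isinstance(v, np.ndarray) else v)
    return out


# Assumed stand-in for a pandas DF with an array column of ndarray cells.
df = pd.DataFrame({
    'id': [1, 2],
    'tokens': [np.array([1, 2, 3]), np.array([4, 5])],
})
clean = ndarray_columns_to_lists(df)

# Each cell is now a plain list, which json.dumps accepts.
print(json.dumps(clean['tokens'][0]))
```

With the change in this PR, this extra pass should no longer be needed for values handed to the JSON encoder as ndarrays.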

Merge Checklist:

General

I have read the contributor guidelines
This is a documentation change or typo fix. If so, skip the rest of this checklist.
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the MosaicML team.
I have updated any necessary documentation, including README and API docs (if appropriate).

Tests

I ran pre-commit on my change. (check out the pre-commit section of prerequisites)
I have added tests that prove my fix is effective or that my feature works (if appropriate).
I ran the tests locally to make sure they pass. (check out testing)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes.

snarayan21

why not just use the ndarray encoding type?

streaming/base/format/mds/encodings.py

snarayan21

Lgtm

Allow JSON encoder to handle ndarray

05395eb

srowen added the enhancement label Sep 7, 2024

snarayan21 reviewed Sep 9, 2024

streaming/base/format/mds/encodings.py (resolved)

snarayan21 approved these changes Sep 9, 2024

snarayan21 added 2 commits September 9, 2024 10:55

Merge branch 'main' into ndarray_json

d7b5167

Merge branch 'main' into ndarray_json

aab9000

snarayan21 enabled auto-merge (squash) September 9, 2024 18:18

snarayan21 merged commit 06fd29f into mosaicml:main Sep 9, 2024
7 checks passed

srowen deleted the ndarray_json branch September 9, 2024 18:34
