Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gzip for to_json #3492

Merged
merged 1 commit into from
Jan 5, 2022
Merged

Add gzip for to_json #3492

merged 1 commit into from
Jan 5, 2022

Conversation

bhavitvyamalik
Copy link
Contributor

(Partially) closes #3480. I have added gzip compression for to_json. I realised we can run into this compression problem with to_csv as well. IOHandler can be used for to_csv too. Please let me know if any changes are required.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thank you !

src/datasets/io/json.py Show resolved Hide resolved
import gzip


class IOHandler:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a naming suggestion, feel free to ignore if IOHandler sounds better to you

Suggested change
class IOHandler:
class open_compressed:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think this could simply be a function that returns the output of open or gzip.open, what do you think ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this but I think we'll need to close file too after writing which we aren't doing in json.py hence __exit__ thing in context manager

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOHandler is not just opening compressed files but uncompressed as well :p

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bhavitvyamalik!!! :)

Yes, indeed one of the advantages of handling directly the compression parameter ourselves, is that we can use it for other formats, like CSV or text (and maybe other future supported formats).

src/datasets/io/json.py Show resolved Hide resolved
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks all good ! Thanks a lot @bhavitvyamalik :)

@lhoestq lhoestq merged commit 9f643b9 into huggingface:master Jan 5, 2022
@bhavitvyamalik bhavitvyamalik deleted the gzip branch July 10, 2022 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the compression format requested when saving a dataset in json format is not respected
3 participants