Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid PyArrow type optimization if it fails #3234

Merged
merged 10 commits into from
Nov 10, 2021
Merged

Avoid PyArrow type optimization if it fails #3234

merged 10 commits into from
Nov 10, 2021

Conversation

mariosasko
Copy link
Collaborator

@mariosasko mariosasko commented Nov 8, 2021

Adds a new variable, DISABLE_PYARROW_TYPES_OPTIMIZATION, to config.py for easier control of the Arrow type optimization.

Fix #2206

@lhoestq
Copy link
Member

lhoestq commented Nov 8, 2021

That's good to have a way to disable this easily :)
I just find it a bit unfortunate that users would have to experience the error once and then do DISABLE_PYARROW_TYPES_OPTIMIZATION=1. Do you know if there's a way to simply fallback on disabling it automatically when it fails ?

@mariosasko
Copy link
Collaborator Author

@lhoestq Actually, I agree a fallback makes more sense. The current approach is not very practical indeed and would require a mention in the docs.

@mariosasko
Copy link
Collaborator Author

mariosasko commented Nov 8, 2021

Replaced the env variable with a fallback!

@mariosasko mariosasko changed the title Add option to disable pyarrow type optimization Avoid PyArrow type optimization if it fails Nov 8, 2021
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job ! I think this could also be part of a documentation page about "Processing text data" in an optimization section cc @stevhliu

src/datasets/arrow_writer.py Show resolved Hide resolved
src/datasets/arrow_writer.py Show resolved Hide resolved
@stevhliu
Copy link
Member

stevhliu commented Nov 9, 2021

Hmm if the fallback automatically happens without the user knowing it, then I don't think we really need to mention it. But if you really wanted to, I think the Improve performance section would be a great place for it!

@lhoestq
Copy link
Member

lhoestq commented Nov 10, 2021

Yea I think this could just end up in a note that says that datasets automatically picks the most optimized integer precision for your tokenized text data to save you disk space. Maybe later if we have a page on text processing we can add this note, but for now I agree it doesn't fit well into the doc.

In particular in the "Improve performance" section we mention what users can do to speed up their computations, while this behavior is just some internal feature that users don't have control over anyway.

@lhoestq lhoestq merged commit 807341d into master Nov 10, 2021
@lhoestq lhoestq deleted the fix-2206 branch November 10, 2021 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Got pyarrow error when loading a dataset while adding special tokens into the tokenizer
3 participants