Avoid PyArrow type optimization if it fails #3234

mariosasko · 2021-11-08T16:10:27Z

Adds a new variable, DISABLE_PYARROW_TYPES_OPTIMIZATION, to config.py for easier control of the Arrow type optimization.

Fix #2206

lhoestq · 2021-11-08T16:14:19Z

It's good to have a way to disable this easily :)
I just find it a bit unfortunate that users would have to hit the error once and then set DISABLE_PYARROW_TYPES_OPTIMIZATION=1. Do you know if there's a way to simply fall back to disabling it automatically when it fails?

mariosasko · 2021-11-08T16:39:06Z

@lhoestq Actually, I agree a fallback makes more sense. The current approach is not very practical indeed and would require a mention in the docs.

mariosasko · 2021-11-08T18:07:20Z

Replaced the env variable with a fallback!
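A minimal sketch of the fallback idea, for readers who want to see its shape. It is not the code from this PR: it uses Python's standard-library `array` module as a stand-in for PyArrow so that it runs without dependencies, and the names `to_int_array`, `OPTIMIZED_TYPECODE`, and `DEFAULT_TYPECODE` are hypothetical.

```python
import array
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for an "optimized" narrow type and a safe default type.
OPTIMIZED_TYPECODE = "b"  # signed 8-bit integers
DEFAULT_TYPECODE = "q"    # signed 64-bit integers


def to_int_array(values, try_optimized=True):
    """Build an integer array, preferring the narrow type when every value fits."""
    if try_optimized:
        try:
            return array.array(OPTIMIZED_TYPECODE, values)
        except OverflowError as err:
            # A value is out of range for the narrow type:
            # log it and fall back instead of surfacing the error to the user.
            logger.info(
                "Type optimization failed (%s); falling back to %r",
                err,
                DEFAULT_TYPECODE,
            )
    return array.array(DEFAULT_TYPECODE, values)


small = to_int_array([1, 2, 3])    # fits in 8 bits, keeps the narrow type
large = to_int_array([1, 2, 300])  # 300 overflows 8 bits, uses the default type
```

The point of the design is that the user never has to see the range error or set a flag: the narrow type is tried first, and the wider one is used only when the data requires it.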

lhoestq

Good job! I think this could also be part of a documentation page about "Processing text data", in an optimization section. cc @stevhliu

src/datasets/arrow_writer.py

stevhliu · 2021-11-09T19:43:01Z

Hmm if the fallback automatically happens without the user knowing it, then I don't think we really need to mention it. But if you really wanted to, I think the Improve performance section would be a great place for it!

lhoestq · 2021-11-10T12:03:48Z

Yeah, I think this could just end up in a note saying that datasets automatically picks the most optimized integer precision for your tokenized text data to save you disk space. Maybe later, if we have a page on text processing, we can add this note, but for now I agree it doesn't fit well into the docs.

In particular in the "Improve performance" section we mention what users can do to speed up their computations, while this behavior is just some internal feature that users don't have control over anyway.
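To make the disk-space point concrete, here is a rough illustration of how a narrower integer width shrinks storage. It is only a sketch under stated assumptions: the helper names are hypothetical, the token ids are made up, and choosing the width from the value range is one possible strategy, not necessarily how datasets selects its types.

```python
def smallest_signed_int_bits(values):
    """Return the narrowest signed width (in bits) that can hold every value."""
    lo, hi = min(values), max(values)
    for bits in (8, 16, 32, 64):
        if -(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1:
            return bits
    raise OverflowError("values do not fit in a signed 64-bit integer")


def storage_bytes(num_values, bits):
    """Raw size of num_values integers stored at the given width."""
    return num_values * bits // 8


# Made-up tokenizer output for illustration.
attention_mask = [1, 1, 1, 0, 0]
token_ids = [101, 7592, 2088, 102, 0]

mask_bits = smallest_signed_int_bits(attention_mask)
ids_bits = smallest_signed_int_bits(token_ids)

# One million values at the chosen width versus a 64-bit default.
optimized_size = storage_bytes(1_000_000, ids_bits)
default_size = storage_bytes(1_000_000, 64)
```

With these sample values the mask fits in 8 bits and the token ids in 16 bits, so the raw storage for the ids is a quarter of what 64-bit integers would need.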

mariosasko added 4 commits November 8, 2021 02:40

Add option to disable type optimization

daaa0de

Add a test

e8f6fae

Add DISABLE prefix

fd3158d

Style

b2e1cc5

mariosasko added 5 commits November 8, 2021 18:05

Revert changes

65e80cb

Remove col in TypedSequence

9df8311

Add fallback in case of range error

f31593b

Add test

a5c3d8c

Fix

81e6bbe

mariosasko changed the title ~~Add option to disable pyarrow type optimization~~ Avoid PyArrow type optimization if it fails Nov 8, 2021

lhoestq approved these changes Nov 9, 2021


Log info message

f853475

lhoestq merged commit 807341d into master Nov 10, 2021

lhoestq deleted the fix-2206 branch November 10, 2021 12:04
