Fix methods using `IterableDataset.map` that lead to `features=None` #5287

alvarobartt · 2022-11-23T15:33:25Z

IterableDataset.map currently sets info.features to None every time, because the output of the mapped function is not known in advance. As a result, the IterableDataset methods that use map internally, such as rename_column, rename_columns, and remove_columns, also end up with features set to None.

This PR is related to #3888, #5245, and #5284

✅ Current solution

The code in this PR makes sure that, if the features were known from the beginning and a rename_column/rename_columns is applied, the features are kept and the rename is applied to the Features too. If the features were not known before applying rename_column, rename_columns, or remove_columns, a batch is pre-fetched and the features are inferred from it (this could potentially become part of IterableDataset.__init__ for the case where info.features is None).

💡 Ideas

Some ideas were proposed in #3888, but probably the most consistent solution, even though it may take some time, is to infer the features during IterableDataset.__init__ when the provided info.features is None; otherwise, we can just use the provided features.

Additionally, as mentioned at #3888, we could also include a features parameter to the map function, but that's probably more tedious.
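To illustrate the idea of a features parameter on map, here is a minimal sketch using a toy lazy dataset. `LazyDataset` and the plain-dict `Schema` are hypothetical stand-ins written for this example; they are not the real IterableDataset or datasets.Features, and the real signature may differ.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional

Example = Dict[str, object]
Schema = Dict[str, str]  # column name -> dtype name (assumed stand-in for Features)


class LazyDataset:
    """Toy streaming dataset used only for illustration."""

    def __init__(self, source: Callable[[], Iterable[Example]], features: Optional[Schema] = None):
        self._source = source
        self.features = features

    def __iter__(self) -> Iterator[Example]:
        return iter(self._source())

    def map(self, function: Callable[[Example], Example], features: Optional[Schema] = None) -> "LazyDataset":
        # The function runs lazily, so its output schema is unknown unless
        # the caller declares it explicitly through `features`.
        def mapped() -> Iterator[Example]:
            for example in self._source():
                yield function(example)

        return LazyDataset(mapped, features=features)


ds = LazyDataset(lambda: [{"text": "a"}, {"text": "bb"}], features={"text": "string"})
no_schema = ds.map(lambda ex: {"length": len(ex["text"])})
with_schema = ds.map(lambda ex: {"length": len(ex["text"])}, features={"length": "int64"})
```

In this sketch, `no_schema.features` stays None because nothing is iterated, while `with_schema.features` carries the schema the caller supplied.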

Also thanks to @lhoestq for sharing some ideas in both #3888 and #5245 🤗

HuggingFaceDocBuilderDev · 2022-11-23T15:38:32Z

The documentation is not available anymore as the PR was closed or merged.

HuggingFaceDocBuilderDev · 2022-11-23T16:32:58Z

The documentation is not available anymore as the PR was closed or merged.

alvarobartt · 2022-11-24T08:44:02Z

Maybe other options are:

- Keep info.features as None if it was initially None
- Infer the features by pre-fetching only if info.features is None
- If info.features is set, make sure that features is not None after map

alvarobartt · 2022-11-24T16:04:16Z

Hi @lhoestq, something that's still not clear to me: when the features are initially None, should we always infer them when applying a map, or should we assume they are meant to stay None unless the user sets them explicitly (or during iteration)?

In this PR I'm using _infer_features_from_batch (imported from datasets.iterable_dataset) to infer the features when they are None, by pre-fetching a batch with self._head(), but I'm not sure if that's the expected behavior.

Thanks in advance for your help!
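As a rough illustration of what inferring features from a pre-fetched batch involves, the sketch below reads the first few examples of a stream and derives a column-to-type mapping. `infer_schema_from_head` is a hypothetical helper for this example only; it does not reproduce the behavior of _infer_features_from_batch or self._head(), and it uses Python type names rather than real Features.

```python
from itertools import islice
from typing import Dict, Iterable, Optional


def infer_schema_from_head(examples: Iterable[dict], n: int = 5) -> Optional[Dict[str, str]]:
    """Infer a column -> type-name mapping from the first `n` examples."""
    head = list(islice(examples, n))
    if not head:
        return None  # nothing to infer from an empty stream
    schema: Dict[str, str] = {}
    for example in head:
        for column, value in example.items():
            # Keep the first non-None type seen; columns that only
            # contain None in the head remain "null".
            if schema.get(column, "null") == "null":
                schema[column] = "null" if value is None else type(value).__name__
    return schema


rows = [{"id": 1, "label": None}, {"id": 2, "label": "pos"}]
schema = infer_schema_from_head(iter(rows), n=2)
```

Note that the helper has to pull examples from the source before any schema exists, which is the cost being weighed in this discussion.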

alvarobartt · 2022-11-24T16:09:19Z

Also, the PR still needs some more work, but probably the most relevant thing to fix right now is that the features are set to None in IterableDataset.rename_column, IterableDataset.rename_columns, and IterableDataset.remove_columns even when the features originally had a value. Once that's fixed, maybe we can focus on improving map's current behavior, so that this doesn't happen when the user calls map directly rather than through the functions mentioned above.

lhoestq

Cool thank you ! Resolving the features can be expensive sometimes, so maybe we don't resolve the features and we can just rename/remove columns if the features are known (i.e. if they're not None). What do you think ?
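The suggestion above can be sketched as follows: update the schema only when it is already known, and leave it as None otherwise, without reading any data. The helper names and the plain-dict `Schema` are assumptions made for this example, not the code merged in this PR.

```python
from typing import Dict, List, Optional

Schema = Dict[str, str]  # column name -> dtype name (assumed stand-in for Features)


def rename_in_schema(features: Optional[Schema], mapping: Dict[str, str]) -> Optional[Schema]:
    """Apply a column rename to a known schema; unknown stays unknown."""
    if features is None:
        return None  # do not resolve features: that may be expensive
    # Build a new dict so the original schema is untouched and the
    # column order is preserved.
    return {mapping.get(name, name): dtype for name, dtype in features.items()}


def remove_from_schema(features: Optional[Schema], column_names: List[str]) -> Optional[Schema]:
    """Drop columns from a known schema; unknown stays unknown."""
    if features is None:
        return None
    dropped = set(column_names)
    return {name: dtype for name, dtype in features.items() if name not in dropped}


known = {"text": "string", "label": "int64"}
renamed = rename_in_schema(known, {"text": "sentence"})
removed = remove_from_schema(known, ["label"])
still_unknown = rename_in_schema(None, {"text": "sentence"})
```

Under these assumptions, `renamed` keeps the dtypes under the new column name, `removed` only contains the remaining columns, and `still_unknown` is None because no inference is attempted.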

src/datasets/iterable_dataset.py

tests/test_iterable_dataset.py

alvarobartt · 2022-11-28T12:10:26Z

> Cool thank you ! Resolving the features can be expensive sometimes, so maybe we don't resolve the features and we can just rename/remove columns if the features are known (i.e. if they're not None). What do you think ?

Thanks for the feedback! Makes sense to me 👍🏻 I'll commit the comments now!

alvarobartt · 2022-11-28T12:13:18Z

Already done @lhoestq, feel free to merge whenever you want! Also before merging, can you please link the following issues #3888, #5245, and #5284, so that those are closed upon merge? Thanks!

lhoestq

Awesome thank you !

I'll keep #3888 open since it's not fully completed

alvarobartt added 2 commits November 16, 2022 09:33

Fix example as rename_column isn't inplace

f1e1af5

Fix IterableDataset.rename_column/s

fa2f876

alvarobartt added 3 commits November 23, 2022 16:48

Copy self._info.features just if those exist

335823f

Take ds_iterable._head when inferring features

e34c27b

Add regression test to check features is not None

1328ed7

alvarobartt closed this Nov 23, 2022

alvarobartt deleted the rename-column-iterable-ds branch November 23, 2022 16:25

alvarobartt restored the rename-column-iterable-ds branch November 23, 2022 16:26

alvarobartt reopened this Nov 23, 2022

Fix IterableDataset.remove_columns

b658974

Same fix as previously done with `IterableDataset.rename_column/s`, which was setting the `features=None` at the end

Add IterableDataset.remove_columns tests

06fe167

Note that the assertions are based on the `Feature` inference being done over a batch when `features=None` which maybe it's not the ideal scenario, TBD in the PR

alvarobartt changed the title ~~[WIP] Fix methods using IterableDataset.map that lead to features=None~~ Fix methods using IterableDataset.map that lead to features=None Nov 26, 2022

alvarobartt marked this pull request as ready for review November 26, 2022 09:57

alvarobartt mentioned this pull request Nov 26, 2022

Features of IterableDataset set to None by remove column #5284

Closed

lhoestq reviewed Nov 28, 2022

Apply suggestions from code review

7c08b62

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

This was linked to issues Nov 28, 2022

Unable to rename columns in streaming dataset #5245

Closed

Features of IterableDataset set to None by remove column #5284

Closed

lhoestq approved these changes Nov 28, 2022

lhoestq merged commit c0bec7d into huggingface:main Nov 28, 2022

alvarobartt deleted the rename-column-iterable-ds branch November 28, 2022 15:43

alvarobartt mentioned this pull request Nov 29, 2022

Add features param to IterableDataset.map #5311

Merged

Principles0 mentioned this pull request Dec 14, 2022

Using rename_column and remove_column method for a IterableDataset object leads to its feature property become None --- in the Whisper Fine-Tuning Event huggingface/community-events#97

Closed

polinaeterna mentioned this pull request Dec 16, 2022

Wrong dtype for array in audio features #5345

Open
