Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix methods using IterableDataset.map that lead to features=None #5287

Merged
merged 8 commits into from
Nov 28, 2022
Merged

Fix methods using IterableDataset.map that lead to features=None #5287

merged 8 commits into from
Nov 28, 2022

Conversation

alvarobartt
Copy link
Member

@alvarobartt alvarobartt commented Nov 23, 2022

As currently IterableDataset.map is setting the info.features to None every time as we don't know the output of the dataset in advance, IterableDataset methods such as rename_column, rename_columns, and remove_columns. that internally use map lead to the features being None.

This PR is related to #3888, #5245, and #5284

✅ Current solution

The code in this PR is basically making sure that if the features were there since the beginning and a rename_column/rename_columns happens, those are kept and the rename is applied to the Features too. Also, if the features were not there before applying rename_column, rename_columns or remove_columns, a batch is prefetched and the features are being inferred (that could potentially be part of IterableDataset.__init__ in case the info.features value is None).

💡 Ideas

Some ideas were proposed in #3888, but probably the most consistent solution even though it may take some time is to actually do the type inferencing during the IterableDataset.__init__ in case the provided info.features is None, otherwise, we can just use the provided features.

Additionally, as mentioned at #3888, we could also include a features parameter to the map function, but that's probably more tedious.

Also thanks to @lhoestq for sharing some ideas in both #3888 and #5245 🤗

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@alvarobartt alvarobartt deleted the rename-column-iterable-ds branch November 23, 2022 16:25
@alvarobartt alvarobartt restored the rename-column-iterable-ds branch November 23, 2022 16:26
@alvarobartt alvarobartt reopened this Nov 23, 2022
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 23, 2022

The documentation is not available anymore as the PR was closed or merged.

@alvarobartt
Copy link
Member Author

Maybe other options are:

  • Keep the info.features to None if those were initially None
  • Infer the features with pre-fetching just if the info.features is None
  • If the info.features are there, make sure that after map features is not None

Same fix as previously done with `IterableDataset.rename_column/s`, which was setting the `features=None` at the end
@alvarobartt
Copy link
Member Author

Hi @lhoestq something that's still not clear to me is: should we infer the features always when applying a map if those are initially None, or just assume that if the features are initially None those should be left that way unless the user specifically sets those (or during iter)?

In this PR I'm using from datasets.iterable_dataset import _infer_features_from_batch to infer the features when those are None using pre-fetch of self._head(), but I'm not sure if that's the expected behavior.

Thanks in advance for your help!

@alvarobartt
Copy link
Member Author

Also, the PR still has some more work to do, but probably the most relevant thing to fix right now is that the features are being set to None in the functions IterableDataset.rename_column, IterableDataset.rename_columns, and IterableDataset.remove_columns when the features originally had a value. So once that's fixed maybe we can focus on improving the current map's behavior, so as to avoid this from happening also when the user uses map directly and not through the functions mentioned above.

Note that the assertions are based on the `Feature` inference being done over a batch when `features=None` which maybe it's not the ideal scenario, TBD in the PR
@alvarobartt alvarobartt changed the title [WIP] Fix methods using IterableDataset.map that lead to features=None Fix methods using IterableDataset.map that lead to features=None Nov 26, 2022
@alvarobartt alvarobartt marked this pull request as ready for review November 26, 2022 09:57
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool thank you ! Resolving the features can be expensive sometimes, so maybe we don't resolve the features and we can just rename/remove columns if the features are known (i.e. if they're not None). What do you think ?

src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved
src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved
src/datasets/iterable_dataset.py Outdated Show resolved Hide resolved
tests/test_iterable_dataset.py Show resolved Hide resolved
tests/test_iterable_dataset.py Show resolved Hide resolved
tests/test_iterable_dataset.py Outdated Show resolved Hide resolved
tests/test_iterable_dataset.py Show resolved Hide resolved
@alvarobartt
Copy link
Member Author

Cool thank you ! Resolving the features can be expensive sometimes, so maybe we don't resolve the features and we can just rename/remove columns if the features are known (i.e. if they're not None). What do you think ?

Thanks for the feedback! Makes sense to me 👍🏻 I'll commit the comments now!

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
@alvarobartt
Copy link
Member Author

Already done @lhoestq, feel free to merge whenever you want! Also before merging, can you please link the following issues #3888, #5245, and #5284, so that those are closed upon merge? Thanks!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you !

I'll keep #3888 open since it's not fully completed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Features of IterableDataset set to None by remove column Unable to rename columns in streaming dataset
3 participants