Fix methods using `IterableDataset.map` that lead to `features=None` #5287
The documentation is not available anymore as the PR was closed or merged.
Maybe other options are:
Same fix as previously done with `IterableDataset.rename_column/s`, which was setting the `features=None` at the end
Hi @lhoestq, something that's still not clear to me is: should we infer the features always when applying a […] In this PR I'm using […] Thanks in advance for your help!
Also, the PR still has some more work to do, but probably the most relevant thing to fix right now is that the […]
Note that the assertions are based on the `Features` inference being done over a batch when `features=None`, which may not be the ideal scenario; TBD in the PR
Cool thank you ! Resolving the features can be expensive sometimes, so maybe we don't resolve the features and we can just rename/remove columns if the features are known (i.e. if they're not None). What do you think ?
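The suggestion above (update the features only when they are already known, and never resolve them otherwise) could look roughly like the sketch below. It models features as a plain `dict` of column name to type name; `rename_features` and `remove_features` are hypothetical helper names used for illustration only, not functions from the `datasets` library.

```python
from typing import Dict, List, Optional

# Toy representation (assumption): features are a dict of column name -> type name.
ToyFeatures = Optional[Dict[str, str]]


def rename_features(features: ToyFeatures, mapping: Dict[str, str]) -> ToyFeatures:
    """Rename columns in known features; leave unknown features (None) untouched."""
    if features is None:
        # Nothing is resolved here: unknown features stay unknown.
        return None
    return {mapping.get(name, name): dtype for name, dtype in features.items()}


def remove_features(features: ToyFeatures, column_names: List[str]) -> ToyFeatures:
    """Drop columns from known features; leave unknown features (None) untouched."""
    if features is None:
        return None
    return {name: dtype for name, dtype in features.items() if name not in column_names}


known = {"text": "string", "label": "int64"}
print(rename_features(known, {"text": "sentence"}))  # {'sentence': 'string', 'label': 'int64'}
print(remove_features(known, ["label"]))             # {'text': 'string'}
print(rename_features(None, {"text": "sentence"}))   # None
```

The point of the `None` short-circuit is that no example is ever read just to work out the schema, which is the cost the comment above is trying to avoid.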
Thanks for the feedback! Makes sense to me 👍🏻 I'll commit the comments now!
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
Awesome thank you !
I'll keep #3888 open since it's not fully completed
Currently, `IterableDataset.map` sets `info.features` to `None` every time, as we don't know the output of the dataset in advance. As a result, `IterableDataset` methods such as `rename_column`, `rename_columns`, and `remove_columns`, which internally use `map`, lead to the features being `None`.

This PR is related to #3888, #5245, and #5284
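To make the problem concrete, here is a minimal toy model of the behaviour described above. `ToyIterableDataset` is a hypothetical stand-in written for this illustration, not the real `IterableDataset` class, and it represents features as a plain `dict` of column name to type name.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


class ToyIterableDataset:
    """Hypothetical stand-in for a lazily evaluated dataset (not the real class)."""

    def __init__(self, examples: Iterable[dict], features: Optional[Dict[str, str]] = None):
        self._examples = examples
        self.features = features  # column name -> type name, or None if unknown

    def __iter__(self) -> Iterator[dict]:
        return iter(self._examples)

    def map(self, function: Callable[[dict], dict]) -> "ToyIterableDataset":
        # The output schema of an arbitrary function is unknown until it runs,
        # so the mapped dataset is created without features.
        return ToyIterableDataset((function(ex) for ex in self._examples), features=None)

    def rename_column(self, old: str, new: str) -> "ToyIterableDataset":
        # Built on top of map, so the known features are lost as a side effect.
        return self.map(lambda ex: {(new if k == old else k): v for k, v in ex.items()})


ds = ToyIterableDataset(
    [{"text": "hi", "label": 1}],
    features={"text": "string", "label": "int64"},
)
renamed = ds.rename_column("text", "sentence")
print(renamed.features)     # None, even though the features were known before
print(next(iter(renamed)))  # {'sentence': 'hi', 'label': 1}
```

The data itself is renamed correctly; only the schema information is dropped, which is what this PR sets out to fix.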
✅ Current solution
The code in this PR makes sure that if the features were known from the beginning and a `rename_column`/`rename_columns` happens, they are kept and the rename is applied to the `Features` too. Also, if the features were not there before applying `rename_column`, `rename_columns` or `remove_columns`, a batch is prefetched and the features are inferred (that could potentially be part of `IterableDataset.__init__` in case the `info.features` value is `None`).

💡 Ideas
Some ideas were proposed in #3888, but probably the most consistent solution, even though it may take some time, is to actually do the type inference during `IterableDataset.__init__` in case the provided `info.features` is `None`; otherwise, we can just use the provided features.

Additionally, as mentioned in #3888, we could also include a `features` parameter in the `map` function, but that's probably more tedious.

Also thanks to @lhoestq for sharing some ideas in both #3888 and #5245 🤗