More efficient nested features encoding #3124
@lhoestq @albertvillanova @mariosasko
Nice! Good idea indeed :)
Could you mention this optimization in the docstring of encode_nested_example?
Thanks, done!
Amazing, thanks!
Nested encoding of features wastes a lot of time on operations that effectively do nothing when lists are used.
For example, if the input contains a list of integers, encode_nested_example will iterate over it and apply encode_nested_example to every element, even though it just returns each int as is.
A similar issue is handled at an earlier stage, when casting PyTorch/TensorFlow/pandas objects to Python lists/NumPy arrays:
datasets/src/datasets/features/features.py, lines 149 to 156 in c98c23c
datasets/src/datasets/features/features.py, lines 212 to 228 in c98c23c
In this pull request I suggest using the same approach in encode_nested_example.
In my setup this change gave a major speedup: loading the data was at least 4x faster.
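To illustrate the idea, here is a minimal sketch of the shortcut, not the actual code from this pull request. The names `encode_leaf`, `encode_naive` and `encode_fast` are hypothetical, and the sketch assumes each list is homogeneous, so that checking the first element is enough to tell whether encoding changes anything.

```python
from typing import Any


def encode_leaf(value: Any) -> Any:
    """Hypothetical leaf encoder: only bytes are changed, everything else passes through."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def encode_naive(obj: Any) -> Any:
    """Recurse into every element, even when encoding is a no-op."""
    if isinstance(obj, list):
        return [encode_naive(item) for item in obj]
    return encode_leaf(obj)


def encode_fast(obj: Any) -> Any:
    """Encode the first element only; skip the rest of the list if it came back unchanged.

    Assumption: all elements of a list have the same type, so the first
    element is representative of the whole list.
    """
    if isinstance(obj, list):
        if not obj:
            return obj
        first = obj[0]
        if encode_fast(first) != first:
            # Encoding changes the elements, so every one must be processed.
            return [encode_fast(item) for item in obj]
        # Encoding is a no-op for this list: return it without iterating.
        return obj
    return encode_leaf(obj)
```

For a long list of integers, `encode_fast` does one encoding call and one comparison instead of one call per element, which is where the saving would come from. The trade-off is that a list with mixed element types would not be handled correctly by this sketch.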