More efficient nested features encoding #3124

Merged (8 commits) on Nov 2, 2021

eladsegal
Contributor

Nested encoding of features spends a lot of time on operations that are effectively no-ops when lists are used.
For example, if the input contains a list of integers, encode_nested_example iterates over it and calls itself on every element, even though each call just returns the int as is.
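
To make the overhead concrete, here is a minimal sketch of that pattern. `naive_encode` and `CALLS` are hypothetical names used only for illustration; this is not the library's actual function, just a toy with the same per-element recursion.

```python
from typing import Any

CALLS = 0  # hypothetical counter, only here to make the overhead visible


def naive_encode(obj: Any) -> Any:
    """Toy recursive encoder: recurse into lists, return leaves as is."""
    global CALLS
    CALLS += 1
    if isinstance(obj, (list, tuple)):
        # One Python-level call per element, even when nothing changes.
        return [naive_encode(elmt) for elmt in obj]
    return obj  # ints need no encoding, so this branch does no useful work


token_ids = list(range(100_000))
encoded = naive_encode(token_ids)
```

In this sketch the output is equal to the input, yet one call is made for the list plus one for each of its elements.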

A similar issue is handled at an earlier stage when casting pytorch/tensorflow/pandas objects to python lists/numpy arrays:

def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool) -> Tuple[Any, bool]:
    """
    Cast pytorch/tensorflow/pandas objects to python numpy array/lists.
    It works recursively.
    To avoid iterating over possibly long lists, it first checks if the first element that is not None has to be casted.
    If the first element needs to be casted, then all the elements of the list will be casted, otherwise they'll stay the same.
    This trick allows to cast objects that contain tokenizers outputs without iterating over every single token for example.
    ...
    elif isinstance(obj, (list, tuple)):
        if len(obj) > 0:
            for first_elmt in obj:
                if first_elmt is not None:
                    break
            casted_first_elmt, has_changed_first_elmt = _cast_to_python_objects(
                first_elmt, only_1d_for_numpy=only_1d_for_numpy
            )
            if has_changed_first_elmt:
                return [_cast_to_python_objects(elmt, only_1d_for_numpy=only_1d_for_numpy)[0] for elmt in obj], True
            else:
                if isinstance(obj, list):
                    return obj, False
                else:
                    return list(obj), True
        else:
            return obj if isinstance(obj, list) else [], isinstance(obj, tuple)
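
For readers who want to try the idea in isolation, here is a self-contained toy version of the same `(value, has_changed)` convention. It only converts tuples to lists; `cast_tuples_to_lists` is a hypothetical name and the function is a simplified sketch, not the library code quoted above.

```python
from typing import Any, Tuple


def cast_tuples_to_lists(obj: Any) -> Tuple[Any, bool]:
    """Return (casted_obj, has_changed), converting tuples to lists recursively."""
    if isinstance(obj, (list, tuple)):
        # First element that is not None (None if there is no such element).
        first_elmt = next((elmt for elmt in obj if elmt is not None), None)
        _, has_changed_first_elmt = cast_tuples_to_lists(first_elmt)
        if has_changed_first_elmt:
            # The first element needed casting, so cast every element.
            return [cast_tuples_to_lists(elmt)[0] for elmt in obj], True
        # Otherwise assume no element needs casting; only fix the container type.
        return (obj, False) if isinstance(obj, list) else (list(obj), True)
    return obj, False


flat, flat_changed = cast_tuples_to_lists([1, 2, 3])
nested, nested_changed = cast_tuples_to_lists([None, (1, 2), (3, 4)])
mixed, mixed_changed = cast_tuples_to_lists([1, (2, 3)])
```

Note the trade-off: the shortcut assumes the list is homogeneous. In the `mixed` case the first element needs no casting, so the tuple later in the list is left untouched.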

In this pull request I suggest using the same approach in encode_nested_example.
In my setup this change gave a major speedup: loading the data was at least 4x faster.
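
A rough sketch of what this could look like for the list case follows. `encode_list`, `identity`, `encode_label` and `label2id` are hypothetical names for illustration, and the code assumes elements can be compared with `!=`; it is not the exact diff of this pull request.

```python
from typing import Any, Callable, List, Optional


def encode_list(
    values: Optional[List[Any]], encode_elmt: Callable[[Any], Any]
) -> Optional[List[Any]]:
    """Encode a list, skipping the per-element pass when encoding
    leaves the first non-None element unchanged."""
    if values is None:
        return None
    first_elmt = next((v for v in values if v is not None), None)
    if first_elmt is not None and encode_elmt(first_elmt) != first_elmt:
        # Encoding changes the first element, so encode every element.
        return [None if v is None else encode_elmt(v) for v in values]
    return list(values)  # shallow copy, no per-element Python-level calls


calls = 0  # hypothetical counter to show how often the encoder runs


def identity(x: Any) -> Any:
    """No-op element encoder, e.g. for a plain integer feature."""
    global calls
    calls += 1
    return x


label2id = {"neg": 0, "pos": 1}  # made-up label mapping


def encode_label(x: str) -> int:
    """Element encoder that does change its input."""
    return label2id[x]


ids = encode_list(list(range(100_000)), identity)
labels = encode_list(["pos", None, "neg"], encode_label)
```

With the no-op encoder, only the first element is probed and the rest of the list is copied as is; with the label encoder, the probe detects a change and every element is encoded.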

@eladsegal eladsegal marked this pull request as draft October 21, 2021 02:03
@eladsegal eladsegal marked this pull request as ready for review October 29, 2021 14:06
@eladsegal
Contributor Author

@lhoestq @albertvillanova @mariosasko
Can you please check this out?

Member

@lhoestq lhoestq left a comment

Nice! Good idea indeed :)

Could you mention this optimization in the docstring of encode_nested_example?

@eladsegal
Contributor Author

Thanks, done!

Member

@lhoestq lhoestq left a comment

Amazing, thanks!

@lhoestq lhoestq merged commit 69e8795 into huggingface:master Nov 2, 2021
@eladsegal eladsegal deleted the efficient-encode-nested-pr branch November 2, 2021 15:07