More efficient nested features encoding #3124

eladsegal · 2021-10-21T01:55:31Z

Nested encoding of features wastes a lot of time on operations that effectively do nothing when lists are used.
For example, if the input contains a list of integers, encode_nested_example will iterate over it and call encode_nested_example on every element, even though it just returns each int as is.

A similar issue is handled at an earlier stage when casting pytorch/tensorflow/pandas objects to python lists/numpy arrays:

datasets/src/datasets/features/features.py

Lines 149 to 156 in c98c23c

    
    def _cast_to_python_objects(obj: Any, only_1d_for_numpy: bool) -> Tuple[Any, bool]:
        """
        Cast pytorch/tensorflow/pandas objects to python numpy array/lists.
        It works recursively.

        To avoid iterating over possibly long lists, it first checks if the first element that is not None has to be casted.
        If the first element needs to be casted, then all the elements of the list will be casted, otherwise they'll stay the same.
        This trick allows to cast objects that contain tokenizers outputs without iterating over every single token for example.

datasets/src/datasets/features/features.py

Lines 212 to 228 in c98c23c

    
        elif isinstance(obj, (list, tuple)):
            if len(obj) > 0:
                for first_elmt in obj:
                    if first_elmt is not None:
                        break
                casted_first_elmt, has_changed_first_elmt = _cast_to_python_objects(
                    first_elmt, only_1d_for_numpy=only_1d_for_numpy
                )
                if has_changed_first_elmt:
                    return [_cast_to_python_objects(elmt, only_1d_for_numpy=only_1d_for_numpy)[0] for elmt in obj], True
                else:
                    if isinstance(obj, list):
                        return obj, False
                    else:
                        return list(obj), True
            else:
                return obj if isinstance(obj, list) else [], isinstance(obj, tuple)

In this pull request I suggest using the same approach in encode_nested_example.
In my setup this change gave a major speedup: loading the data was at least 4x faster.
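
To illustrate the idea, here is a minimal, self-contained sketch of the first-element check. It is not the code from this pull request: `encode_value` and `encode_nested` are hypothetical stand-ins, and the bytes-to-str conversion is just an arbitrary example of an element that needs encoding.

```python
from typing import Any, Tuple


def encode_value(value: Any) -> Any:
    """Hypothetical leaf encoder: only bytes need converting, everything else is returned as is."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def encode_nested(obj: Any) -> Tuple[Any, bool]:
    """Return (encoded_obj, has_changed).

    For lists and tuples, only the first element that is not None is probed.
    If it does not change when encoded, the whole list is assumed to need no
    encoding and is returned without iterating over the remaining elements.
    """
    if isinstance(obj, (list, tuple)):
        first = next((elmt for elmt in obj if elmt is not None), None)
        if first is not None:
            _, first_changed = encode_nested(first)
            if first_changed:
                # The probe changed, so every element has to be encoded.
                return [encode_nested(elmt)[0] for elmt in obj], True
        # Empty, all None, or the probe was unchanged: skip the per-element pass.
        if isinstance(obj, list):
            return obj, False
        return list(obj), True
    encoded = encode_value(obj)
    return encoded, encoded is not obj


ints = list(range(1000))
encoded_ints, ints_changed = encode_nested(ints)      # same list object, no per-element work
encoded_mixed, mixed_changed = encode_nested([None, b"a", b"b"])  # every element is encoded
```

The shortcut relies on lists being homogeneous: if the first non-None element needs no encoding but a later one does, the later one is left untouched. That is the trade-off accepted in exchange for not walking long lists.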

eladsegal · 2021-10-29T14:07:11Z

@lhoestq @albertvillanova @mariosasko
Can you please check this out?

lhoestq

Nice ! Good idea indeed :)

Could you mention this optimization in the docstring of encode_nested_example ?

eladsegal · 2021-10-29T16:14:45Z

Thanks, done!

lhoestq

Amazing thanks !

Update features.py

38a264e

eladsegal marked this pull request as draft October 21, 2021 02:03

eladsegal added 2 commits October 29, 2021 16:48

fix for empty lists

c848c86

fix syntax error

8d812e4

eladsegal marked this pull request as ready for review October 29, 2021 14:06

eladsegal added 2 commits October 29, 2021 17:09

remove unnecessary None check

a9adab3

Merge branch 'huggingface:master' into efficient-encode-nested-pr

09a7b82

lhoestq reviewed Oct 29, 2021


eladsegal added 3 commits October 29, 2021 18:30

add explanation in the docstring

a2bdb17

fix

1c8d793

black

7991a74

eladsegal requested a review from lhoestq October 29, 2021 16:15

lhoestq approved these changes Nov 2, 2021


lhoestq merged commit 69e8795 into huggingface:master Nov 2, 2021

eladsegal deleted the efficient-encode-nested-pr branch November 2, 2021 15:07

lhoestq mentioned this pull request Nov 2, 2021

Fix optimized encoding for arrays #3197

Merged
