Make IterableDataset decode example
albertvillanova committed Oct 21, 2021
1 parent 260b85c commit 99305c1
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion src/datasets/iterable_dataset.py
@@ -341,7 +341,10 @@ def __iter__(self):
         for key, example in self._iter():
             if self.features:
                 # we encode the example for ClassLabel feature types for example
-                yield self.features.encode_example(example)
+                encoded_example = self.features.encode_example(example)
+                # Decode example for Audio feature, e.g.
+                decoded_example = self.features.decode_example(encoded_example)
+                yield decoded_example
             else:
                 yield example
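
The change above makes iteration encode each example and then decode it before yielding. A minimal, self-contained sketch of that pattern follows. `ToyFeatures` and `iter_examples` are hypothetical stand-ins written for illustration only; they are not the `datasets` library's classes, and the fake "audio" decoding is an assumption about what a decode step might produce.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional, Tuple

Example = Dict[str, Any]


class ToyFeatures:
    """Hypothetical stand-in for a features object (not the datasets API)."""

    def __init__(self, label_names: List[str]) -> None:
        self.label_names = label_names

    def encode_example(self, example: Example) -> Example:
        # Encoding step: map the string label to its integer id (ClassLabel-like).
        encoded = dict(example)
        encoded["label"] = self.label_names.index(example["label"])
        return encoded

    def decode_example(self, example: Example) -> Example:
        # Decoding step: expand the stored audio path into a richer value
        # (Audio-like). A real decoder would read the file; this one does not.
        decoded = dict(example)
        decoded["audio"] = {"path": example["audio"], "array": None}
        return decoded


def iter_examples(
    pairs: Iterable[Tuple[int, Example]], features: Optional[ToyFeatures]
) -> Iterator[Example]:
    # Mirrors the control flow of the patched __iter__: encode, then decode.
    for _key, example in pairs:
        if features:
            encoded_example = features.encode_example(example)
            decoded_example = features.decode_example(encoded_example)
            yield decoded_example
        else:
            yield example


pairs = [(0, {"audio": "clip.wav", "label": "dog"})]
features = ToyFeatures(label_names=["cat", "dog"])
result = list(iter_examples(pairs, features))
```

In this sketch the label comes out as an integer id and the audio field as a dictionary, while passing `None` for `features` yields the raw example unchanged.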


1 comment on commit 99305c1

@github-actions

Show benchmarks

PyArrow==3.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009339 / 0.011353 (-0.002014) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003744 / 0.011008 (-0.007264) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031991 / 0.038508 (-0.006517) |
| read_batch_unformated after write_array2d | 0.036846 / 0.023109 (0.013737) |
| read_batch_unformated after write_flattened_sequence | 0.316619 / 0.275898 (0.040721) |
| read_batch_unformated after write_nested_sequence | 0.431654 / 0.323480 (0.108174) |
| read_col_formatted_as_numpy after write_array2d | 0.008041 / 0.007986 (0.000055) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004672 / 0.004328 (0.000343) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009192 / 0.004250 (0.004941) |
| read_col_unformated after write_array2d | 0.039428 / 0.037052 (0.002376) |
| read_col_unformated after write_flattened_sequence | 0.316193 / 0.258489 (0.057704) |
| read_col_unformated after write_nested_sequence | 0.359337 / 0.293841 (0.065496) |
| read_formatted_as_numpy after write_array2d | 0.023970 / 0.128546 (-0.104576) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008141 / 0.075646 (-0.067506) |
| read_formatted_as_numpy after write_nested_sequence | 0.255141 / 0.419271 (-0.164130) |
| read_unformated after write_array2d | 0.046428 / 0.043533 (0.002896) |
| read_unformated after write_flattened_sequence | 0.303632 / 0.255139 (0.048493) |
| read_unformated after write_nested_sequence | 0.344688 / 0.283200 (0.061488) |
| write_array2d | 0.082380 / 0.141683 (-0.059302) |
| write_flattened_sequence | 1.757912 / 1.452155 (0.305757) |
| write_nested_sequence | 1.789208 / 1.492716 (0.296492) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.213455 / 0.018006 (0.195448) |
| get_batch_of_1024_rows | 0.447543 / 0.000490 (0.447053) |
| get_first_row | 0.006440 / 0.000200 (0.006240) |
| get_last_row | 0.000369 / 0.000054 (0.000315) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.036612 / 0.037411 (-0.000800) |
| shard | 0.023719 / 0.014526 (0.009193) |
| shuffle | 0.029466 / 0.176557 (-0.147091) |
| sort | 0.124484 / 0.737135 (-0.612651) |
| train_test_split | 0.031198 / 0.296338 (-0.265140) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.414859 / 0.215209 (0.199650) |
| read 50000 | 4.146393 / 2.077655 (2.068738) |
| read_batch 50000 10 | 1.806400 / 1.504120 (0.302280) |
| read_batch 50000 100 | 1.603360 / 1.541195 (0.062165) |
| read_batch 50000 1000 | 1.668527 / 1.468490 (0.200037) |
| read_formatted numpy 5000 | 0.373918 / 4.584777 (-4.210859) |
| read_formatted pandas 5000 | 4.731675 / 3.745712 (0.985963) |
| read_formatted tensorflow 5000 | 0.947099 / 5.269862 (-4.322763) |
| read_formatted torch 5000 | 0.843218 / 4.565676 (-3.722458) |
| read_formatted_batch numpy 5000 10 | 0.041591 / 0.424275 (-0.382684) |
| read_formatted_batch numpy 5000 1000 | 0.004899 / 0.007607 (-0.002709) |
| shuffled read 5000 | 0.515158 / 0.226044 (0.289113) |
| shuffled read 50000 | 5.146732 / 2.268929 (2.877804) |
| shuffled read_batch 50000 10 | 2.223654 / 55.444624 (-53.220971) |
| shuffled read_batch 50000 100 | 1.897856 / 6.876477 (-4.978620) |
| shuffled read_batch 50000 1000 | 1.949211 / 2.142072 (-0.192862) |
| shuffled read_formatted numpy 5000 | 0.481491 / 4.805227 (-4.323736) |
| shuffled read_formatted_batch numpy 5000 10 | 0.103920 / 6.500664 (-6.396745) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.053232 / 0.075469 (-0.022237) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.569374 / 1.841788 (-0.272413) |
| map fast-tokenizer batched | 13.287128 / 8.074308 (5.212820) |
| map identity | 26.584053 / 10.191392 (16.392661) |
| map identity batched | 0.747845 / 0.680424 (0.067421) |
| map no-op batched | 0.518758 / 0.534201 (-0.015443) |
| map no-op batched numpy | 0.227040 / 0.579283 (-0.352243) |
| map no-op batched pandas | 0.512819 / 0.434364 (0.078455) |
| map no-op batched pytorch | 0.179627 / 0.540337 (-0.360710) |
| map no-op batched tensorflow | 0.190799 / 1.386936 (-1.196137) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009572 / 0.011353 (-0.001781) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003907 / 0.011008 (-0.007101) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.031707 / 0.038508 (-0.006801) |
| read_batch_unformated after write_array2d | 0.035119 / 0.023109 (0.012009) |
| read_batch_unformated after write_flattened_sequence | 0.278723 / 0.275898 (0.002825) |
| read_batch_unformated after write_nested_sequence | 0.312762 / 0.323480 (-0.010718) |
| read_col_formatted_as_numpy after write_array2d | 0.008082 / 0.007986 (0.000096) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.003454 / 0.004328 (-0.000875) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.009204 / 0.004250 (0.004954) |
| read_col_unformated after write_array2d | 0.044892 / 0.037052 (0.007840) |
| read_col_unformated after write_flattened_sequence | 0.277815 / 0.258489 (0.019326) |
| read_col_unformated after write_nested_sequence | 0.322631 / 0.293841 (0.028790) |
| read_formatted_as_numpy after write_array2d | 0.024562 / 0.128546 (-0.103984) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008270 / 0.075646 (-0.067376) |
| read_formatted_as_numpy after write_nested_sequence | 0.255003 / 0.419271 (-0.164268) |
| read_unformated after write_array2d | 0.048862 / 0.043533 (0.005329) |
| read_unformated after write_flattened_sequence | 0.274237 / 0.255139 (0.019098) |
| read_unformated after write_nested_sequence | 0.302417 / 0.283200 (0.019217) |
| write_array2d | 0.093125 / 0.141683 (-0.048558) |
| write_flattened_sequence | 1.727928 / 1.452155 (0.275774) |
| write_nested_sequence | 1.813764 / 1.492716 (0.321048) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.323680 / 0.018006 (0.305674) |
| get_batch_of_1024_rows | 0.444272 / 0.000490 (0.443782) |
| get_first_row | 0.060833 / 0.000200 (0.060633) |
| get_last_row | 0.000459 / 0.000054 (0.000405) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.034827 / 0.037411 (-0.002585) |
| shard | 0.021462 / 0.014526 (0.006936) |
| shuffle | 0.027623 / 0.176557 (-0.148934) |
| sort | 0.126156 / 0.737135 (-0.610979) |
| train_test_split | 0.029014 / 0.296338 (-0.267325) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.417104 / 0.215209 (0.201895) |
| read 50000 | 4.156437 / 2.077655 (2.078782) |
| read_batch 50000 10 | 1.790653 / 1.504120 (0.286534) |
| read_batch 50000 100 | 1.583524 / 1.541195 (0.042329) |
| read_batch 50000 1000 | 1.633973 / 1.468490 (0.165483) |
| read_formatted numpy 5000 | 0.380049 / 4.584777 (-4.204728) |
| read_formatted pandas 5000 | 4.705585 / 3.745712 (0.959873) |
| read_formatted tensorflow 5000 | 0.894151 / 5.269862 (-4.375711) |
| read_formatted torch 5000 | 0.833333 / 4.565676 (-3.732344) |
| read_formatted_batch numpy 5000 10 | 0.041513 / 0.424275 (-0.382762) |
| read_formatted_batch numpy 5000 1000 | 0.004878 / 0.007607 (-0.002729) |
| shuffled read 5000 | 0.516371 / 0.226044 (0.290326) |
| shuffled read 50000 | 5.144846 / 2.268929 (2.875917) |
| shuffled read_batch 50000 10 | 2.234894 / 55.444624 (-53.209730) |
| shuffled read_batch 50000 100 | 1.868558 / 6.876477 (-5.007918) |
| shuffled read_batch 50000 1000 | 1.909478 / 2.142072 (-0.232594) |
| shuffled read_formatted numpy 5000 | 0.481106 / 4.805227 (-4.324121) |
| shuffled read_formatted_batch numpy 5000 10 | 0.101860 / 6.500664 (-6.398804) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.051928 / 0.075469 (-0.023541) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.539157 / 1.841788 (-0.302631) |
| map fast-tokenizer batched | 13.180135 / 8.074308 (5.105827) |
| map identity | 26.960314 / 10.191392 (16.768922) |
| map identity batched | 0.792988 / 0.680424 (0.112564) |
| map no-op batched | 0.525640 / 0.534201 (-0.008561) |
| map no-op batched numpy | 0.229156 / 0.579283 (-0.350127) |
| map no-op batched pandas | 0.514017 / 0.434364 (0.079653) |
| map no-op batched pytorch | 0.189914 / 0.540337 (-0.350423) |
| map no-op batched tensorflow | 0.209370 / 1.386936 (-1.177566) |
