IterableDataset Arrow formatting #5821

Merged
merged 11 commits into from
May 31, 2023

Conversation

lhoestq
Member

@lhoestq lhoestq commented May 4, 2023

This PR adds an optional iter_arrow method to examples iterables, which makes it possible to use Arrow formatting in map/filter.

This will also be useful for torch formatting, since we can reuse the TorchFormatter that converts Arrow data to torch tensors.

Related to #5793 and #3444

Required for #5852

Example:

Roughly a x10 speed-up in map:

from datasets import Dataset
import pyarrow.compute as pc
import time


ds = Dataset.from_dict({"a": range(100_000)})


ids = ds.to_iterable_dataset()
ids = ids.map(lambda x: {"a": [a + 10 for a in x["a"]]}, batched=True)

_start = time.time()
print(f"Python ({sum(1 for _ in ids)} items):\t{(time.time() - _start) * 1000:.1f}ms")
# Python (100000 items):  695.7ms

ids = ds.to_iterable_dataset().with_format("arrow")
ids = ids.map(lambda t: t.set_column(0, "a", pc.add(t[0], 10)), batched=True)
ids = ids.with_format(None)

_start = time.time()
print(f"Arrow ({sum(1 for _ in ids)} items):\t{(time.time() - _start) * 1000:.1f}ms")
# Arrow (100000 items):   81.0ms

Implementation details

I added an optional iter_arrow method to examples iterables. If an examples iterable has this method, it can be used to iterate over the examples in batches of Arrow tables.
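As a rough illustration of that dispatch, here is a minimal, dependency-free sketch; the class and function names are hypothetical and plain dicts of column lists stand in for pyarrow tables, so this is an assumption about the shape of the mechanism, not the actual datasets implementation:

```python
class ExamplesIterable:
    """Simplified stand-in for an examples iterable yielding (key, example) pairs."""
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        for i, example in enumerate(self.data):
            yield i, example


class ArrowExamplesIterable(ExamplesIterable):
    """Variant that also exposes the optional iter_arrow method.

    In the PR this yields batches as pyarrow tables; here a dict of
    column lists stands in for a table to stay dependency-free.
    """
    def iter_arrow(self, batch_size=1000):
        for start in range(0, len(self.data), batch_size):
            batch = self.data[start:start + batch_size]
            # columnar "table": {column_name: list of values}
            yield start, {k: [ex[k] for ex in batch] for k in batch[0]}


def map_examples(ex_iterable, fn, formatting=None):
    # Use Arrow-style batches when the iterable supports them and the
    # user asked for "arrow" formatting; otherwise map example by example.
    if formatting == "arrow" and hasattr(ex_iterable, "iter_arrow"):
        for key, table in ex_iterable.iter_arrow():
            yield key, fn(table)
    else:
        for key, example in ex_iterable:
            yield key, fn(example)
```

With this shape, the same map call can transparently run a batched, columnar function when the underlying iterable opts into iter_arrow, and falls back to per-example Python dicts otherwise.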

@github-actions

github-actions bot commented May 4, 2023

Show benchmarks

PyArrow==8.0.0

[Flattened new / old (diff) timing tables for benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json and benchmark_map_filter.json]

PyArrow==latest

[Flattened new / old (diff) timing tables for the same five benchmark suites]

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 4, 2023

The documentation is not available anymore as the PR was closed or merged.

@github-actions
github-actions bot commented May 9, 2023

Show benchmarks

[Flattened new / old (diff) benchmark tables for PyArrow==8.0.0 and PyArrow==latest, same format as above]

@lhoestq lhoestq marked this pull request as ready for review May 10, 2023 11:40
@github-actions
Show benchmarks

[Flattened new / old (diff) benchmark tables for PyArrow==8.0.0 and PyArrow==latest, same format as above]

@lhoestq
Member Author

lhoestq commented May 19, 2023

Will need to take #5810 into account if it gets merged before this one.

@github-actions
Show benchmarks

[Flattened new / old (diff) benchmark tables for PyArrow==8.0.0 and PyArrow==latest, same format as above]


Member

@albertvillanova albertvillanova left a comment

Thanks for the enhancement!!
Some comments below...

ex_iterable = ExamplesIterable(Dataset._generate_examples_from_shards, kwargs={"shards": shards})
ex_iterable = ArrowExamplesIterable(
Dataset._generate_tables_from_shards,
kwargs={"shards": shards, "batch_size": config.DEFAULT_MAX_BATCH_SIZE},
Member

I am wondering if we should allow users to pass a custom batch_size.

Member Author

Yeah, I'm not sure - we can wait for some feedback on this and improve it later imo

def _convert_to_arrow(
iterable: Iterable[Tuple[Key, dict]],
batch_size: int,
drop_last_batch=False,
Member

Missing type hint here.

batch_size: int,
drop_last_batch=False,
) -> Iterator[Tuple[Key, pa.Table]]:
"""Iterate over sub-tables of size `batch_size`.
Member

The definition in the docstring is the same as the one for _batch_arrow_tables below. Maybe we should add to the descriptions that they expect different iterables...

Maybe the naming of both functions could also be aligned: right now they have very different names, whereas their functionality is analogous...

Member Author

I like the idea of having different names, because one is expensive (it contains "convert" in the name) while the other one is not (it simply batches data).

src/datasets/iterable_dataset.py (resolved)
@lhoestq
Member Author

lhoestq commented May 26, 2023

I fixed the docstring and type hint


@lhoestq
Member Author

lhoestq commented May 30, 2023

let me know if it sounds good to you now @albertvillanova :)

Member

@albertvillanova albertvillanova left a comment

I think you could update the IterableDataset.with_format docstring, now that it supports the "arrow" format as well...

Member

@albertvillanova albertvillanova left a comment

Just a comment about _batch_arrow_tables below.

Comment on lines +134 to +165
keys_buffer = []
chunks_buffer = []
chunks_buffer_size = 0
for key, pa_table in iterable:
for chunk in pa_table.to_reader(max_chunksize=batch_size):
if len(chunk) == 0:
continue
elif chunks_buffer_size + len(chunk) < batch_size:
keys_buffer.append(key)
chunks_buffer.append(chunk)
chunks_buffer_size += len(chunk)
continue
elif chunks_buffer_size + len(chunk) == batch_size:
keys_buffer.append(key)
chunks_buffer.append(chunk)
new_key = "_".join(str(_key) for _key in keys_buffer)
yield new_key, pa.Table.from_batches(chunks_buffer)
keys_buffer = []
chunks_buffer = []
chunks_buffer_size = 0
else:
cropped_chunk_length = batch_size - chunks_buffer_size
keys_buffer.append(f"{key}[:{cropped_chunk_length}]")
chunks_buffer.append(chunk.slice(0, cropped_chunk_length))
new_key = "_".join(str(_key) for _key in keys_buffer)
yield new_key, pa.Table.from_batches(chunks_buffer)
keys_buffer = [f"{key}[{cropped_chunk_length}:]"]
chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]
chunks_buffer_size = len(chunk) - cropped_chunk_length
if not drop_last_batch and chunks_buffer:
new_key = "_".join(str(_key) for _key in keys_buffer)
yield new_key, pa.Table.from_batches(chunks_buffer)
Member

This function just rebatches from the original batch size returned by self.generate_tables_fn(**self.kwargs) to the target batch_size. To do that, it generates batches from Tables (using Table.to_reader) and then Tables from batches (using Table.from_batches).

I am just wondering if the implementation could be simpler... Not sure though.

What do you think about a naive approach like?

    from itertools import islice

    iterator = iter(pa_table.slice(i, 1) for _, pa_table in iterable for i in range(pa_table.num_rows))
    batch = True
    key = 0
    while batch:
        batch = list(islice(iterator, batch_size))
        if batch:
            if drop_last_batch and len(batch) < batch_size:
                continue
            yield key, pa.concat_tables(batch)
            key += 1

Member Author

I think iterating on batches of size 1 and grouping them again is slower. It may also return tables with too many record batches (1 batch per row) to be used efficiently

Member

Then, feel free to merge as it is...


Member

@albertvillanova albertvillanova left a comment

After your commit 028822a, you say with_format only supports "arrow" or None.

Maybe we should fix all the tests in test_iterable_dataset.py that contain .with_format("torch")?

Otherwise, feel free to merge as it is.

@lhoestq
Member Author

lhoestq commented May 31, 2023

> Maybe we should fix all the tests in test_iterable_dataset.py that contain .with_format("torch")?

they're updated in #5852

@lhoestq lhoestq merged commit 7437d0f into main May 31, 2023
@lhoestq lhoestq deleted the iterable-arrow-formatting branch May 31, 2023 09:36
@github-actions

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005931 / 0.011353 (-0.005421) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004004 / 0.011008 (-0.007004) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.098632 / 0.038508 (0.060124) |
| read_batch_unformated after write_array2d | 0.027820 / 0.023109 (0.004711) |
| read_batch_unformated after write_flattened_sequence | 0.302944 / 0.275898 (0.027046) |
| read_batch_unformated after write_nested_sequence | 0.332684 / 0.323480 (0.009204) |
| read_col_formatted_as_numpy after write_array2d | 0.005529 / 0.007986 (-0.002457) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004814 / 0.004328 (0.000485) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.074477 / 0.004250 (0.070227) |
| read_col_unformated after write_array2d | 0.034875 / 0.037052 (-0.002178) |
| read_col_unformated after write_flattened_sequence | 0.304542 / 0.258489 (0.046053) |
| read_col_unformated after write_nested_sequence | 0.342853 / 0.293841 (0.049012) |
| read_formatted_as_numpy after write_array2d | 0.025263 / 0.128546 (-0.103283) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008558 / 0.075646 (-0.067089) |
| read_formatted_as_numpy after write_nested_sequence | 0.322522 / 0.419271 (-0.096750) |
| read_unformated after write_array2d | 0.043980 / 0.043533 (0.000447) |
| read_unformated after write_flattened_sequence | 0.306618 / 0.255139 (0.051479) |
| read_unformated after write_nested_sequence | 0.331692 / 0.283200 (0.048492) |
| write_array2d | 0.087434 / 0.141683 (-0.054248) |
| write_flattened_sequence | 1.464686 / 1.452155 (0.012531) |
| write_nested_sequence | 1.575038 / 1.492716 (0.082322) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.221920 / 0.018006 (0.203914) |
| get_batch_of_1024_rows | 0.417108 / 0.000490 (0.416619) |
| get_first_row | 0.004625 / 0.000200 (0.004425) |
| get_last_row | 0.000079 / 0.000054 (0.000024) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.023493 / 0.037411 (-0.013918) |
| shard | 0.096684 / 0.014526 (0.082158) |
| shuffle | 0.102035 / 0.176557 (-0.074522) |
| sort | 0.166609 / 0.737135 (-0.570526) |
| train_test_split | 0.107456 / 0.296338 (-0.188883) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.418713 / 0.215209 (0.203504) |
| read 50000 | 4.156913 / 2.077655 (2.079258) |
| read_batch 50000 10 | 1.869064 / 1.504120 (0.364944) |
| read_batch 50000 100 | 1.666219 / 1.541195 (0.125024) |
| read_batch 50000 1000 | 1.676491 / 1.468490 (0.208001) |
| read_formatted numpy 5000 | 0.553843 / 4.584777 (-4.030934) |
| read_formatted pandas 5000 | 3.380471 / 3.745712 (-0.365241) |
| read_formatted tensorflow 5000 | 2.970370 / 5.269862 (-2.299491) |
| read_formatted torch 5000 | 1.421597 / 4.565676 (-3.144080) |
| read_formatted_batch numpy 5000 10 | 0.068019 / 0.424275 (-0.356256) |
| read_formatted_batch numpy 5000 1000 | 0.012995 / 0.007607 (0.005387) |
| shuffled read 5000 | 0.519410 / 0.226044 (0.293365) |
| shuffled read 50000 | 5.198251 / 2.268929 (2.929323) |
| shuffled read_batch 50000 10 | 2.352969 / 55.444624 (-53.091655) |
| shuffled read_batch 50000 100 | 2.008981 / 6.876477 (-4.867496) |
| shuffled read_batch 50000 1000 | 2.066519 / 2.142072 (-0.075553) |
| shuffled read_formatted numpy 5000 | 0.658982 / 4.805227 (-4.146245) |
| shuffled read_formatted_batch numpy 5000 10 | 0.134341 / 6.500664 (-6.366323) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.065893 / 0.075469 (-0.009576) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.207509 / 1.841788 (-0.634279) |
| map fast-tokenizer batched | 13.863838 / 8.074308 (5.789530) |
| map identity | 13.363359 / 10.191392 (3.171967) |
| map identity batched | 0.129076 / 0.680424 (-0.551348) |
| map no-op batched | 0.016818 / 0.534201 (-0.517383) |
| map no-op batched numpy | 0.357956 / 0.579283 (-0.221327) |
| map no-op batched pandas | 0.386174 / 0.434364 (-0.048189) |
| map no-op batched pytorch | 0.418663 / 0.540337 (-0.121674) |
| map no-op batched tensorflow | 0.498708 / 1.386936 (-0.888228) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.006132 / 0.011353 (-0.005220) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004335 / 0.011008 (-0.006673) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.078517 / 0.038508 (0.040009) |
| read_batch_unformated after write_array2d | 0.027685 / 0.023109 (0.004576) |
| read_batch_unformated after write_flattened_sequence | 0.357956 / 0.275898 (0.082058) |
| read_batch_unformated after write_nested_sequence | 0.392397 / 0.323480 (0.068918) |
| read_col_formatted_as_numpy after write_array2d | 0.005364 / 0.007986 (-0.002622) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004922 / 0.004328 (0.000593) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.078061 / 0.004250 (0.073810) |
| read_col_unformated after write_array2d | 0.038889 / 0.037052 (0.001837) |
| read_col_unformated after write_flattened_sequence | 0.360952 / 0.258489 (0.102463) |
| read_col_unformated after write_nested_sequence | 0.402790 / 0.293841 (0.108949) |
| read_formatted_as_numpy after write_array2d | 0.025542 / 0.128546 (-0.103004) |
| read_formatted_as_numpy after write_flattened_sequence | 0.008718 / 0.075646 (-0.066929) |
| read_formatted_as_numpy after write_nested_sequence | 0.085799 / 0.419271 (-0.333472) |
| read_unformated after write_array2d | 0.044256 / 0.043533 (0.000723) |
| read_unformated after write_flattened_sequence | 0.358366 / 0.255139 (0.103227) |
| read_unformated after write_nested_sequence | 0.393500 / 0.283200 (0.110300) |
| write_array2d | 0.096382 / 0.141683 (-0.045301) |
| write_flattened_sequence | 1.530889 / 1.452155 (0.078735) |
| write_nested_sequence | 1.621007 / 1.492716 (0.128291) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.180572 / 0.018006 (0.162566) |
| get_batch_of_1024_rows | 0.429478 / 0.000490 (0.428988) |
| get_first_row | 0.002966 / 0.000200 (0.002766) |
| get_last_row | 0.000074 / 0.000054 (0.000020) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.024530 / 0.037411 (-0.012881) |
| shard | 0.101401 / 0.014526 (0.086875) |
| shuffle | 0.108208 / 0.176557 (-0.068349) |
| sort | 0.159582 / 0.737135 (-0.577554) |
| train_test_split | 0.111170 / 0.296338 (-0.185168) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.465768 / 0.215209 (0.250559) |
| read 50000 | 4.706311 / 2.077655 (2.628656) |
| read_batch 50000 10 | 2.437756 / 1.504120 (0.933636) |
| read_batch 50000 100 | 2.245694 / 1.541195 (0.704499) |
| read_batch 50000 1000 | 2.282637 / 1.468490 (0.814147) |
| read_formatted numpy 5000 | 0.552752 / 4.584777 (-4.032025) |
| read_formatted pandas 5000 | 3.432992 / 3.745712 (-0.312720) |
| read_formatted tensorflow 5000 | 1.800054 / 5.269862 (-3.469808) |
| read_formatted torch 5000 | 1.037852 / 4.565676 (-3.527824) |
| read_formatted_batch numpy 5000 10 | 0.068240 / 0.424275 (-0.356035) |
| read_formatted_batch numpy 5000 1000 | 0.012433 / 0.007607 (0.004826) |
| shuffled read 5000 | 0.574867 / 0.226044 (0.348822) |
| shuffled read 50000 | 5.707623 / 2.268929 (3.438695) |
| shuffled read_batch 50000 10 | 2.909746 / 55.444624 (-52.534878) |
| shuffled read_batch 50000 100 | 2.585423 / 6.876477 (-4.291054) |
| shuffled read_batch 50000 1000 | 2.636801 / 2.142072 (0.494729) |
| shuffled read_formatted numpy 5000 | 0.686593 / 4.805227 (-4.118634) |
| shuffled read_formatted_batch numpy 5000 10 | 0.136633 / 6.500664 (-6.364031) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.068598 / 0.075469 (-0.006871) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.286628 / 1.841788 (-0.555159) |
| map fast-tokenizer batched | 14.333258 / 8.074308 (6.258949) |
| map identity | 14.355793 / 10.191392 (4.164401) |
| map identity batched | 0.133459 / 0.680424 (-0.546965) |
| map no-op batched | 0.017090 / 0.534201 (-0.517111) |
| map no-op batched numpy | 0.358852 / 0.579283 (-0.220431) |
| map no-op batched pandas | 0.399929 / 0.434364 (-0.034435) |
| map no-op batched pytorch | 0.422838 / 0.540337 (-0.117500) |
| map no-op batched tensorflow | 0.515199 / 1.386936 (-0.871737) |
