IterableDataset Arrow formatting #5821
Conversation
Will need to take #5810 into account if it gets merged before this one.
Thanks for the enhancement!!
Some comments below...
```python
ex_iterable = ExamplesIterable(Dataset._generate_examples_from_shards, kwargs={"shards": shards})
ex_iterable = ArrowExamplesIterable(
    Dataset._generate_tables_from_shards,
    kwargs={"shards": shards, "batch_size": config.DEFAULT_MAX_BATCH_SIZE},
)
```
I am wondering if we should let users pass a custom `batch_size`.
yea I'm not sure - we can wait for some feedback on this and improve later imo
src/datasets/iterable_dataset.py (outdated)
```python
def _convert_to_arrow(
    iterable: Iterable[Tuple[Key, dict]],
    batch_size: int,
    drop_last_batch=False,
```
Missing type hint here.
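Presumably the fix looks something like the following sketch, assembled from the hunks above and below rather than the committed diff; the `Key` alias here is an assumption:

```python
from typing import Iterable, Iterator, Tuple, Union

import pyarrow as pa

Key = Union[int, str]  # assumption: matches the module's Key alias

def _convert_to_arrow(
    iterable: Iterable[Tuple[Key, dict]],
    batch_size: int,
    drop_last_batch: bool = False,  # the previously missing type hint
) -> Iterator[Tuple[Key, pa.Table]]:
    ...
```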
src/datasets/iterable_dataset.py (outdated)
```python
    batch_size: int,
    drop_last_batch=False,
) -> Iterator[Tuple[Key, pa.Table]]:
    """Iterate over sub-tables of size `batch_size`.
```
The definition in the docstring is the same as the one for `_batch_arrow_tables` below. Maybe we should add to the descriptions that they expect different iterables...
The naming of the two functions could also be aligned: right now they have very different names even though their functionality is analogous...
I like the idea of having different names, because one is expensive (it contains "convert" in its name) while the other one is not (it simply batches data).
I fixed the docstring and type hint.
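For readers following along, here is a minimal sketch (mine, not the PR's code) of the contrast: a `_convert_to_arrow`-style function pays the Python-to-Arrow conversion cost, e.g. via `pa.Table.from_pylist`, whereas `_batch_arrow_tables` below only regroups data that is already in Arrow.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, Tuple

import pyarrow as pa

def convert_to_arrow_sketch(
    iterable: Iterable[Tuple[str, Dict]], batch_size: int
) -> Iterator[Tuple[str, pa.Table]]:
    # Group `batch_size` python-dict examples, then pay the expensive
    # python -> Arrow conversion once per batch.
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        keys, examples = zip(*batch)
        yield "_".join(map(str, keys)), pa.Table.from_pylist(list(examples))

# e.g. list(convert_to_arrow_sketch(((str(i), {"x": i}) for i in range(5)), 2))
# yields tables of 2, 2 and 1 rows with keys "0_1", "2_3" and "4"
```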
let me know if it sounds good to you now @albertvillanova :)
I think you could update the `IterableDataset.with_format` docstring, now that it supports the "arrow" format as well...
Just a comment about `_batch_arrow_tables` below.
```python
keys_buffer = []
chunks_buffer = []
chunks_buffer_size = 0
for key, pa_table in iterable:
    # Split each incoming table into record batches of at most `batch_size` rows.
    for chunk in pa_table.to_reader(max_chunksize=batch_size):
        if len(chunk) == 0:
            continue
        elif chunks_buffer_size + len(chunk) < batch_size:
            keys_buffer.append(key)
            chunks_buffer.append(chunk)
            chunks_buffer_size += len(chunk)
            continue
        elif chunks_buffer_size + len(chunk) == batch_size:
            keys_buffer.append(key)
            chunks_buffer.append(chunk)
            new_key = "_".join(str(_key) for _key in keys_buffer)
            yield new_key, pa.Table.from_batches(chunks_buffer)
            keys_buffer = []
            chunks_buffer = []
            chunks_buffer_size = 0
        else:
            # The chunk overflows the buffer: crop it to fill the current
            # batch exactly, and start the next buffer with the remainder.
            cropped_chunk_length = batch_size - chunks_buffer_size
            keys_buffer.append(f"{key}[:{cropped_chunk_length}]")
            chunks_buffer.append(chunk.slice(0, cropped_chunk_length))
            new_key = "_".join(str(_key) for _key in keys_buffer)
            yield new_key, pa.Table.from_batches(chunks_buffer)
            keys_buffer = [f"{key}[{cropped_chunk_length}:]"]
            chunks_buffer = [chunk.slice(cropped_chunk_length, len(chunk) - cropped_chunk_length)]
            chunks_buffer_size = len(chunk) - cropped_chunk_length
if not drop_last_batch and chunks_buffer:
    # Flush the partial final batch unless the caller asked to drop it.
    new_key = "_".join(str(_key) for _key in keys_buffer)
    yield new_key, pa.Table.from_batches(chunks_buffer)
```
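A quick, self-contained illustration (not part of the PR) of the two pyarrow primitives this loop builds on: `Table.to_reader(max_chunksize=...)` splits a table into record batches of bounded size, and `Table.from_batches` reassembles batches into a table.

```python
import pyarrow as pa

table = pa.table({"x": list(range(10))})

# Split into record batches of at most 4 rows: sizes 4, 4, 2.
chunks = list(table.to_reader(max_chunksize=4))
print([len(chunk) for chunk in chunks])  # [4, 4, 2]

# Reassemble the first two batches into a new 8-row table.
print(pa.Table.from_batches(chunks[:2]).num_rows)  # 8
```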
This function just rebatches, from the original batch size returned by `self.generate_tables_fn(**self.kwargs)` to the target `batch_size`. To do that, it generates batches from Tables (using `Table.to_reader`) and then Tables from batches (using `Table.from_batches`).
I am just wondering if the implementation could be simpler... Not sure though.
What do you think about a naive approach like this?
```python
from itertools import islice  # needed for islice below

iterator = iter(pa_table.slice(i, 1) for _, pa_table in iterable for i in range(pa_table.num_rows))
batch = True
key = 0
while batch:
    batch = list(islice(iterator, batch_size))
    if batch:
        if drop_last_batch and len(batch) < batch_size:
            continue
        yield key, pa.concat_tables(batch)
        key += 1
```
I think iterating on batches of size 1 and grouping them again is slower. It may also return tables with too many record batches (one batch per row) to be used efficiently.
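The second point is easy to check (illustration, not from the thread): concatenating one-row slices gives a table whose columns are chunked one record batch per row.

```python
import pyarrow as pa

table = pa.table({"x": list(range(4))})
combined = pa.concat_tables(table.slice(i, 1) for i in range(4))
print(combined.num_rows)         # 4
print(combined["x"].num_chunks)  # 4 -> one chunk per row
```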
Then, feel free to merge as it is...
After your commit 028822a, you say `with_format` only supports "arrow" or None. Maybe we should fix all the tests in test_iterable_dataset.py that contain `.with_format("torch")`?
Otherwise, feel free to merge as it is.
they're updated in #5852
Adding an optional `.iter_arrow` method to examples iterables. This allows using Arrow formatting in `map`/`filter`. It will also be useful for torch formatting, since we can reuse the TorchFormatter that converts Arrow data to torch tensors.

Related to #5793 and #3444. Required for #5852.

Example: speed x10 in `map`.
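The PR's original example snippet isn't preserved in this copy of the thread; below is an illustrative sketch of the feature as described, where the dataset name and the "text" column are placeholders and the vectorized function is my own:

```python
import pyarrow.compute as pc
from datasets import load_dataset

# Stream a dataset and ask for Arrow-formatted batches.
ds = load_dataset("my_text_dataset", split="train", streaming=True)  # placeholder name
ds = ds.with_format("arrow")

# With the "arrow" format, `map` receives pyarrow.Table batches, so the
# function can be fully vectorized instead of running row by row.
def uppercase(pa_table):
    index = pa_table.schema.get_field_index("text")  # assumes a "text" column
    return pa_table.set_column(index, "text", pc.utf8_upper(pa_table["text"]))

ds = ds.map(uppercase, batched=True)
```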
Implementation details

I added an optional `iter_arrow` method to examples iterables. If an examples iterable has this method, it can be used to iterate over the examples in batches of Arrow tables.
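A hedged sketch of that pattern (the names follow the PR's description; the body is illustrative, not the committed implementation):

```python
import pyarrow as pa

class ArrowExamplesIterableSketch:
    """Examples iterable with an optional Arrow fast path."""

    def __init__(self, generate_tables_fn, kwargs):
        self.generate_tables_fn = generate_tables_fn
        self.kwargs = kwargs

    def __iter__(self):
        # Regular path: decode each Arrow table into python-dict examples.
        for key, pa_table in self.generate_tables_fn(**self.kwargs):
            for i, example in enumerate(pa_table.to_pylist()):
                yield f"{key}_{i}", example

    def iter_arrow(self):
        # Arrow fast path: yield (key, pa.Table) batches directly, so
        # map/filter can skip the python conversion entirely.
        yield from self.generate_tables_fn(**self.kwargs)
```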