Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 #6404

albertvillanova · 2023-11-13T09:15:39Z

Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248.
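
For context, the CVE-2023-47248 advisory describes unsafe deserialization in PyArrow's IPC and Parquet readers and lists versions 0.14.0 through 14.0.0 as affected, with 14.0.1 as the fixed release. The sketch below is illustrative only and is not the code changed in this PR: `is_affected_pyarrow` and `parse_release` are hypothetical helpers that check a version string against that range using plain tuple comparison.

```python
import re

# Range taken from the CVE-2023-47248 advisory: 0.14.0 <= version <= 14.0.0.
FIRST_AFFECTED = (0, 14, 0)
FIRST_FIXED = (14, 0, 1)


def parse_release(version: str) -> tuple:
    """Return the leading numeric release of a version string as a 3-tuple.

    Missing components count as 0; suffixes such as ".dev0" are ignored.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_affected_pyarrow(version: str) -> bool:
    """Hypothetical helper: True if the version lies in the affected range."""
    return FIRST_AFFECTED <= parse_release(version) < FIRST_FIXED
```

With this check, `is_affected_pyarrow("8.0.0")` is true and `is_affected_pyarrow("14.0.1")` is false. For installations that cannot upgrade, the advisory points to the separate `pyarrow-hotfix` package as a mitigation.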

github-actions · 2023-11-13T09:22:56Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005974 / 0.011353 (-0.005378)	0.003707 / 0.011008 (-0.007301)	0.079908 / 0.038508 (0.041399)	0.036891 / 0.023109 (0.013781)	0.390355 / 0.275898 (0.114457)	0.424439 / 0.323480 (0.100960)	0.004936 / 0.007986 (-0.003050)	0.002886 / 0.004328 (-0.001442)	0.062793 / 0.004250 (0.058542)	0.054192 / 0.037052 (0.017139)	0.394697 / 0.258489 (0.136208)	0.437775 / 0.293841 (0.143934)	0.027596 / 0.128546 (-0.100950)	0.008006 / 0.075646 (-0.067640)	0.262515 / 0.419271 (-0.156757)	0.071014 / 0.043533 (0.027481)	0.392964 / 0.255139 (0.137825)	0.417449 / 0.283200 (0.134249)	0.021819 / 0.141683 (-0.119864)	1.458083 / 1.452155 (0.005929)	1.489042 / 1.492716 (-0.003674)
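
Each cell in these tables packs three numbers: the value measured for this PR, the reference value, and their difference (new minus old; the difference appears to be computed before rounding, so its last digit can be off by one). A minimal sketch for pulling them apart, assuming tab-separated rows as shown here; `parse_cell` and `parse_row` are hypothetical names and not part of the benchmark tooling:

```python
import re

# Matches cells such as "0.005974 / 0.011353 (-0.005378)".
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*/\s*(-?\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell: str) -> tuple:
    """Split one 'new / old (diff)' cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(row: str) -> list:
    """Parse a tab-separated data row, skipping the leading row label."""
    _label, *cells = row.split("\t")
    return [parse_cell(cell) for cell in cells]


new, old, diff = parse_cell("0.005974 / 0.011353 (-0.005378)")
# The reported diff matches new - old up to rounding in the last digit.
assert abs((new - old) - diff) < 5e-6
```

A negative difference means the PR run measured a smaller value than the reference for that metric.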

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.230303 / 0.018006 (0.212297)	0.439361 / 0.000490 (0.438871)	0.010615 / 0.000200 (0.010415)	0.000303 / 0.000054 (0.000249)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026600 / 0.037411 (-0.010811)	0.078605 / 0.014526 (0.064079)	0.088552 / 0.176557 (-0.088005)	0.149429 / 0.737135 (-0.587706)	0.087921 / 0.296338 (-0.208417)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422063 / 0.215209 (0.206854)	4.201333 / 2.077655 (2.123678)	1.982284 / 1.504120 (0.478164)	1.779625 / 1.541195 (0.238431)	1.872454 / 1.468490 (0.403964)	0.502713 / 4.584777 (-4.082063)	3.103372 / 3.745712 (-0.642340)	3.030516 / 5.269862 (-2.239346)	1.909123 / 4.565676 (-2.656554)	0.057134 / 0.424275 (-0.367141)	0.006405 / 0.007607 (-0.001202)	0.494452 / 0.226044 (0.268408)	4.839345 / 2.268929 (2.570417)	2.424721 / 55.444624 (-53.019904)	2.028618 / 6.876477 (-4.847859)	2.082528 / 2.142072 (-0.059545)	0.587396 / 4.805227 (-4.217831)	0.125013 / 6.500664 (-6.375651)	0.061369 / 0.075469 (-0.014100)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.235799 / 1.841788 (-0.605989)	17.919977 / 8.074308 (9.845669)	13.868524 / 10.191392 (3.677132)	0.146058 / 0.680424 (-0.534366)	0.016826 / 0.534201 (-0.517375)	0.337512 / 0.579283 (-0.241771)	0.390263 / 0.434364 (-0.044101)	0.385336 / 0.540337 (-0.155001)	0.566004 / 1.386936 (-0.820932)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006537 / 0.011353 (-0.004816)	0.003787 / 0.011008 (-0.007221)	0.062568 / 0.038508 (0.024060)	0.066672 / 0.023109 (0.043563)	0.420447 / 0.275898 (0.144549)	0.457260 / 0.323480 (0.133780)	0.005005 / 0.007986 (-0.002981)	0.003037 / 0.004328 (-0.001291)	0.062095 / 0.004250 (0.057844)	0.049619 / 0.037052 (0.012567)	0.429935 / 0.258489 (0.171446)	0.471566 / 0.293841 (0.177725)	0.029688 / 0.128546 (-0.098859)	0.008028 / 0.075646 (-0.067619)	0.067915 / 0.419271 (-0.351356)	0.042066 / 0.043533 (-0.001467)	0.419275 / 0.255139 (0.164136)	0.444819 / 0.283200 (0.161619)	0.020100 / 0.141683 (-0.121583)	1.439057 / 1.452155 (-0.013098)	1.495657 / 1.492716 (0.002940)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.211148 / 0.018006 (0.193142)	0.423777 / 0.000490 (0.423288)	0.005892 / 0.000200 (0.005693)	0.000086 / 0.000054 (0.000032)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026469 / 0.037411 (-0.010942)	0.081438 / 0.014526 (0.066912)	0.092007 / 0.176557 (-0.084550)	0.143433 / 0.737135 (-0.593703)	0.093039 / 0.296338 (-0.203300)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.410468 / 0.215209 (0.195259)	4.083783 / 2.077655 (2.006128)	2.234501 / 1.504120 (0.730381)	2.122323 / 1.541195 (0.581128)	2.255036 / 1.468490 (0.786546)	0.497712 / 4.584777 (-4.087065)	3.231187 / 3.745712 (-0.514525)	3.005399 / 5.269862 (-2.264463)	1.909516 / 4.565676 (-2.656161)	0.057529 / 0.424275 (-0.366746)	0.006475 / 0.007607 (-0.001132)	0.477282 / 0.226044 (0.251238)	4.799566 / 2.268929 (2.530637)	2.497070 / 55.444624 (-52.947554)	2.206359 / 6.876477 (-4.670118)	2.281614 / 2.142072 (0.139541)	0.581710 / 4.805227 (-4.223518)	0.121572 / 6.500664 (-6.379092)	0.058774 / 0.075469 (-0.016695)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.301880 / 1.841788 (-0.539908)	18.287330 / 8.074308 (10.213021)	14.939642 / 10.191392 (4.748250)	0.153941 / 0.680424 (-0.526483)	0.018345 / 0.534201 (-0.515856)	0.335986 / 0.579283 (-0.243297)	0.384264 / 0.434364 (-0.050099)	0.393115 / 0.540337 (-0.147223)	0.573343 / 1.386936 (-0.813594)

github-actions · 2023-11-13T10:08:39Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004805 / 0.011353 (-0.006548)	0.003261 / 0.011008 (-0.007747)	0.061585 / 0.038508 (0.023077)	0.030236 / 0.023109 (0.007127)	0.234767 / 0.275898 (-0.041131)	0.260478 / 0.323480 (-0.063002)	0.004121 / 0.007986 (-0.003865)	0.002525 / 0.004328 (-0.001803)	0.048213 / 0.004250 (0.043962)	0.045229 / 0.037052 (0.008176)	0.245143 / 0.258489 (-0.013346)	0.271818 / 0.293841 (-0.022023)	0.023594 / 0.128546 (-0.104952)	0.007335 / 0.075646 (-0.068311)	0.206246 / 0.419271 (-0.213026)	0.060783 / 0.043533 (0.017250)	0.238588 / 0.255139 (-0.016551)	0.274985 / 0.283200 (-0.008214)	0.018342 / 0.141683 (-0.123341)	1.135445 / 1.452155 (-0.316710)	1.184836 / 1.492716 (-0.307881)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.095603 / 0.018006 (0.077597)	0.290340 / 0.000490 (0.289850)	0.000219 / 0.000200 (0.000019)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018804 / 0.037411 (-0.018607)	0.062525 / 0.014526 (0.047999)	0.074797 / 0.176557 (-0.101760)	0.120360 / 0.737135 (-0.616775)	0.076182 / 0.296338 (-0.220156)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.274981 / 0.215209 (0.059772)	2.684931 / 2.077655 (0.607276)	1.453845 / 1.504120 (-0.050275)	1.348361 / 1.541195 (-0.192834)	1.402820 / 1.468490 (-0.065670)	0.396311 / 4.584777 (-4.188466)	2.396314 / 3.745712 (-1.349398)	2.744379 / 5.269862 (-2.525482)	1.615268 / 4.565676 (-2.950409)	0.045920 / 0.424275 (-0.378355)	0.004844 / 0.007607 (-0.002763)	0.331132 / 0.226044 (0.105087)	3.325484 / 2.268929 (1.056556)	1.845734 / 55.444624 (-53.598890)	1.537268 / 6.876477 (-5.339209)	1.565155 / 2.142072 (-0.576918)	0.480032 / 4.805227 (-4.325195)	0.099917 / 6.500664 (-6.400747)	0.042276 / 0.075469 (-0.033193)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.973128 / 1.841788 (-0.868660)	12.643790 / 8.074308 (4.569482)	10.319586 / 10.191392 (0.128194)	0.131733 / 0.680424 (-0.548691)	0.014849 / 0.534201 (-0.519352)	0.270960 / 0.579283 (-0.308323)	0.265409 / 0.434364 (-0.168955)	0.309073 / 0.540337 (-0.231264)	0.466204 / 1.386936 (-0.920732)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005067 / 0.011353 (-0.006286)	0.003344 / 0.011008 (-0.007665)	0.047917 / 0.038508 (0.009409)	0.059556 / 0.023109 (0.036447)	0.275777 / 0.275898 (-0.000121)	0.299703 / 0.323480 (-0.023777)	0.004185 / 0.007986 (-0.003801)	0.002602 / 0.004328 (-0.001726)	0.048723 / 0.004250 (0.044472)	0.040686 / 0.037052 (0.003634)	0.281078 / 0.258489 (0.022589)	0.314725 / 0.293841 (0.020885)	0.024645 / 0.128546 (-0.103901)	0.007465 / 0.075646 (-0.068182)	0.053827 / 0.419271 (-0.365445)	0.033395 / 0.043533 (-0.010138)	0.273675 / 0.255139 (0.018536)	0.291261 / 0.283200 (0.008062)	0.019733 / 0.141683 (-0.121950)	1.134084 / 1.452155 (-0.318071)	1.189186 / 1.492716 (-0.303531)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.114960 / 0.018006 (0.096954)	0.308800 / 0.000490 (0.308311)	0.000237 / 0.000200 (0.000037)	0.000061 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021633 / 0.037411 (-0.015778)	0.073192 / 0.014526 (0.058666)	0.081598 / 0.176557 (-0.094959)	0.123085 / 0.737135 (-0.614050)	0.088677 / 0.296338 (-0.207661)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.300865 / 0.215209 (0.085656)	2.956847 / 2.077655 (0.879192)	1.613890 / 1.504120 (0.109770)	1.494074 / 1.541195 (-0.047121)	1.550345 / 1.468490 (0.081855)	0.408880 / 4.584777 (-4.175897)	2.422848 / 3.745712 (-1.322865)	2.690623 / 5.269862 (-2.579239)	1.546922 / 4.565676 (-3.018755)	0.047192 / 0.424275 (-0.377083)	0.004882 / 0.007607 (-0.002725)	0.360625 / 0.226044 (0.134580)	3.512678 / 2.268929 (1.243749)	1.978633 / 55.444624 (-53.465992)	1.686927 / 6.876477 (-5.189549)	1.748387 / 2.142072 (-0.393685)	0.480780 / 4.805227 (-4.324447)	0.099163 / 6.500664 (-6.401501)	0.041194 / 0.075469 (-0.034275)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.989087 / 1.841788 (-0.852700)	12.341951 / 8.074308 (4.267643)	11.109329 / 10.191392 (0.917936)	0.143329 / 0.680424 (-0.537095)	0.015565 / 0.534201 (-0.518636)	0.269532 / 0.579283 (-0.309751)	0.274899 / 0.434364 (-0.159465)	0.309308 / 0.540337 (-0.231030)	0.439651 / 1.386936 (-0.947285)

github-actions · 2023-11-13T10:25:20Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007880 / 0.011353 (-0.003473)	0.004386 / 0.011008 (-0.006622)	0.099067 / 0.038508 (0.060559)	0.048036 / 0.023109 (0.024927)	0.368349 / 0.275898 (0.092451)	0.400052 / 0.323480 (0.076572)	0.004493 / 0.007986 (-0.003493)	0.003732 / 0.004328 (-0.000597)	0.076153 / 0.004250 (0.071902)	0.071024 / 0.037052 (0.033972)	0.379771 / 0.258489 (0.121282)	0.425005 / 0.293841 (0.131164)	0.036092 / 0.128546 (-0.092454)	0.009825 / 0.075646 (-0.065822)	0.340217 / 0.419271 (-0.079055)	0.089571 / 0.043533 (0.046038)	0.371426 / 0.255139 (0.116287)	0.397864 / 0.283200 (0.114664)	0.029440 / 0.141683 (-0.112243)	1.778100 / 1.452155 (0.325945)	1.857202 / 1.492716 (0.364486)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.254022 / 0.018006 (0.236015)	0.549844 / 0.000490 (0.549354)	0.012824 / 0.000200 (0.012624)	0.000378 / 0.000054 (0.000324)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032334 / 0.037411 (-0.005077)	0.096101 / 0.014526 (0.081576)	0.117825 / 0.176557 (-0.058731)	0.179277 / 0.737135 (-0.557858)	0.112614 / 0.296338 (-0.183724)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.455051 / 0.215209 (0.239842)	4.537086 / 2.077655 (2.459431)	2.198662 / 1.504120 (0.694542)	1.982772 / 1.541195 (0.441578)	2.058673 / 1.468490 (0.590182)	0.569268 / 4.584777 (-4.015509)	4.095000 / 3.745712 (0.349288)	3.891680 / 5.269862 (-1.378182)	2.345129 / 4.565676 (-2.220548)	0.066974 / 0.424275 (-0.357301)	0.008557 / 0.007607 (0.000950)	0.545290 / 0.226044 (0.319245)	5.453377 / 2.268929 (3.184448)	2.858688 / 55.444624 (-52.585936)	2.502367 / 6.876477 (-4.374109)	2.515658 / 2.142072 (0.373586)	0.681423 / 4.805227 (-4.123804)	0.155975 / 6.500664 (-6.344689)	0.070872 / 0.075469 (-0.004597)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.474674 / 1.841788 (-0.367114)	21.653619 / 8.074308 (13.579311)	16.277111 / 10.191392 (6.085719)	0.166445 / 0.680424 (-0.513979)	0.021676 / 0.534201 (-0.512525)	0.466949 / 0.579283 (-0.112334)	0.500953 / 0.434364 (0.066589)	0.540413 / 0.540337 (0.000076)	0.792989 / 1.386936 (-0.593947)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007633 / 0.011353 (-0.003720)	0.004468 / 0.011008 (-0.006540)	0.075573 / 0.038508 (0.037065)	0.081174 / 0.023109 (0.058064)	0.440741 / 0.275898 (0.164843)	0.489493 / 0.323480 (0.166013)	0.006180 / 0.007986 (-0.001805)	0.003693 / 0.004328 (-0.000636)	0.074692 / 0.004250 (0.070441)	0.061732 / 0.037052 (0.024680)	0.460391 / 0.258489 (0.201902)	0.505575 / 0.293841 (0.211734)	0.037692 / 0.128546 (-0.090854)	0.009870 / 0.075646 (-0.065776)	0.083830 / 0.419271 (-0.335442)	0.056255 / 0.043533 (0.012723)	0.439330 / 0.255139 (0.184191)	0.475598 / 0.283200 (0.192399)	0.026626 / 0.141683 (-0.115056)	1.794410 / 1.452155 (0.342255)	1.882510 / 1.492716 (0.389794)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.236194 / 0.018006 (0.218187)	0.486109 / 0.000490 (0.485619)	0.006652 / 0.000200 (0.006453)	0.000108 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037277 / 0.037411 (-0.000134)	0.108904 / 0.014526 (0.094378)	0.122699 / 0.176557 (-0.053857)	0.182388 / 0.737135 (-0.554747)	0.122826 / 0.296338 (-0.173512)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.485989 / 0.215209 (0.270780)	4.913263 / 2.077655 (2.835609)	2.571618 / 1.504120 (1.067498)	2.401248 / 1.541195 (0.860054)	2.501117 / 1.468490 (1.032627)	0.570989 / 4.584777 (-4.013788)	4.107420 / 3.745712 (0.361708)	3.814977 / 5.269862 (-1.454885)	2.282539 / 4.565676 (-2.283138)	0.067765 / 0.424275 (-0.356511)	0.008561 / 0.007607 (0.000954)	0.584515 / 0.226044 (0.358471)	5.817821 / 2.268929 (3.548893)	3.211202 / 55.444624 (-52.233422)	2.764480 / 6.876477 (-4.111996)	2.807301 / 2.142072 (0.665229)	0.676882 / 4.805227 (-4.128346)	0.150124 / 6.500664 (-6.350540)	0.067205 / 0.075469 (-0.008265)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.594945 / 1.841788 (-0.246843)	22.533511 / 8.074308 (14.459203)	17.099693 / 10.191392 (6.908301)	0.195954 / 0.680424 (-0.484470)	0.023968 / 0.534201 (-0.510233)	0.471337 / 0.579283 (-0.107946)	0.491017 / 0.434364 (0.056653)	0.561342 / 0.540337 (0.021004)	0.797116 / 1.386936 (-0.589820)

github-actions · 2023-11-13T10:49:50Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006235 / 0.011353 (-0.005118)	0.003688 / 0.011008 (-0.007321)	0.080801 / 0.038508 (0.042293)	0.036243 / 0.023109 (0.013134)	0.312173 / 0.275898 (0.036275)	0.346239 / 0.323480 (0.022759)	0.003429 / 0.007986 (-0.004556)	0.003806 / 0.004328 (-0.000523)	0.063236 / 0.004250 (0.058986)	0.053229 / 0.037052 (0.016177)	0.315184 / 0.258489 (0.056695)	0.360124 / 0.293841 (0.066283)	0.027447 / 0.128546 (-0.101099)	0.008029 / 0.075646 (-0.067618)	0.262766 / 0.419271 (-0.156505)	0.068421 / 0.043533 (0.024888)	0.309028 / 0.255139 (0.053889)	0.345859 / 0.283200 (0.062659)	0.021388 / 0.141683 (-0.120295)	1.452807 / 1.452155 (0.000652)	1.502803 / 1.492716 (0.010087)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.211297 / 0.018006 (0.193291)	0.423364 / 0.000490 (0.422874)	0.004574 / 0.000200 (0.004374)	0.000272 / 0.000054 (0.000218)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023805 / 0.037411 (-0.013606)	0.072309 / 0.014526 (0.057783)	0.083274 / 0.176557 (-0.093283)	0.143594 / 0.737135 (-0.593541)	0.083777 / 0.296338 (-0.212561)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.415691 / 0.215209 (0.200482)	4.128621 / 2.077655 (2.050967)	1.931128 / 1.504120 (0.427008)	1.737486 / 1.541195 (0.196292)	1.806314 / 1.468490 (0.337823)	0.501405 / 4.584777 (-4.083372)	3.082042 / 3.745712 (-0.663670)	2.980224 / 5.269862 (-2.289637)	1.879780 / 4.565676 (-2.685897)	0.057546 / 0.424275 (-0.366729)	0.006422 / 0.007607 (-0.001186)	0.479813 / 0.226044 (0.253768)	4.854497 / 2.268929 (2.585568)	2.529674 / 55.444624 (-52.914950)	2.283041 / 6.876477 (-4.593436)	2.377173 / 2.142072 (0.235101)	0.589654 / 4.805227 (-4.215573)	0.126190 / 6.500664 (-6.374474)	0.062391 / 0.075469 (-0.013079)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.232023 / 1.841788 (-0.609764)	17.576621 / 8.074308 (9.502313)	13.437075 / 10.191392 (3.245683)	0.143367 / 0.680424 (-0.537057)	0.016638 / 0.534201 (-0.517563)	0.332806 / 0.579283 (-0.246477)	0.356029 / 0.434364 (-0.078335)	0.385610 / 0.540337 (-0.154727)	0.563268 / 1.386936 (-0.823668)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006293 / 0.011353 (-0.005060)	0.003692 / 0.011008 (-0.007317)	0.062075 / 0.038508 (0.023567)	0.062104 / 0.023109 (0.038995)	0.407478 / 0.275898 (0.131580)	0.434982 / 0.323480 (0.111502)	0.004889 / 0.007986 (-0.003097)	0.002915 / 0.004328 (-0.001413)	0.061426 / 0.004250 (0.057176)	0.048027 / 0.037052 (0.010974)	0.410504 / 0.258489 (0.152015)	0.435383 / 0.293841 (0.141542)	0.029419 / 0.128546 (-0.099127)	0.008275 / 0.075646 (-0.067371)	0.067796 / 0.419271 (-0.351476)	0.041696 / 0.043533 (-0.001837)	0.398882 / 0.255139 (0.143743)	0.419480 / 0.283200 (0.136281)	0.021519 / 0.141683 (-0.120164)	1.436961 / 1.452155 (-0.015194)	1.507961 / 1.492716 (0.015245)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223190 / 0.018006 (0.205184)	0.416281 / 0.000490 (0.415791)	0.003370 / 0.000200 (0.003170)	0.000080 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025923 / 0.037411 (-0.011488)	0.079989 / 0.014526 (0.065463)	0.091289 / 0.176557 (-0.085268)	0.141212 / 0.737135 (-0.595923)	0.091717 / 0.296338 (-0.204622)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434640 / 0.215209 (0.219431)	4.326154 / 2.077655 (2.248500)	2.364845 / 1.504120 (0.860725)	2.194040 / 1.541195 (0.652846)	2.276665 / 1.468490 (0.808175)	0.501879 / 4.584777 (-4.082898)	3.073307 / 3.745712 (-0.672405)	2.893823 / 5.269862 (-2.376039)	1.820594 / 4.565676 (-2.745083)	0.057595 / 0.424275 (-0.366680)	0.006516 / 0.007607 (-0.001091)	0.513633 / 0.226044 (0.287589)	5.104799 / 2.268929 (2.835870)	2.845025 / 55.444624 (-52.599599)	2.513852 / 6.876477 (-4.362624)	2.561044 / 2.142072 (0.418972)	0.582711 / 4.805227 (-4.222516)	0.120631 / 6.500664 (-6.380034)	0.056738 / 0.075469 (-0.018731)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.303370 / 1.841788 (-0.538418)	18.023568 / 8.074308 (9.949259)	14.637973 / 10.191392 (4.446581)	0.145182 / 0.680424 (-0.535241)	0.018061 / 0.534201 (-0.516140)	0.333219 / 0.579283 (-0.246065)	0.373184 / 0.434364 (-0.061180)	0.388176 / 0.540337 (-0.152161)	0.564752 / 1.386936 (-0.822184)

github-actions · 2023-11-13T10:57:13Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007230 / 0.011353 (-0.004122)	0.003727 / 0.011008 (-0.007281)	0.078893 / 0.038508 (0.040385)	0.042600 / 0.023109 (0.019491)	0.301905 / 0.275898 (0.026007)	0.328478 / 0.323480 (0.004998)	0.003960 / 0.007986 (-0.004026)	0.004530 / 0.004328 (0.000201)	0.059446 / 0.004250 (0.055196)	0.061241 / 0.037052 (0.024189)	0.301878 / 0.258489 (0.043389)	0.340935 / 0.293841 (0.047095)	0.030559 / 0.128546 (-0.097988)	0.008016 / 0.075646 (-0.067630)	0.305174 / 0.419271 (-0.114097)	0.080374 / 0.043533 (0.036842)	0.307162 / 0.255139 (0.052023)	0.342459 / 0.283200 (0.059259)	0.025881 / 0.141683 (-0.115801)	1.443311 / 1.452155 (-0.008844)	1.631060 / 1.492716 (0.138344)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242676 / 0.018006 (0.224670)	0.463941 / 0.000490 (0.463451)	0.007762 / 0.000200 (0.007562)	0.000582 / 0.000054 (0.000527)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027334 / 0.037411 (-0.010077)	0.078910 / 0.014526 (0.064384)	0.091399 / 0.176557 (-0.085157)	0.143318 / 0.737135 (-0.593818)	0.089761 / 0.296338 (-0.206577)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.463002 / 0.215209 (0.247793)	4.627235 / 2.077655 (2.549580)	2.256699 / 1.504120 (0.752579)	2.057615 / 1.541195 (0.516421)	2.126424 / 1.468490 (0.657934)	0.571969 / 4.584777 (-4.012808)	4.130260 / 3.745712 (0.384548)	3.833521 / 5.269862 (-1.436341)	2.320141 / 4.565676 (-2.245535)	0.067587 / 0.424275 (-0.356688)	0.008452 / 0.007607 (0.000845)	0.546478 / 0.226044 (0.320433)	5.070678 / 2.268929 (2.801750)	2.325387 / 55.444624 (-53.119237)	2.044041 / 6.876477 (-4.832435)	2.019714 / 2.142072 (-0.122358)	0.563589 / 4.805227 (-4.241639)	0.135269 / 6.500664 (-6.365395)	0.058208 / 0.075469 (-0.017261)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.283156 / 1.841788 (-0.558631)	18.617776 / 8.074308 (10.543468)	13.360700 / 10.191392 (3.169308)	0.160001 / 0.680424 (-0.520423)	0.021538 / 0.534201 (-0.512663)	0.384169 / 0.579283 (-0.195114)	0.407517 / 0.434364 (-0.026847)	0.427295 / 0.540337 (-0.113042)	0.655288 / 1.386936 (-0.731648)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006854 / 0.011353 (-0.004499)	0.003442 / 0.011008 (-0.007566)	0.060622 / 0.038508 (0.022114)	0.074649 / 0.023109 (0.051540)	0.341733 / 0.275898 (0.065835)	0.360096 / 0.323480 (0.036616)	0.006235 / 0.007986 (-0.001751)	0.003447 / 0.004328 (-0.000882)	0.057301 / 0.004250 (0.053051)	0.059022 / 0.037052 (0.021970)	0.369523 / 0.258489 (0.111034)	0.386280 / 0.293841 (0.092439)	0.034319 / 0.128546 (-0.094228)	0.008291 / 0.075646 (-0.067355)	0.070403 / 0.419271 (-0.348868)	0.050433 / 0.043533 (0.006901)	0.347262 / 0.255139 (0.092123)	0.380543 / 0.283200 (0.097343)	0.024492 / 0.141683 (-0.117191)	1.446721 / 1.452155 (-0.005433)	1.541614 / 1.492716 (0.048898)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.226148 / 0.018006 (0.208142)	0.442150 / 0.000490 (0.441660)	0.004997 / 0.000200 (0.004797)	0.000096 / 0.000054 (0.000041)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032866 / 0.037411 (-0.004546)	0.088097 / 0.014526 (0.073571)	0.102178 / 0.176557 (-0.074379)	0.151129 / 0.737135 (-0.586006)	0.103953 / 0.296338 (-0.192386)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.376701 / 0.215209 (0.161492)	3.886997 / 2.077655 (1.809342)	2.027143 / 1.504120 (0.523023)	1.808647 / 1.541195 (0.267453)	1.867664 / 1.468490 (0.399173)	0.459487 / 4.584777 (-4.125290)	3.640801 / 3.745712 (-0.104911)	3.242512 / 5.269862 (-2.027350)	1.889174 / 4.565676 (-2.676503)	0.052415 / 0.424275 (-0.371860)	0.007479 / 0.007607 (-0.000128)	0.457706 / 0.226044 (0.231662)	4.815041 / 2.268929 (2.546112)	2.542470 / 55.444624 (-52.902154)	2.137084 / 6.876477 (-4.739392)	2.122867 / 2.142072 (-0.019205)	0.553756 / 4.805227 (-4.251471)	0.118902 / 6.500664 (-6.381763)	0.058149 / 0.075469 (-0.017320)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.272615 / 1.841788 (-0.569173)	19.455709 / 8.074308 (11.381401)	14.111693 / 10.191392 (3.920301)	0.165741 / 0.680424 (-0.514683)	0.023680 / 0.534201 (-0.510521)	0.431458 / 0.579283 (-0.147825)	0.433612 / 0.434364 (-0.000752)	0.465615 / 0.540337 (-0.074722)	0.678177 / 1.386936 (-0.708759)

github-actions · 2023-11-13T11:14:18Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004870 / 0.011353 (-0.006483)	0.002834 / 0.011008 (-0.008175)	0.061359 / 0.038508 (0.022851)	0.031286 / 0.023109 (0.008177)	0.236701 / 0.275898 (-0.039197)	0.258139 / 0.323480 (-0.065341)	0.002943 / 0.007986 (-0.005043)	0.002989 / 0.004328 (-0.001339)	0.048046 / 0.004250 (0.043796)	0.044927 / 0.037052 (0.007874)	0.241339 / 0.258489 (-0.017151)	0.273912 / 0.293841 (-0.019929)	0.023427 / 0.128546 (-0.105119)	0.007251 / 0.075646 (-0.068395)	0.202730 / 0.419271 (-0.216542)	0.056223 / 0.043533 (0.012691)	0.239908 / 0.255139 (-0.015231)	0.254723 / 0.283200 (-0.028476)	0.018223 / 0.141683 (-0.123460)	1.119691 / 1.452155 (-0.332464)	1.163802 / 1.492716 (-0.328915)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091303 / 0.018006 (0.073297)	0.302097 / 0.000490 (0.301607)	0.000214 / 0.000200 (0.000014)	0.000044 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018201 / 0.037411 (-0.019210)	0.062092 / 0.014526 (0.047566)	0.074806 / 0.176557 (-0.101751)	0.119625 / 0.737135 (-0.617510)	0.074680 / 0.296338 (-0.221659)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.281140 / 0.215209 (0.065931)	2.752094 / 2.077655 (0.674439)	1.436813 / 1.504120 (-0.067307)	1.312947 / 1.541195 (-0.228247)	1.331022 / 1.468490 (-0.137468)	0.396579 / 4.584777 (-4.188198)	2.406181 / 3.745712 (-1.339531)	2.597180 / 5.269862 (-2.672682)	1.565879 / 4.565676 (-2.999798)	0.046330 / 0.424275 (-0.377945)	0.004776 / 0.007607 (-0.002831)	0.339681 / 0.226044 (0.113637)	3.279533 / 2.268929 (1.010605)	1.793352 / 55.444624 (-53.651272)	1.493910 / 6.876477 (-5.382567)	1.514494 / 2.142072 (-0.627579)	0.467955 / 4.805227 (-4.337272)	0.097764 / 6.500664 (-6.402900)	0.041659 / 0.075469 (-0.033810)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.943204 / 1.841788 (-0.898583)	11.350848 / 8.074308 (3.276540)	10.169944 / 10.191392 (-0.021448)	0.130882 / 0.680424 (-0.549542)	0.013804 / 0.534201 (-0.520397)	0.269107 / 0.579283 (-0.310177)	0.261685 / 0.434364 (-0.172679)	0.305610 / 0.540337 (-0.234727)	0.430586 / 1.386936 (-0.956350)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004835 / 0.011353 (-0.006518)	0.002530 / 0.011008 (-0.008479)	0.047383 / 0.038508 (0.008875)	0.052559 / 0.023109 (0.029450)	0.265015 / 0.275898 (-0.010883)	0.286955 / 0.323480 (-0.036525)	0.003931 / 0.007986 (-0.004054)	0.002038 / 0.004328 (-0.002290)	0.047458 / 0.004250 (0.043207)	0.038257 / 0.037052 (0.001205)	0.270569 / 0.258489 (0.012080)	0.298968 / 0.293841 (0.005127)	0.024615 / 0.128546 (-0.103932)	0.006969 / 0.075646 (-0.068677)	0.052361 / 0.419271 (-0.366911)	0.032701 / 0.043533 (-0.010832)	0.269126 / 0.255139 (0.013987)	0.285934 / 0.283200 (0.002735)	0.018121 / 0.141683 (-0.123562)	1.129796 / 1.452155 (-0.322359)	1.272831 / 1.492716 (-0.219885)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092058 / 0.018006 (0.074051)	0.303544 / 0.000490 (0.303054)	0.000232 / 0.000200 (0.000032)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.020983 / 0.037411 (-0.016428)	0.069798 / 0.014526 (0.055272)	0.081410 / 0.176557 (-0.095146)	0.120403 / 0.737135 (-0.616732)	0.082813 / 0.296338 (-0.213525)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.295943 / 0.215209 (0.080734)	2.895761 / 2.077655 (0.818106)	1.583534 / 1.504120 (0.079414)	1.458397 / 1.541195 (-0.082798)	1.492113 / 1.468490 (0.023623)	0.402364 / 4.584777 (-4.182413)	2.469777 / 3.745712 (-1.275935)	2.565262 / 5.269862 (-2.704599)	1.525914 / 4.565676 (-3.039763)	0.047168 / 0.424275 (-0.377107)	0.004800 / 0.007607 (-0.002808)	0.348356 / 0.226044 (0.122311)	3.463184 / 2.268929 (1.194255)	1.930240 / 55.444624 (-53.514385)	1.644312 / 6.876477 (-5.232165)	1.625477 / 2.142072 (-0.516596)	0.480781 / 4.805227 (-4.324446)	0.098431 / 6.500664 (-6.402233)	0.041071 / 0.075469 (-0.034398)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.973633 / 1.841788 (-0.868154)	11.952261 / 8.074308 (3.877953)	11.038222 / 10.191392 (0.846830)	0.142755 / 0.680424 (-0.537669)	0.015389 / 0.534201 (-0.518812)	0.274144 / 0.579283 (-0.305139)	0.282319 / 0.434364 (-0.152045)	0.314330 / 0.540337 (-0.226007)	0.435315 / 1.386936 (-0.951621)

albertvillanova · 2023-11-13T14:17:52Z

The red CI job is unrelated to this PR. It appeared 5 days ago. See:

handle future deprecation argument #6390 (review)
CI Build PR Documentation is broken: ImportError: cannot import name 'TypeAliasType' from 'typing_extensions' #6406

lhoestq

Thanks for the fix! Maybe add pyarrow-hotfix as a requirement before merging.

With this change it won't be possible to load old datasets with ArrayND types saved as Parquet or Arrow anymore, but I don't think we can do anything to avoid that.

lhoestq · 2023-11-13T14:17:52Z

setup.py

-    # Minimum 8.0.0 to be able to use .to_reader()
-    "pyarrow>=8.0.0",
+    # Minimum 14.0.1 to fix vulnerability CVE-2023-47248
+    "pyarrow>=9.0.0",  # TODO: maximum version allowed by Apache Beam


We can require pyarrow-hotfix<1 and import pyarrow_hotfix to fix CVE-2023-47248 without requiring pyarrow 14.0.1.
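
A rough sketch of that suggestion, not the actual change made in this PR. The requirement list and the helper name `apply_pyarrow_hotfix` are hypothetical. The only assumption about `pyarrow-hotfix` is that importing `pyarrow_hotfix` applies its patch as an import side effect, as the package describes.

```python
import importlib.util

# Hypothetical excerpt of a requirements list: keep a low pyarrow floor and
# add the hotfix package instead of forcing pyarrow>=14.0.1.
REQUIRED_PKGS = [
    "pyarrow>=8.0.0",
    "pyarrow-hotfix<1",
]


def apply_pyarrow_hotfix() -> bool:
    """Import pyarrow_hotfix if it is installed; return whether it was imported."""
    if importlib.util.find_spec("pyarrow_hotfix") is None:
        return False  # package missing: nothing was patched
    import pyarrow_hotfix  # noqa: F401  (imported only for its side effect)

    return True


hotfix_applied = apply_pyarrow_hotfix()
```

In a library the import would normally sit at package import time, so the patch is in place before any Arrow or Parquet file is read.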

lhoestq · 2023-11-13T14:19:07Z

.github/workflows/ci.yml

        if: ${{ matrix.deps_versions != 'deps-latest' }}
-        run: pip install pyarrow==8.0.0 huggingface-hub==0.18.0 transformers dill==0.3.1.1
+        run: pip install pyarrow==14.0.1 huggingface-hub==0.18.0 transformers dill==0.3.1.1


Suggested change

-        run: pip install pyarrow==14.0.1 huggingface-hub==0.18.0 transformers dill==0.3.1.1
+        run: pip install pyarrow==9.0.0 huggingface-hub==0.18.0 transformers dill==0.3.1.1

lhoestq · 2023-11-13T16:40:13Z

Let's do a new release once this is merged? cc @mariosasko as well: let us know if the fix sounds good to you.

mariosasko · 2023-11-13T17:58:37Z

@lhoestq Yes, this sounds good to me!

github-actions · 2023-11-14T09:00:28Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004932 / 0.011353 (-0.006421)	0.002956 / 0.011008 (-0.008052)	0.061999 / 0.038508 (0.023491)	0.030174 / 0.023109 (0.007065)	0.241483 / 0.275898 (-0.034415)	0.261578 / 0.323480 (-0.061902)	0.002881 / 0.007986 (-0.005105)	0.002451 / 0.004328 (-0.001878)	0.048176 / 0.004250 (0.043925)	0.045028 / 0.037052 (0.007976)	0.244304 / 0.258489 (-0.014185)	0.275834 / 0.293841 (-0.018007)	0.023312 / 0.128546 (-0.105234)	0.007361 / 0.075646 (-0.068286)	0.204433 / 0.419271 (-0.214838)	0.054561 / 0.043533 (0.011028)	0.236902 / 0.255139 (-0.018237)	0.269358 / 0.283200 (-0.013842)	0.017736 / 0.141683 (-0.123947)	1.112444 / 1.452155 (-0.339711)	1.170260 / 1.492716 (-0.322456)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093081 / 0.018006 (0.075074)	0.311470 / 0.000490 (0.310981)	0.000212 / 0.000200 (0.000013)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018654 / 0.037411 (-0.018757)	0.063239 / 0.014526 (0.048714)	0.073759 / 0.176557 (-0.102798)	0.120279 / 0.737135 (-0.616857)	0.076214 / 0.296338 (-0.220124)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.287219 / 0.215209 (0.072010)	2.765378 / 2.077655 (0.687723)	1.459733 / 1.504120 (-0.044387)	1.325999 / 1.541195 (-0.215196)	1.349957 / 1.468490 (-0.118533)	0.413093 / 4.584777 (-4.171684)	2.394758 / 3.745712 (-1.350954)	2.633916 / 5.269862 (-2.635945)	1.621629 / 4.565676 (-2.944047)	0.046839 / 0.424275 (-0.377436)	0.004786 / 0.007607 (-0.002822)	0.336261 / 0.226044 (0.110217)	3.348196 / 2.268929 (1.079267)	1.853050 / 55.444624 (-53.591574)	1.543926 / 6.876477 (-5.332551)	1.573675 / 2.142072 (-0.568398)	0.484088 / 4.805227 (-4.321139)	0.100820 / 6.500664 (-6.399845)	0.042194 / 0.075469 (-0.033275)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.945186 / 1.841788 (-0.896601)	11.859855 / 8.074308 (3.785547)	10.459883 / 10.191392 (0.268491)	0.142024 / 0.680424 (-0.538400)	0.013882 / 0.534201 (-0.520319)	0.269584 / 0.579283 (-0.309699)	0.264353 / 0.434364 (-0.170011)	0.307988 / 0.540337 (-0.232349)	0.423655 / 1.386936 (-0.963281)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004891 / 0.011353 (-0.006461)	0.003087 / 0.011008 (-0.007921)	0.048206 / 0.038508 (0.009697)	0.058570 / 0.023109 (0.035461)	0.268552 / 0.275898 (-0.007346)	0.287839 / 0.323480 (-0.035641)	0.004044 / 0.007986 (-0.003942)	0.002388 / 0.004328 (-0.001940)	0.048186 / 0.004250 (0.043935)	0.038719 / 0.037052 (0.001667)	0.271940 / 0.258489 (0.013451)	0.299716 / 0.293841 (0.005875)	0.027166 / 0.128546 (-0.101380)	0.007388 / 0.075646 (-0.068258)	0.053885 / 0.419271 (-0.365387)	0.032804 / 0.043533 (-0.010729)	0.271664 / 0.255139 (0.016525)	0.284613 / 0.283200 (0.001414)	0.018488 / 0.141683 (-0.123195)	1.125854 / 1.452155 (-0.326301)	1.195896 / 1.492716 (-0.296820)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092438 / 0.018006 (0.074431)	0.315265 / 0.000490 (0.314775)	0.000228 / 0.000200 (0.000028)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021373 / 0.037411 (-0.016038)	0.070611 / 0.014526 (0.056085)	0.080391 / 0.176557 (-0.096165)	0.118749 / 0.737135 (-0.618386)	0.082340 / 0.296338 (-0.213999)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.295583 / 0.215209 (0.080374)	2.882152 / 2.077655 (0.804497)	1.565088 / 1.504120 (0.060968)	1.451954 / 1.541195 (-0.089241)	1.505783 / 1.468490 (0.037293)	0.404699 / 4.584777 (-4.180078)	2.451703 / 3.745712 (-1.294009)	2.596301 / 5.269862 (-2.673560)	1.547014 / 4.565676 (-3.018662)	0.047750 / 0.424275 (-0.376525)	0.004850 / 0.007607 (-0.002757)	0.346893 / 0.226044 (0.120849)	3.383355 / 2.268929 (1.114426)	1.943933 / 55.444624 (-53.500692)	1.657513 / 6.876477 (-5.218964)	1.687166 / 2.142072 (-0.454906)	0.478543 / 4.805227 (-4.326685)	0.097804 / 6.500664 (-6.402860)	0.041392 / 0.075469 (-0.034078)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.983894 / 1.841788 (-0.857893)	12.446443 / 8.074308 (4.372135)	10.973461 / 10.191392 (0.782069)	0.131630 / 0.680424 (-0.548794)	0.017196 / 0.534201 (-0.517005)	0.270873 / 0.579283 (-0.308411)	0.284379 / 0.434364 (-0.149985)	0.306103 / 0.540337 (-0.234234)	0.413762 / 1.386936 (-0.973174)

albertvillanova · 2023-11-14T09:24:03Z

Note I had to add pa.ExtensionType.__reduce__ because this is used by copy.deepcopy when using .with_format. See error below.

This method was added in pyarrow-13.0.0: apache/arrow#36170

We need to re-implement it as long as we support lower pyarrow versions.
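
To illustrate the mechanism only: `copy.deepcopy` falls back to `__reduce_ex__`, which calls a user-defined `__reduce__` and then rebuilds the object from the returned `(callable, args)` pair. The class below is a pure-Python stand-in with hypothetical names. It is not a pyarrow type and not the code added in this PR.

```python
import copy


class ArrayTypeStandIn:
    """Pure-Python stand-in for a parameterized extension type (hypothetical)."""

    def __init__(self, shape, dtype):
        self.shape = tuple(shape)
        self.dtype = dtype

    def __reduce__(self):
        # Tell copy/pickle to rebuild the object by calling the constructor
        # again with the parameters that define the type.
        return self.__class__, (self.shape, self.dtype)

    def __eq__(self, other):
        return (
            isinstance(other, ArrayTypeStandIn)
            and (self.shape, self.dtype) == (other.shape, other.dtype)
        )


original = ArrayTypeStandIn(shape=(3, 3, 3), dtype="int32")
# deepcopy has no dedicated copier for this class, so it goes through
# __reduce_ex__ -> __reduce__ and calls ArrayTypeStandIn((3, 3, 3), "int32").
clone = copy.deepcopy(original)
```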

Errors: https://github.com/huggingface/datasets/actions/runs/6861278161/job/18656665772

 ____________________________ test_dataset_map[True] ____________________________
[gw1] linux -- Python 3.8.18 /opt/hostedtoolcache/Python/3.8.18/x64/bin/python

>   ???
E   KeyError: 'extension<datasets.features.features.array3dextensiontype<array3dextensiontype>>'

pyarrow/types.pxi:3155: KeyError

During handling of the above exception, another exception occurred:

with_none = True

    @pytest.mark.parametrize("with_none", [False, True])
    def test_dataset_map(with_none):
        ds = datasets.Dataset.from_dict({"path": ["path1", "path2"]})
    
        def process_data(batch):
            batch = {
                "image": [
                    np.array(
                        [
                            [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                            [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
                            [[100, 200, 300], [400, 500, 600], [700, 800, 900]],
                        ]
                    )
                    for _ in batch["path"]
                ]
            }
            if with_none:
                batch["image"][0] = None
            return batch
    
        features = datasets.Features({"image": Array3D(dtype="int32", shape=(3, 3, 3))})
        processed_ds = ds.map(process_data, batched=True, remove_columns=ds.column_names, features=features)
        assert processed_ds.shape == (2, 1)
>       with processed_ds.with_format("numpy") as pds:

tests/features/test_array_xd.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/arrow_dataset.py:2669: in with_format
    dataset = copy.deepcopy(self)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:270: in _reconstruct
    state = deepcopy(state, memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:146: in deepcopy
    y = copier(x, memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:230: in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:153: in deepcopy
    y = copier(memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/table.py:188: in __deepcopy__
    return _deepcopy(self, memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/datasets/table.py:86: in _deepcopy
    setattr(result, k, copy.deepcopy(v, memo))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:264: in _reconstruct
    y = func(*args)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:263: in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:146: in deepcopy
    y = copier(x, memo)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:205: in _deepcopy_list
    append(deepcopy(a, memo))
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:264: in _reconstruct
    y = func(*args)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:263: in <genexpr>
    args = (deepcopy(arg, memo) for arg in args)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:172: in deepcopy
    y = _reconstruct(x, memo, *rv)
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/copy.py:264: in _reconstruct
    y = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ValueError: No type alias for extension<datasets.features.features.array3dextensiontype<array3dextensiontype>>

pyarrow/types.pxi:3157: ValueError

=========================== short test summary info ============================
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_class_encode_column_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_dummy_dataset_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_tf_dataset_conversion_in_memory - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_tf_dataset_conversion_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_tf_dataset_options_in_memory - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_tf_dataset_options_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_to_csv_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_to_parquet_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::BaseDatasetTest::test_to_sql_on_disk - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::test_map_cases[True] - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::test_map_cases[False] - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/test_arrow_dataset.py::test_map_cases[mix] - ValueError: No type alias for extension<datasets.features.features.array2dextensiontype<array2dextensiontype>>
FAILED tests/features/test_array_xd.py::ArrayXDDynamicTest::test_map_dataset - ValueError: No type alias for extension<datasets.features.features.array3dextensiontype<array3dextensiontype>>
FAILED tests/features/test_array_xd.py::test_dataset_map[False] - ValueError: No type alias for extension<datasets.features.features.array3dextensiontype<array3dextensiontype>>
FAILED tests/features/test_array_xd.py::test_dataset_map[True] - ValueError: No type alias for extension<datasets.features.features.array3dextensiontype<array3dextensiontype>>
===== 15 failed,

github-actions · 2023-11-14T09:25:30Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007338 / 0.011353 (-0.004015)	0.004308 / 0.011008 (-0.006700)	0.088788 / 0.038508 (0.050280)	0.039369 / 0.023109 (0.016260)	0.334527 / 0.275898 (0.058629)	0.373748 / 0.323480 (0.050268)	0.005550 / 0.007986 (-0.002435)	0.003606 / 0.004328 (-0.000723)	0.072238 / 0.004250 (0.067988)	0.061271 / 0.037052 (0.024218)	0.336333 / 0.258489 (0.077844)	0.398256 / 0.293841 (0.104415)	0.041941 / 0.128546 (-0.086605)	0.013372 / 0.075646 (-0.062274)	0.336221 / 0.419271 (-0.083050)	0.083013 / 0.043533 (0.039480)	0.334743 / 0.255139 (0.079604)	0.362572 / 0.283200 (0.079373)	0.031161 / 0.141683 (-0.110521)	1.563441 / 1.452155 (0.111287)	1.704059 / 1.492716 (0.211343)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252978 / 0.018006 (0.234972)	0.506348 / 0.000490 (0.505859)	0.011679 / 0.000200 (0.011479)	0.000104 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026257 / 0.037411 (-0.011154)	0.085936 / 0.014526 (0.071410)	0.098542 / 0.176557 (-0.078015)	0.154507 / 0.737135 (-0.582628)	0.111493 / 0.296338 (-0.184845)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.575941 / 0.215209 (0.360732)	5.590230 / 2.077655 (3.512576)	2.463330 / 1.504120 (0.959211)	2.125760 / 1.541195 (0.584565)	2.095933 / 1.468490 (0.627443)	0.844768 / 4.584777 (-3.740009)	4.768995 / 3.745712 (1.023282)	4.670484 / 5.269862 (-0.599377)	2.630386 / 4.565676 (-1.935290)	0.085996 / 0.424275 (-0.338279)	0.007900 / 0.007607 (0.000293)	0.685463 / 0.226044 (0.459419)	6.699310 / 2.268929 (4.430381)	3.132542 / 55.444624 (-52.312083)	2.527963 / 6.876477 (-4.348513)	2.381835 / 2.142072 (0.239763)	0.909668 / 4.805227 (-3.895559)	0.209979 / 6.500664 (-6.290685)	0.079222 / 0.075469 (0.003753)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.444895 / 1.841788 (-0.396892)	20.388140 / 8.074308 (12.313832)	19.354148 / 10.191392 (9.162756)	0.222433 / 0.680424 (-0.457991)	0.029710 / 0.534201 (-0.504491)	0.427153 / 0.579283 (-0.152130)	0.537500 / 0.434364 (0.103136)	0.506917 / 0.540337 (-0.033421)	0.726088 / 1.386936 (-0.660848)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007652 / 0.011353 (-0.003701)	0.004320 / 0.011008 (-0.006688)	0.072721 / 0.038508 (0.034212)	0.068204 / 0.023109 (0.045095)	0.392087 / 0.275898 (0.116189)	0.431638 / 0.323480 (0.108158)	0.005419 / 0.007986 (-0.002566)	0.004305 / 0.004328 (-0.000023)	0.069042 / 0.004250 (0.064791)	0.051555 / 0.037052 (0.014503)	0.412141 / 0.258489 (0.153651)	0.438802 / 0.293841 (0.144961)	0.043631 / 0.128546 (-0.084915)	0.014169 / 0.075646 (-0.061478)	0.079571 / 0.419271 (-0.339701)	0.056707 / 0.043533 (0.013174)	0.413698 / 0.255139 (0.158559)	0.414127 / 0.283200 (0.130928)	0.031380 / 0.141683 (-0.110303)	1.677157 / 1.452155 (0.225003)	1.755155 / 1.492716 (0.262439)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.257236 / 0.018006 (0.239230)	0.521347 / 0.000490 (0.520858)	0.006282 / 0.000200 (0.006082)	0.000139 / 0.000054 (0.000085)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028433 / 0.037411 (-0.008978)	0.087698 / 0.014526 (0.073172)	0.108840 / 0.176557 (-0.067716)	0.157432 / 0.737135 (-0.579704)	0.103144 / 0.296338 (-0.193195)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.598745 / 0.215209 (0.383536)	5.981460 / 2.077655 (3.903805)	2.556931 / 1.504120 (1.052811)	2.179915 / 1.541195 (0.638720)	2.240841 / 1.468490 (0.772351)	0.811501 / 4.584777 (-3.773276)	4.718282 / 3.745712 (0.972570)	4.365738 / 5.269862 (-0.904124)	2.669798 / 4.565676 (-1.895878)	0.099135 / 0.424275 (-0.325140)	0.007369 / 0.007607 (-0.000238)	0.669491 / 0.226044 (0.443447)	6.700389 / 2.268929 (4.431461)	3.155328 / 55.444624 (-52.289296)	2.563375 / 6.876477 (-4.313102)	2.545191 / 2.142072 (0.403119)	0.961359 / 4.805227 (-3.843868)	0.189391 / 6.500664 (-6.311273)	0.061597 / 0.075469 (-0.013873)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.564008 / 1.841788 (-0.277780)	21.401307 / 8.074308 (13.326999)	20.693441 / 10.191392 (10.502049)	0.229340 / 0.680424 (-0.451084)	0.033637 / 0.534201 (-0.500564)	0.429394 / 0.579283 (-0.149889)	0.557202 / 0.434364 (0.122838)	0.510284 / 0.540337 (-0.030054)	0.725661 / 1.386936 (-0.661276)
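Each cell in the benchmark tables above has the form `new / old (diff)`, and the printed numbers are consistent with `diff = new - old` up to rounding in the last digit. The sketch below parses one metric row and its value row into a dictionary. It assumes the columns are tab-separated, as they appear in this page; the helper names `parse_cell` and `parse_row` are hypothetical and not part of the benchmark tooling.

```python
import re

# One table cell, e.g. "0.026257 / 0.037411 (-0.011154)".
CELL_PATTERN = re.compile(r"(-?\d+\.\d+) / (-?\d+\.\d+) \((-?\d+\.\d+)\)")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one 'new / old (diff)' cell."""
    match = CELL_PATTERN.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(metric_line, value_line):
    """Pair metric names with parsed cells; the first column is the row label."""
    names = metric_line.split("\t")[1:]
    cells = value_line.split("\t")[1:]
    if len(names) != len(cells):
        raise ValueError("metric and value rows have different lengths")
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


# Row copied from the benchmark_indices_mapping.json table (PyArrow==8.0.0).
metric_line = "metric\tselect\tshard\tshuffle\tsort\ttrain_test_split"
value_line = (
    "new / old (diff)\t0.026257 / 0.037411 (-0.011154)\t"
    "0.085936 / 0.014526 (0.071410)\t0.098542 / 0.176557 (-0.078015)\t"
    "0.154507 / 0.737135 (-0.582628)\t0.111493 / 0.296338 (-0.184845)"
)
indices_mapping = parse_row(metric_line, value_line)
```

A negative `diff` means the new measurement is smaller than the old one for that metric.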

HuggingFaceDocBuilderDev · 2023-11-14T10:11:41Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-11-14T10:12:34Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004820 / 0.011353 (-0.006533)	0.003152 / 0.011008 (-0.007856)	0.061842 / 0.038508 (0.023334)	0.030127 / 0.023109 (0.007018)	0.257409 / 0.275898 (-0.018489)	0.269382 / 0.323480 (-0.054097)	0.004288 / 0.007986 (-0.003698)	0.002500 / 0.004328 (-0.001829)	0.048520 / 0.004250 (0.044270)	0.046815 / 0.037052 (0.009763)	0.245858 / 0.258489 (-0.012631)	0.289636 / 0.293841 (-0.004205)	0.023983 / 0.128546 (-0.104563)	0.007336 / 0.075646 (-0.068310)	0.202347 / 0.419271 (-0.216924)	0.057737 / 0.043533 (0.014204)	0.245922 / 0.255139 (-0.009217)	0.268788 / 0.283200 (-0.014412)	0.017819 / 0.141683 (-0.123864)	1.149889 / 1.452155 (-0.302265)	1.227192 / 1.492716 (-0.265524)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092234 / 0.018006 (0.074228)	0.310259 / 0.000490 (0.309769)	0.000223 / 0.000200 (0.000023)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019059 / 0.037411 (-0.018352)	0.064904 / 0.014526 (0.050378)	0.073531 / 0.176557 (-0.103026)	0.120879 / 0.737135 (-0.616257)	0.075410 / 0.296338 (-0.220929)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.275364 / 0.215209 (0.060155)	2.724379 / 2.077655 (0.646725)	1.447617 / 1.504120 (-0.056503)	1.366794 / 1.541195 (-0.174401)	1.345849 / 1.468490 (-0.122641)	0.411205 / 4.584777 (-4.173572)	2.412712 / 3.745712 (-1.333000)	2.612469 / 5.269862 (-2.657393)	1.552113 / 4.565676 (-3.013564)	0.045783 / 0.424275 (-0.378492)	0.004782 / 0.007607 (-0.002825)	0.339218 / 0.226044 (0.113174)	3.359540 / 2.268929 (1.090612)	1.821369 / 55.444624 (-53.623256)	1.540742 / 6.876477 (-5.335734)	1.531845 / 2.142072 (-0.610227)	0.462009 / 4.805227 (-4.343218)	0.097794 / 6.500664 (-6.402870)	0.041222 / 0.075469 (-0.034247)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.938319 / 1.841788 (-0.903469)	11.712003 / 8.074308 (3.637695)	10.325317 / 10.191392 (0.133925)	0.126812 / 0.680424 (-0.553612)	0.013734 / 0.534201 (-0.520467)	0.279509 / 0.579283 (-0.299774)	0.269265 / 0.434364 (-0.165099)	0.322033 / 0.540337 (-0.218304)	0.441610 / 1.386936 (-0.945326)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004882 / 0.011353 (-0.006471)	0.002984 / 0.011008 (-0.008024)	0.048318 / 0.038508 (0.009810)	0.054642 / 0.023109 (0.031533)	0.268599 / 0.275898 (-0.007299)	0.292916 / 0.323480 (-0.030564)	0.004108 / 0.007986 (-0.003878)	0.002500 / 0.004328 (-0.001829)	0.048452 / 0.004250 (0.044202)	0.038835 / 0.037052 (0.001782)	0.275410 / 0.258489 (0.016921)	0.307284 / 0.293841 (0.013443)	0.024720 / 0.128546 (-0.103826)	0.007274 / 0.075646 (-0.068372)	0.054419 / 0.419271 (-0.364853)	0.032815 / 0.043533 (-0.010718)	0.273660 / 0.255139 (0.018521)	0.289183 / 0.283200 (0.005984)	0.017746 / 0.141683 (-0.123937)	1.153876 / 1.452155 (-0.298278)	1.212778 / 1.492716 (-0.279938)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.095286 / 0.018006 (0.077280)	0.305185 / 0.000490 (0.304696)	0.000230 / 0.000200 (0.000030)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021556 / 0.037411 (-0.015855)	0.071029 / 0.014526 (0.056503)	0.081914 / 0.176557 (-0.094643)	0.120553 / 0.737135 (-0.616582)	0.086696 / 0.296338 (-0.209642)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.289750 / 0.215209 (0.074541)	2.794247 / 2.077655 (0.716592)	1.577105 / 1.504120 (0.072985)	1.457706 / 1.541195 (-0.083489)	1.500481 / 1.468490 (0.031991)	0.403834 / 4.584777 (-4.180943)	2.466810 / 3.745712 (-1.278902)	2.701008 / 5.269862 (-2.568854)	1.634821 / 4.565676 (-2.930856)	0.046954 / 0.424275 (-0.377322)	0.004811 / 0.007607 (-0.002796)	0.347622 / 0.226044 (0.121578)	3.407125 / 2.268929 (1.138197)	1.987121 / 55.444624 (-53.457504)	1.689978 / 6.876477 (-5.186499)	1.731801 / 2.142072 (-0.410271)	0.478926 / 4.805227 (-4.326301)	0.100730 / 6.500664 (-6.399934)	0.043078 / 0.075469 (-0.032391)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.963575 / 1.841788 (-0.878212)	12.675331 / 8.074308 (4.601023)	11.167584 / 10.191392 (0.976192)	0.131199 / 0.680424 (-0.549225)	0.016030 / 0.534201 (-0.518171)	0.277783 / 0.579283 (-0.301500)	0.278693 / 0.434364 (-0.155671)	0.315141 / 0.540337 (-0.225196)	0.429104 / 1.386936 (-0.957832)

github-actions · 2023-11-14T10:29:47Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004807 / 0.011353 (-0.006546)	0.002925 / 0.011008 (-0.008083)	0.062560 / 0.038508 (0.024052)	0.029926 / 0.023109 (0.006817)	0.264708 / 0.275898 (-0.011190)	0.273464 / 0.323480 (-0.050016)	0.003197 / 0.007986 (-0.004788)	0.002544 / 0.004328 (-0.001784)	0.048230 / 0.004250 (0.043980)	0.046552 / 0.037052 (0.009500)	0.249553 / 0.258489 (-0.008936)	0.282078 / 0.293841 (-0.011762)	0.023201 / 0.128546 (-0.105346)	0.007306 / 0.075646 (-0.068340)	0.241361 / 0.419271 (-0.177910)	0.058286 / 0.043533 (0.014753)	0.245854 / 0.255139 (-0.009285)	0.266053 / 0.283200 (-0.017146)	0.020294 / 0.141683 (-0.121388)	1.102215 / 1.452155 (-0.349939)	1.170733 / 1.492716 (-0.321984)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.094647 / 0.018006 (0.076641)	0.303819 / 0.000490 (0.303329)	0.000250 / 0.000200 (0.000050)	0.000055 / 0.000054 (0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019036 / 0.037411 (-0.018375)	0.064729 / 0.014526 (0.050203)	0.074143 / 0.176557 (-0.102414)	0.120082 / 0.737135 (-0.617054)	0.076835 / 0.296338 (-0.219503)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.283786 / 0.215209 (0.068577)	2.751446 / 2.077655 (0.673791)	1.473789 / 1.504120 (-0.030331)	1.336968 / 1.541195 (-0.204226)	1.384148 / 1.468490 (-0.084342)	0.397452 / 4.584777 (-4.187325)	2.388042 / 3.745712 (-1.357670)	2.661291 / 5.269862 (-2.608571)	1.595454 / 4.565676 (-2.970223)	0.045919 / 0.424275 (-0.378356)	0.004879 / 0.007607 (-0.002728)	0.337862 / 0.226044 (0.111818)	3.355665 / 2.268929 (1.086737)	1.875261 / 55.444624 (-53.569363)	1.540874 / 6.876477 (-5.335603)	1.653632 / 2.142072 (-0.488440)	0.473090 / 4.805227 (-4.332138)	0.100151 / 6.500664 (-6.400513)	0.042357 / 0.075469 (-0.033112)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.959550 / 1.841788 (-0.882238)	12.307145 / 8.074308 (4.232837)	10.719321 / 10.191392 (0.527929)	0.128376 / 0.680424 (-0.552048)	0.014406 / 0.534201 (-0.519795)	0.295208 / 0.579283 (-0.284075)	0.268891 / 0.434364 (-0.165473)	0.305446 / 0.540337 (-0.234892)	0.429591 / 1.386936 (-0.957345)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005189 / 0.011353 (-0.006164)	0.003082 / 0.011008 (-0.007926)	0.048956 / 0.038508 (0.010448)	0.063403 / 0.023109 (0.040294)	0.272858 / 0.275898 (-0.003040)	0.295207 / 0.323480 (-0.028273)	0.004253 / 0.007986 (-0.003733)	0.002552 / 0.004328 (-0.001776)	0.048042 / 0.004250 (0.043792)	0.040429 / 0.037052 (0.003377)	0.269614 / 0.258489 (0.011125)	0.307205 / 0.293841 (0.013364)	0.027912 / 0.128546 (-0.100634)	0.007621 / 0.075646 (-0.068026)	0.054020 / 0.419271 (-0.365251)	0.036958 / 0.043533 (-0.006574)	0.272457 / 0.255139 (0.017318)	0.287966 / 0.283200 (0.004766)	0.019542 / 0.141683 (-0.122141)	1.116742 / 1.452155 (-0.335413)	1.194739 / 1.492716 (-0.297977)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093532 / 0.018006 (0.075526)	0.303262 / 0.000490 (0.302773)	0.000217 / 0.000200 (0.000017)	0.000042 / 0.000054 (-0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021984 / 0.037411 (-0.015428)	0.075024 / 0.014526 (0.060498)	0.080959 / 0.176557 (-0.095598)	0.121780 / 0.737135 (-0.615356)	0.082817 / 0.296338 (-0.213522)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.292766 / 0.215209 (0.077557)	2.857457 / 2.077655 (0.779802)	1.621860 / 1.504120 (0.117740)	1.473783 / 1.541195 (-0.067412)	1.535211 / 1.468490 (0.066721)	0.402212 / 4.584777 (-4.182565)	2.467143 / 3.745712 (-1.278569)	2.618162 / 5.269862 (-2.651700)	1.568682 / 4.565676 (-2.996994)	0.047123 / 0.424275 (-0.377152)	0.004780 / 0.007607 (-0.002827)	0.346959 / 0.226044 (0.120914)	3.395196 / 2.268929 (1.126268)	1.957835 / 55.444624 (-53.486789)	1.674287 / 6.876477 (-5.202190)	1.715879 / 2.142072 (-0.426193)	0.479481 / 4.805227 (-4.325746)	0.100043 / 6.500664 (-6.400621)	0.041289 / 0.075469 (-0.034180)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.965418 / 1.841788 (-0.876370)	12.703830 / 8.074308 (4.629522)	11.301401 / 10.191392 (1.110009)	0.131429 / 0.680424 (-0.548995)	0.016597 / 0.534201 (-0.517604)	0.273290 / 0.579283 (-0.305993)	0.285400 / 0.434364 (-0.148964)	0.307327 / 0.540337 (-0.233011)	0.434186 / 1.386936 (-0.952750)

* Replace pa.PyExtensionType with pa.ExtensionType
* Register user-defined extension types
* Pin minimum pyarrow version to 14.0.1
* Temporarily pin minimum pyarrow due to beam constraint
* Remove constraint on pyarrow by removing unneeded upper beam version
* Reset pyarrow minimum due to apache-beam constraint
* Revert last 2 commits
* Revert minimum pyarrow version and use pyarrow-hotfix
* Add pa.ExtensionType.__reduce__

Merge in fix from huggingface#6404

albertvillanova added 2 commits November 13, 2023 10:12

Replace pa.PyExtensionType with pa.ExtensionType

7d785f1

Register user-defined extension types

d54b645

Pin minimum pyarrow version to 14.0.1

04a3f00

Temporarily pin minimum pyarrow due to beam constraint

98871b9

albertvillanova added 2 commits November 13, 2023 11:32

Remove constraint on pyarrow by removing unneeded upper beam version

aecdc94

Reset pyarrow minimum due to apache-beam constraint

998623f

Revert last 2 commits

05200c0

lhoestq approved these changes Nov 13, 2023

View reviewed changes

Revert minimum pyarrow version and use pyarrow-hotfix

980ad4c

Add pa.ExtensionType.__reduce__

45abe29

Merge remote-tracking branch 'upstream/main' into fix-6396

825c1d2

albertvillanova changed the title ~~Support pyarrow 14.0.1~~ Support pyarrow 14.0.1 and fix vulnerability Nov 14, 2023

albertvillanova changed the title ~~Support pyarrow 14.0.1 and fix vulnerability~~ Support pyarrow 14.0.1 and fix vulnerability CVE-2023-47248 Nov 14, 2023

albertvillanova merged commit c096bd2 into main Nov 14, 2023
13 checks passed

albertvillanova deleted the fix-6396 branch November 14, 2023 10:23

YQ-Wang pushed a commit to YQ-Wang/datasets that referenced this pull request Dec 5, 2023

merge in fix from huggingface#6404

976f3e4

YQ-Wang mentioned this pull request Dec 5, 2023

Merge in fix from https://github.com/huggingface/datasets/pull/6404 instabase/datasets#1

Merged

YQ-Wang added a commit to instabase/datasets that referenced this pull request Dec 5, 2023

Merge pull request #1 from YQ-Wang/1.15.1-fix

c3c45db

Merge in fix from huggingface#6404

daskol mentioned this pull request Jun 13, 2024

packaging: Remove useless dependencies #6971

Merged

	run: pip install pyarrow==14.0.1 huggingface-hub==0.18.0 transformers dill==0.3.1.1
	run: pip install pyarrow==9.0.0 huggingface-hub==0.18.0 transformers dill==0.3.1.1
