Misc improvements #6004

mariosasko · 2023-07-03T18:29:14Z

Contains the following improvements:

- fixes a "share dataset" link in README and modifies the "hosting" part in the disclaimer section
- updates Makefile to also run the style checks on utils and setup.py
- deletes a test for GH-hosted datasets (no longer supported)
- deletes convert_dataset.sh (outdated)
- aligns utils/release.py with transformers (the current version is outdated)

github-actions · 2023-07-03T18:29:33Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006897 / 0.011353 (-0.004456)	0.004207 / 0.011008 (-0.006802)	0.104828 / 0.038508 (0.066320)	0.048054 / 0.023109 (0.024945)	0.373991 / 0.275898 (0.098093)	0.426740 / 0.323480 (0.103260)	0.005540 / 0.007986 (-0.002446)	0.003531 / 0.004328 (-0.000797)	0.079304 / 0.004250 (0.075053)	0.066996 / 0.037052 (0.029944)	0.370675 / 0.258489 (0.112186)	0.414154 / 0.293841 (0.120313)	0.031567 / 0.128546 (-0.096979)	0.008843 / 0.075646 (-0.066803)	0.357426 / 0.419271 (-0.061845)	0.067040 / 0.043533 (0.023508)	0.362384 / 0.255139 (0.107245)	0.376056 / 0.283200 (0.092856)	0.032985 / 0.141683 (-0.108697)	1.560603 / 1.452155 (0.108448)	1.619024 / 1.492716 (0.126308)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229059 / 0.018006 (0.211053)	0.440513 / 0.000490 (0.440023)	0.004647 / 0.000200 (0.004447)	0.000085 / 0.000054 (0.000030)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029517 / 0.037411 (-0.007894)	0.120974 / 0.014526 (0.106448)	0.125070 / 0.176557 (-0.051486)	0.184695 / 0.737135 (-0.552441)	0.130244 / 0.296338 (-0.166095)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.436930 / 0.215209 (0.221721)	4.356118 / 2.077655 (2.278463)	2.049169 / 1.504120 (0.545049)	1.842898 / 1.541195 (0.301703)	1.918948 / 1.468490 (0.450458)	0.553573 / 4.584777 (-4.031204)	3.883195 / 3.745712 (0.137483)	3.209780 / 5.269862 (-2.060081)	1.551707 / 4.565676 (-3.013970)	0.068181 / 0.424275 (-0.356094)	0.012370 / 0.007607 (0.004762)	0.539899 / 0.226044 (0.313854)	5.380008 / 2.268929 (3.111079)	2.518178 / 55.444624 (-52.926446)	2.174190 / 6.876477 (-4.702286)	2.317812 / 2.142072 (0.175740)	0.674154 / 4.805227 (-4.131073)	0.149313 / 6.500664 (-6.351351)	0.068297 / 0.075469 (-0.007172)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.261426 / 1.841788 (-0.580362)	15.316378 / 8.074308 (7.242070)	13.573512 / 10.191392 (3.382120)	0.190022 / 0.680424 (-0.490401)	0.018697 / 0.534201 (-0.515504)	0.448122 / 0.579283 (-0.131161)	0.435044 / 0.434364 (0.000681)	0.550065 / 0.540337 (0.009728)	0.653547 / 1.386936 (-0.733389)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007116 / 0.011353 (-0.004237)	0.004375 / 0.011008 (-0.006633)	0.081793 / 0.038508 (0.043285)	0.047980 / 0.023109 (0.024871)	0.392185 / 0.275898 (0.116287)	0.462263 / 0.323480 (0.138783)	0.005574 / 0.007986 (-0.002412)	0.003552 / 0.004328 (-0.000776)	0.080413 / 0.004250 (0.076162)	0.065539 / 0.037052 (0.028487)	0.413137 / 0.258489 (0.154648)	0.467377 / 0.293841 (0.173536)	0.034386 / 0.128546 (-0.094160)	0.009183 / 0.075646 (-0.066464)	0.087542 / 0.419271 (-0.331730)	0.053954 / 0.043533 (0.010421)	0.385096 / 0.255139 (0.129957)	0.404900 / 0.283200 (0.121701)	0.025908 / 0.141683 (-0.115775)	1.550159 / 1.452155 (0.098005)	1.598794 / 1.492716 (0.106078)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.246222 / 0.018006 (0.228216)	0.441095 / 0.000490 (0.440605)	0.006863 / 0.000200 (0.006663)	0.000109 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032179 / 0.037411 (-0.005233)	0.120112 / 0.014526 (0.105586)	0.129326 / 0.176557 (-0.047230)	0.184542 / 0.737135 (-0.552593)	0.135038 / 0.296338 (-0.161300)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.459002 / 0.215209 (0.243793)	4.580258 / 2.077655 (2.502604)	2.296689 / 1.504120 (0.792569)	2.104338 / 1.541195 (0.563143)	2.182896 / 1.468490 (0.714406)	0.546447 / 4.584777 (-4.038330)	3.854047 / 3.745712 (0.108335)	1.873829 / 5.269862 (-3.396032)	1.116484 / 4.565676 (-3.449193)	0.067158 / 0.424275 (-0.357117)	0.012035 / 0.007607 (0.004428)	0.556642 / 0.226044 (0.330597)	5.574436 / 2.268929 (3.305508)	2.828223 / 55.444624 (-52.616402)	2.519851 / 6.876477 (-4.356626)	2.668594 / 2.142072 (0.526521)	0.675989 / 4.805227 (-4.129238)	0.146075 / 6.500664 (-6.354589)	0.067788 / 0.075469 (-0.007681)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.345958 / 1.841788 (-0.495830)	15.672748 / 8.074308 (7.598440)	14.937583 / 10.191392 (4.746191)	0.163479 / 0.680424 (-0.516945)	0.018364 / 0.534201 (-0.515837)	0.433296 / 0.579283 (-0.145987)	0.432463 / 0.434364 (-0.001901)	0.512000 / 0.540337 (-0.028338)	0.619397 / 1.386936 (-0.767539)
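
As a reading aid for the benchmark tables in this comment: each cell appears to follow the pattern `new / old (diff)`, and the reported numbers are consistent with diff = new - old up to rounding in the sixth decimal place. The sketch below is not part of this PR or of the benchmark tooling; `parse_cell` and `parse_table` are hypothetical helper names, and the code assumes the header and value rows are tab-separated as they appear in this thread.

```python
import re

# One benchmark cell, e.g. "0.006897 / 0.011353 (-0.004456)".
CELL_RE = re.compile(
    r"^\s*([-+]?\d+\.\d+)\s*/\s*([-+]?\d+\.\d+)\s*\(([-+]?\d+\.\d+)\)\s*$"
)


def parse_cell(cell):
    """Return (new, old, diff) as floats for one 'new / old (diff)' cell."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_table(header_line, value_line):
    """Map each metric name to its (new, old, diff) tuple.

    Both rows are tab-separated. The first column holds the row label
    ("metric" or "new / old (diff)") and is skipped.
    """
    names = header_line.rstrip("\n").split("\t")[1:]
    cells = value_line.rstrip("\n").split("\t")[1:]
    if len(names) != len(cells):
        raise ValueError("header and value rows have different lengths")
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


# Rows copied from the PyArrow==8.0.0 benchmark_indices_mapping.json table above.
header = "\t".join(
    ["metric", "select", "shard", "shuffle", "sort", "train_test_split"]
)
values = "\t".join(
    [
        "new / old (diff)",
        "0.029517 / 0.037411 (-0.007894)",
        "0.120974 / 0.014526 (0.106448)",
        "0.125070 / 0.176557 (-0.051486)",
        "0.184695 / 0.737135 (-0.552441)",
        "0.130244 / 0.296338 (-0.166095)",
    ]
)

table = parse_table(header, values)
for name, (new, old, diff) in table.items():
    # A negative diff means the new measurement is lower than the old one.
    direction = "lower" if diff < 0 else "higher"
    print(f"{name}: new is {direction} than old by {abs(diff):.6f}")
```

Parsing into floats rather than comparing strings makes it easy to sort metrics by `diff` or to recheck that `new - old` matches the reported value within a small tolerance.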

HuggingFaceDocBuilderDev · 2023-07-03T18:34:20Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

LGTM

utils/release.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

github-actions · 2023-07-06T16:36:53Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010097 / 0.011353 (-0.001256)	0.005070 / 0.011008 (-0.005939)	0.118638 / 0.038508 (0.080130)	0.043651 / 0.023109 (0.020542)	0.356074 / 0.275898 (0.080176)	0.414578 / 0.323480 (0.091098)	0.005939 / 0.007986 (-0.002046)	0.004927 / 0.004328 (0.000598)	0.089545 / 0.004250 (0.085294)	0.067533 / 0.037052 (0.030481)	0.371550 / 0.258489 (0.113061)	0.417808 / 0.293841 (0.123967)	0.045186 / 0.128546 (-0.083361)	0.015763 / 0.075646 (-0.059883)	0.393304 / 0.419271 (-0.025967)	0.065123 / 0.043533 (0.021591)	0.345057 / 0.255139 (0.089918)	0.378809 / 0.283200 (0.095610)	0.033243 / 0.141683 (-0.108440)	1.679956 / 1.452155 (0.227802)	1.775456 / 1.492716 (0.282739)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229723 / 0.018006 (0.211717)	0.554630 / 0.000490 (0.554140)	0.008729 / 0.000200 (0.008529)	0.000183 / 0.000054 (0.000129)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027284 / 0.037411 (-0.010128)	0.114741 / 0.014526 (0.100215)	0.129188 / 0.176557 (-0.047369)	0.189270 / 0.737135 (-0.547866)	0.126000 / 0.296338 (-0.170339)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.580417 / 0.215209 (0.365208)	5.829337 / 2.077655 (3.751683)	2.421191 / 1.504120 (0.917071)	2.063673 / 1.541195 (0.522479)	2.133427 / 1.468490 (0.664937)	0.830964 / 4.584777 (-3.753813)	5.107139 / 3.745712 (1.361427)	4.599451 / 5.269862 (-0.670410)	2.406502 / 4.565676 (-2.159175)	0.100422 / 0.424275 (-0.323853)	0.011850 / 0.007607 (0.004243)	0.741881 / 0.226044 (0.515836)	7.425689 / 2.268929 (5.156760)	3.068948 / 55.444624 (-52.375676)	2.496292 / 6.876477 (-4.380184)	2.566420 / 2.142072 (0.424348)	1.093084 / 4.805227 (-3.712144)	0.224106 / 6.500664 (-6.276558)	0.084549 / 0.075469 (0.009080)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.416315 / 1.841788 (-0.425473)	16.306901 / 8.074308 (8.232593)	19.792419 / 10.191392 (9.601027)	0.224223 / 0.680424 (-0.456201)	0.026385 / 0.534201 (-0.507816)	0.463460 / 0.579283 (-0.115823)	0.598385 / 0.434364 (0.164021)	0.543981 / 0.540337 (0.003644)	0.647454 / 1.386936 (-0.739482)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009470 / 0.011353 (-0.001883)	0.004800 / 0.011008 (-0.006208)	0.094276 / 0.038508 (0.055768)	0.045157 / 0.023109 (0.022048)	0.397302 / 0.275898 (0.121404)	0.474213 / 0.323480 (0.150733)	0.005826 / 0.007986 (-0.002160)	0.003724 / 0.004328 (-0.000605)	0.090060 / 0.004250 (0.085809)	0.066671 / 0.037052 (0.029618)	0.439560 / 0.258489 (0.181071)	0.468598 / 0.293841 (0.174757)	0.044549 / 0.128546 (-0.083997)	0.014000 / 0.075646 (-0.061646)	0.110457 / 0.419271 (-0.308815)	0.065898 / 0.043533 (0.022365)	0.408101 / 0.255139 (0.152962)	0.433473 / 0.283200 (0.150273)	0.038438 / 0.141683 (-0.103245)	1.767781 / 1.452155 (0.315626)	1.791575 / 1.492716 (0.298859)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.230257 / 0.018006 (0.212251)	0.492280 / 0.000490 (0.491790)	0.005110 / 0.000200 (0.004910)	0.000119 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028854 / 0.037411 (-0.008557)	0.111702 / 0.014526 (0.097176)	0.122040 / 0.176557 (-0.054517)	0.179103 / 0.737135 (-0.558032)	0.128869 / 0.296338 (-0.167470)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.634795 / 0.215209 (0.419586)	6.204760 / 2.077655 (4.127105)	2.692479 / 1.504120 (1.188359)	2.324260 / 1.541195 (0.783066)	2.380640 / 1.468490 (0.912149)	0.887827 / 4.584777 (-3.696950)	5.251648 / 3.745712 (1.505935)	2.632767 / 5.269862 (-2.637095)	1.745721 / 4.565676 (-2.819955)	0.108364 / 0.424275 (-0.315911)	0.013409 / 0.007607 (0.005802)	0.783427 / 0.226044 (0.557383)	7.765144 / 2.268929 (5.496216)	3.340686 / 55.444624 (-52.103938)	2.715340 / 6.876477 (-4.161137)	2.768604 / 2.142072 (0.626531)	1.119746 / 4.805227 (-3.685481)	0.210804 / 6.500664 (-6.289860)	0.072600 / 0.075469 (-0.002869)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.517334 / 1.841788 (-0.324454)	17.046837 / 8.074308 (8.972529)	19.371090 / 10.191392 (9.179698)	0.194275 / 0.680424 (-0.486148)	0.026712 / 0.534201 (-0.507488)	0.462731 / 0.579283 (-0.116552)	0.568958 / 0.434364 (0.134595)	0.555707 / 0.540337 (0.015370)	0.663654 / 1.386936 (-0.723283)

github-actions · 2023-07-06T17:04:10Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006423 / 0.011353 (-0.004930)	0.003882 / 0.011008 (-0.007126)	0.082976 / 0.038508 (0.044468)	0.071281 / 0.023109 (0.048171)	0.311367 / 0.275898 (0.035469)	0.348228 / 0.323480 (0.024748)	0.005315 / 0.007986 (-0.002671)	0.003326 / 0.004328 (-0.001003)	0.064641 / 0.004250 (0.060391)	0.056134 / 0.037052 (0.019081)	0.314071 / 0.258489 (0.055582)	0.360534 / 0.293841 (0.066693)	0.030642 / 0.128546 (-0.097904)	0.008301 / 0.075646 (-0.067345)	0.285820 / 0.419271 (-0.133451)	0.069241 / 0.043533 (0.025708)	0.313995 / 0.255139 (0.058856)	0.336656 / 0.283200 (0.053457)	0.031686 / 0.141683 (-0.109997)	1.467627 / 1.452155 (0.015472)	1.536493 / 1.492716 (0.043777)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.196518 / 0.018006 (0.178512)	0.458235 / 0.000490 (0.457745)	0.005599 / 0.000200 (0.005399)	0.000088 / 0.000054 (0.000034)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027371 / 0.037411 (-0.010040)	0.080986 / 0.014526 (0.066460)	0.093296 / 0.176557 (-0.083260)	0.150592 / 0.737135 (-0.586543)	0.094150 / 0.296338 (-0.202188)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.379412 / 0.215209 (0.164202)	3.797927 / 2.077655 (1.720272)	1.830654 / 1.504120 (0.326534)	1.669569 / 1.541195 (0.128374)	1.746738 / 1.468490 (0.278248)	0.479536 / 4.584777 (-4.105241)	3.592867 / 3.745712 (-0.152845)	5.468098 / 5.269862 (0.198237)	3.268013 / 4.565676 (-1.297663)	0.056635 / 0.424275 (-0.367640)	0.007224 / 0.007607 (-0.000383)	0.456681 / 0.226044 (0.230636)	4.566736 / 2.268929 (2.297807)	2.362831 / 55.444624 (-53.081793)	1.965141 / 6.876477 (-4.911336)	2.156905 / 2.142072 (0.014833)	0.572543 / 4.805227 (-4.232684)	0.132203 / 6.500664 (-6.368461)	0.059254 / 0.075469 (-0.016215)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.256134 / 1.841788 (-0.585654)	19.905438 / 8.074308 (11.831130)	14.179556 / 10.191392 (3.988164)	0.168043 / 0.680424 (-0.512381)	0.018215 / 0.534201 (-0.515986)	0.392740 / 0.579283 (-0.186543)	0.398397 / 0.434364 (-0.035967)	0.463806 / 0.540337 (-0.076531)	0.616248 / 1.386936 (-0.770688)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006564 / 0.011353 (-0.004789)	0.003923 / 0.011008 (-0.007085)	0.063929 / 0.038508 (0.025421)	0.073780 / 0.023109 (0.050671)	0.360242 / 0.275898 (0.084344)	0.395078 / 0.323480 (0.071598)	0.005265 / 0.007986 (-0.002720)	0.003229 / 0.004328 (-0.001100)	0.064094 / 0.004250 (0.059843)	0.057468 / 0.037052 (0.020416)	0.369530 / 0.258489 (0.111041)	0.411159 / 0.293841 (0.117318)	0.031278 / 0.128546 (-0.097268)	0.008424 / 0.075646 (-0.067222)	0.070411 / 0.419271 (-0.348860)	0.048714 / 0.043533 (0.005181)	0.361280 / 0.255139 (0.106141)	0.382468 / 0.283200 (0.099269)	0.023059 / 0.141683 (-0.118624)	1.452369 / 1.452155 (0.000215)	1.519192 / 1.492716 (0.026475)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.223745 / 0.018006 (0.205739)	0.442086 / 0.000490 (0.441596)	0.000379 / 0.000200 (0.000179)	0.000055 / 0.000054 (0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030919 / 0.037411 (-0.006493)	0.088483 / 0.014526 (0.073958)	0.101165 / 0.176557 (-0.075391)	0.154332 / 0.737135 (-0.582804)	0.103030 / 0.296338 (-0.193309)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.414520 / 0.215209 (0.199311)	4.126754 / 2.077655 (2.049099)	2.142677 / 1.504120 (0.638557)	1.995300 / 1.541195 (0.454106)	2.101678 / 1.468490 (0.633188)	0.481099 / 4.584777 (-4.103678)	3.562813 / 3.745712 (-0.182900)	3.392463 / 5.269862 (-1.877399)	1.983943 / 4.565676 (-2.581734)	0.056594 / 0.424275 (-0.367681)	0.007216 / 0.007607 (-0.000391)	0.495085 / 0.226044 (0.269041)	4.955640 / 2.268929 (2.686712)	2.629434 / 55.444624 (-52.815191)	2.269577 / 6.876477 (-4.606900)	2.357708 / 2.142072 (0.215635)	0.612370 / 4.805227 (-4.192857)	0.131169 / 6.500664 (-6.369495)	0.061029 / 0.075469 (-0.014440)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.339438 / 1.841788 (-0.502350)	19.757611 / 8.074308 (11.683303)	14.246254 / 10.191392 (4.054862)	0.170750 / 0.680424 (-0.509674)	0.018192 / 0.534201 (-0.516009)	0.395693 / 0.579283 (-0.183590)	0.411003 / 0.434364 (-0.023361)	0.478531 / 0.540337 (-0.061806)	0.650291 / 1.386936 (-0.736645)

mariosasko added 3 commits July 3, 2023 01:21

Misc improvements

280166e

More fixes

fc8397a

Check utils and setup.py style

0832d48

mariosasko requested a review from lhoestq July 6, 2023 15:23

lhoestq approved these changes Jul 6, 2023

utils/release.py (outdated)

Update utils/release.py

5d20476

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

mariosasko merged commit 3e34d06 into main Jul 6, 2023

mariosasko deleted the misc-improvement branch July 6, 2023 16:55

mariosasko mentioned this pull request Jul 20, 2023

Create release script #2478

Open
