Filter unsupported extensions #5972

lhoestq · 2023-06-21T15:43:01Z

I used a regex to filter the data files based on their extension for packaged builders.

I tried both, and a regex is 10x faster than using the in operator to check whether the extension is in the list of supported extensions.
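For illustration, here is a minimal sketch of the two approaches being compared. The extension list, function names, and the way the suffix is extracted for the membership check are assumptions made for this example, not the code from the PR, and the measured ratio will depend on the machine and on how the extension is extracted.

```python
import re
import timeit
from pathlib import PurePosixPath

# Hypothetical list of supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = [".csv", ".json", ".jsonl", ".parquet", ".txt"]

# One compiled pattern: any prefix, then one of the supported extensions at the end.
EXTENSION_PATTERN = re.compile(
    r".*(" + "|".join(re.escape(ext) for ext in SUPPORTED_EXTENSIONS) + r")$"
)


def filter_with_regex(file_names):
    """Keep the file names that match the compiled extension pattern."""
    return [name for name in file_names if EXTENSION_PATTERN.match(name)]


def filter_with_in(file_names):
    """Keep the file names whose suffix is in the list of supported extensions."""
    return [
        name
        for name in file_names
        if PurePosixPath(name).suffix in SUPPORTED_EXTENSIONS
    ]


if __name__ == "__main__":
    # Synthetic file names; half of them have an unsupported extension.
    names = [f"data/file_{i}.csv" for i in range(500)]
    names += [f"data/file_{i}.png" for i in range(500)]
    regex_time = timeit.timeit(lambda: filter_with_regex(names), number=100)
    in_time = timeit.timeit(lambda: filter_with_in(names), number=100)
    print(f"regex: {regex_time:.4f}s, in: {in_time:.4f}s")
```

Both functions return the same files on simple names; only the matching strategy differs.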

Supersedes #5850

Close #5849

I also made a small change to favor the parquet module in case of a tie in the extension counter.
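A tie-break of this kind could look roughly like the sketch below. The extension-to-module mapping and the infer_module helper are hypothetical names used only for this example; the actual inference logic in the library may differ.

```python
from collections import Counter

# Hypothetical mapping from file extension to packaged module name.
EXTENSION_TO_MODULE = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".txt": "text",
}


def infer_module(file_names):
    """Return the most frequent module, preferring parquet when counts are tied."""
    counter = Counter()
    for name in file_names:
        for ext, module in EXTENSION_TO_MODULE.items():
            if name.endswith(ext):
                counter[module] += 1
                break
    if not counter:
        return None
    # Compare by count first; on equal counts, True > False favors "parquet".
    return max(counter, key=lambda module: (counter[module], module == "parquet"))
```

With one CSV file and one Parquet file the counts are tied, so the sketch picks parquet; a strict majority for another module still wins.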

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

github-actions · 2023-06-21T15:49:49Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006983 / 0.011353 (-0.004369)	0.004473 / 0.011008 (-0.006535)	0.105158 / 0.038508 (0.066650)	0.048973 / 0.023109 (0.025864)	0.358771 / 0.275898 (0.082873)	0.432389 / 0.323480 (0.108909)	0.005689 / 0.007986 (-0.002297)	0.003584 / 0.004328 (-0.000744)	0.080852 / 0.004250 (0.076601)	0.066133 / 0.037052 (0.029081)	0.370981 / 0.258489 (0.112492)	0.406942 / 0.293841 (0.113101)	0.032123 / 0.128546 (-0.096424)	0.009313 / 0.075646 (-0.066333)	0.355220 / 0.419271 (-0.064051)	0.055768 / 0.043533 (0.012235)	0.370545 / 0.255139 (0.115406)	0.375619 / 0.283200 (0.092419)	0.024258 / 0.141683 (-0.117425)	1.559073 / 1.452155 (0.106918)	1.616520 / 1.492716 (0.123804)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.277893 / 0.018006 (0.259887)	0.535447 / 0.000490 (0.534957)	0.004877 / 0.000200 (0.004677)	0.000092 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029444 / 0.037411 (-0.007968)	0.114366 / 0.014526 (0.099841)	0.130957 / 0.176557 (-0.045599)	0.189604 / 0.737135 (-0.547531)	0.131682 / 0.296338 (-0.164656)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.412315 / 0.215209 (0.197106)	4.093879 / 2.077655 (2.016225)	1.856169 / 1.504120 (0.352050)	1.655358 / 1.541195 (0.114164)	1.758190 / 1.468490 (0.289699)	0.545829 / 4.584777 (-4.038948)	3.871436 / 3.745712 (0.125724)	1.938244 / 5.269862 (-3.331618)	1.122727 / 4.565676 (-3.442950)	0.067107 / 0.424275 (-0.357168)	0.012012 / 0.007607 (0.004405)	0.518868 / 0.226044 (0.292824)	5.235081 / 2.268929 (2.966153)	2.335115 / 55.444624 (-53.109509)	2.013074 / 6.876477 (-4.863402)	2.219808 / 2.142072 (0.077735)	0.674602 / 4.805227 (-4.130626)	0.147051 / 6.500664 (-6.353613)	0.068444 / 0.075469 (-0.007025)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.245600 / 1.841788 (-0.596188)	15.537727 / 8.074308 (7.463419)	15.074300 / 10.191392 (4.882908)	0.194217 / 0.680424 (-0.486207)	0.018536 / 0.534201 (-0.515665)	0.437085 / 0.579283 (-0.142198)	0.441123 / 0.434364 (0.006759)	0.530681 / 0.540337 (-0.009657)	0.649154 / 1.386936 (-0.737782)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007243 / 0.011353 (-0.004110)	0.004688 / 0.011008 (-0.006320)	0.079809 / 0.038508 (0.041301)	0.046915 / 0.023109 (0.023805)	0.415144 / 0.275898 (0.139246)	0.474867 / 0.323480 (0.151388)	0.004550 / 0.007986 (-0.003435)	0.004585 / 0.004328 (0.000257)	0.080837 / 0.004250 (0.076587)	0.061667 / 0.037052 (0.024614)	0.411321 / 0.258489 (0.152832)	0.464195 / 0.293841 (0.170354)	0.032510 / 0.128546 (-0.096037)	0.009306 / 0.075646 (-0.066340)	0.086637 / 0.419271 (-0.332635)	0.053335 / 0.043533 (0.009802)	0.402302 / 0.255139 (0.147163)	0.424864 / 0.283200 (0.141664)	0.026573 / 0.141683 (-0.115110)	1.566793 / 1.452155 (0.114639)	1.628118 / 1.492716 (0.135401)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.317802 / 0.018006 (0.299796)	0.544593 / 0.000490 (0.544103)	0.005690 / 0.000200 (0.005490)	0.000107 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033015 / 0.037411 (-0.004397)	0.121940 / 0.014526 (0.107414)	0.132920 / 0.176557 (-0.043637)	0.191481 / 0.737135 (-0.545655)	0.139139 / 0.296338 (-0.157199)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.460382 / 0.215209 (0.245173)	4.610046 / 2.077655 (2.532392)	2.296573 / 1.504120 (0.792453)	2.099735 / 1.541195 (0.558540)	2.213913 / 1.468490 (0.745423)	0.544871 / 4.584777 (-4.039906)	3.814174 / 3.745712 (0.068462)	3.246397 / 5.269862 (-2.023464)	1.480236 / 4.565676 (-3.085440)	0.068464 / 0.424275 (-0.355811)	0.012651 / 0.007607 (0.005043)	0.564989 / 0.226044 (0.338944)	5.639188 / 2.268929 (3.370259)	2.827601 / 55.444624 (-52.617023)	2.473743 / 6.876477 (-4.402734)	2.567413 / 2.142072 (0.425340)	0.674351 / 4.805227 (-4.130876)	0.146248 / 6.500664 (-6.354416)	0.067553 / 0.075469 (-0.007916)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.346703 / 1.841788 (-0.495085)	16.494787 / 8.074308 (8.420479)	15.179487 / 10.191392 (4.988095)	0.181864 / 0.680424 (-0.498560)	0.018857 / 0.534201 (-0.515344)	0.437787 / 0.579283 (-0.141496)	0.431770 / 0.434364 (-0.002594)	0.507116 / 0.540337 (-0.033221)	0.608899 / 1.386936 (-0.778037)

HuggingFaceDocBuilderDev · 2023-06-21T16:55:44Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-06-21T16:57:13Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005963 / 0.011353 (-0.005390)	0.003743 / 0.011008 (-0.007265)	0.098519 / 0.038508 (0.060011)	0.037392 / 0.023109 (0.014283)	0.322706 / 0.275898 (0.046808)	0.380032 / 0.323480 (0.056552)	0.004694 / 0.007986 (-0.003292)	0.002897 / 0.004328 (-0.001432)	0.078664 / 0.004250 (0.074414)	0.052646 / 0.037052 (0.015594)	0.335523 / 0.258489 (0.077034)	0.375464 / 0.293841 (0.081623)	0.027537 / 0.128546 (-0.101010)	0.008452 / 0.075646 (-0.067194)	0.313844 / 0.419271 (-0.105427)	0.047368 / 0.043533 (0.003835)	0.313833 / 0.255139 (0.058694)	0.342284 / 0.283200 (0.059085)	0.021136 / 0.141683 (-0.120547)	1.544764 / 1.452155 (0.092610)	1.563850 / 1.492716 (0.071134)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.188609 / 0.018006 (0.170603)	0.421686 / 0.000490 (0.421196)	0.003336 / 0.000200 (0.003136)	0.000077 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023678 / 0.037411 (-0.013733)	0.099191 / 0.014526 (0.084665)	0.105819 / 0.176557 (-0.070738)	0.169654 / 0.737135 (-0.567481)	0.110240 / 0.296338 (-0.186099)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425497 / 0.215209 (0.210288)	4.237165 / 2.077655 (2.159510)	1.902953 / 1.504120 (0.398833)	1.699012 / 1.541195 (0.157818)	1.751107 / 1.468490 (0.282617)	0.563326 / 4.584777 (-4.021451)	3.394189 / 3.745712 (-0.351523)	2.706129 / 5.269862 (-2.563732)	1.361522 / 4.565676 (-3.204155)	0.067776 / 0.424275 (-0.356499)	0.010959 / 0.007607 (0.003352)	0.530905 / 0.226044 (0.304860)	5.322467 / 2.268929 (3.053538)	2.384356 / 55.444624 (-53.060269)	2.044196 / 6.876477 (-4.832281)	2.119837 / 2.142072 (-0.022235)	0.682236 / 4.805227 (-4.122991)	0.136921 / 6.500664 (-6.363743)	0.066784 / 0.075469 (-0.008685)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.210642 / 1.841788 (-0.631146)	13.804572 / 8.074308 (5.730264)	13.309229 / 10.191392 (3.117837)	0.154356 / 0.680424 (-0.526068)	0.016833 / 0.534201 (-0.517368)	0.366503 / 0.579283 (-0.212780)	0.385201 / 0.434364 (-0.049163)	0.426713 / 0.540337 (-0.113624)	0.516795 / 1.386936 (-0.870141)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006144 / 0.011353 (-0.005209)	0.003723 / 0.011008 (-0.007285)	0.077427 / 0.038508 (0.038919)	0.037636 / 0.023109 (0.014527)	0.375048 / 0.275898 (0.099150)	0.442254 / 0.323480 (0.118774)	0.003506 / 0.007986 (-0.004480)	0.003751 / 0.004328 (-0.000577)	0.076771 / 0.004250 (0.072521)	0.047915 / 0.037052 (0.010862)	0.378918 / 0.258489 (0.120429)	0.435300 / 0.293841 (0.141459)	0.028317 / 0.128546 (-0.100230)	0.008413 / 0.075646 (-0.067233)	0.082774 / 0.419271 (-0.336497)	0.043211 / 0.043533 (-0.000321)	0.362022 / 0.255139 (0.106883)	0.404928 / 0.283200 (0.121728)	0.020692 / 0.141683 (-0.120991)	1.527303 / 1.452155 (0.075148)	1.596091 / 1.492716 (0.103375)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.225537 / 0.018006 (0.207530)	0.399901 / 0.000490 (0.399412)	0.000424 / 0.000200 (0.000224)	0.000058 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026483 / 0.037411 (-0.010928)	0.104373 / 0.014526 (0.089847)	0.111271 / 0.176557 (-0.065286)	0.163872 / 0.737135 (-0.573264)	0.113991 / 0.296338 (-0.182347)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.456484 / 0.215209 (0.241275)	4.572652 / 2.077655 (2.494998)	2.374908 / 1.504120 (0.870788)	2.207855 / 1.541195 (0.666661)	2.260009 / 1.468490 (0.791519)	0.562678 / 4.584777 (-4.022099)	3.441778 / 3.745712 (-0.303934)	1.729006 / 5.269862 (-3.540855)	1.024937 / 4.565676 (-3.540739)	0.068707 / 0.424275 (-0.355568)	0.011334 / 0.007607 (0.003727)	0.564293 / 0.226044 (0.338248)	5.638367 / 2.268929 (3.369438)	2.665654 / 55.444624 (-52.778970)	2.320033 / 6.876477 (-4.556444)	2.328706 / 2.142072 (0.186634)	0.677433 / 4.805227 (-4.127794)	0.137190 / 6.500664 (-6.363474)	0.068585 / 0.075469 (-0.006885)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.312476 / 1.841788 (-0.529312)	14.206685 / 8.074308 (6.132377)	14.217928 / 10.191392 (4.026536)	0.143416 / 0.680424 (-0.537007)	0.016647 / 0.534201 (-0.517554)	0.361228 / 0.579283 (-0.218055)	0.396185 / 0.434364 (-0.038178)	0.423275 / 0.540337 (-0.117063)	0.512966 / 1.386936 (-0.873970)

github-actions · 2023-06-21T17:32:44Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008913 / 0.011353 (-0.002440)	0.005142 / 0.011008 (-0.005866)	0.133958 / 0.038508 (0.095449)	0.049180 / 0.023109 (0.026071)	0.389169 / 0.275898 (0.113270)	0.481513 / 0.323480 (0.158033)	0.006555 / 0.007986 (-0.001430)	0.003806 / 0.004328 (-0.000522)	0.102056 / 0.004250 (0.097806)	0.083259 / 0.037052 (0.046207)	0.392536 / 0.258489 (0.134047)	0.447503 / 0.293841 (0.153662)	0.047472 / 0.128546 (-0.081074)	0.014748 / 0.075646 (-0.060899)	0.475619 / 0.419271 (0.056348)	0.107306 / 0.043533 (0.063773)	0.421942 / 0.255139 (0.166803)	0.419736 / 0.283200 (0.136536)	0.044195 / 0.141683 (-0.097488)	1.793840 / 1.452155 (0.341686)	1.960204 / 1.492716 (0.467488)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252046 / 0.018006 (0.234040)	0.627725 / 0.000490 (0.627236)	0.007435 / 0.000200 (0.007235)	0.000526 / 0.000054 (0.000472)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034656 / 0.037411 (-0.002755)	0.114534 / 0.014526 (0.100008)	0.135804 / 0.176557 (-0.040753)	0.209309 / 0.737135 (-0.527826)	0.140369 / 0.296338 (-0.155969)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.636736 / 0.215209 (0.421527)	6.039985 / 2.077655 (3.962330)	2.640141 / 1.504120 (1.136021)	2.284492 / 1.541195 (0.743297)	2.324956 / 1.468490 (0.856466)	0.934499 / 4.584777 (-3.650278)	5.673415 / 3.745712 (1.927703)	5.184584 / 5.269862 (-0.085278)	2.661911 / 4.565676 (-1.903766)	0.150420 / 0.424275 (-0.273855)	0.015655 / 0.007607 (0.008048)	0.748290 / 0.226044 (0.522246)	7.579755 / 2.268929 (5.310827)	3.346732 / 55.444624 (-52.097892)	2.708212 / 6.876477 (-4.168264)	2.682423 / 2.142072 (0.540351)	1.170389 / 4.805227 (-3.634838)	0.215775 / 6.500664 (-6.284889)	0.076360 / 0.075469 (0.000891)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.516794 / 1.841788 (-0.324993)	18.709117 / 8.074308 (10.634809)	22.492542 / 10.191392 (12.301150)	0.237978 / 0.680424 (-0.442446)	0.027828 / 0.534201 (-0.506373)	0.499968 / 0.579283 (-0.079315)	0.645899 / 0.434364 (0.211535)	0.548599 / 0.540337 (0.008262)	0.675428 / 1.386936 (-0.711508)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008469 / 0.011353 (-0.002884)	0.005420 / 0.011008 (-0.005589)	0.093340 / 0.038508 (0.054832)	0.045896 / 0.023109 (0.022786)	0.533267 / 0.275898 (0.257369)	0.596034 / 0.323480 (0.272555)	0.004816 / 0.007986 (-0.003170)	0.004379 / 0.004328 (0.000051)	0.096356 / 0.004250 (0.092106)	0.058339 / 0.037052 (0.021287)	0.574464 / 0.258489 (0.315975)	0.649301 / 0.293841 (0.355461)	0.047599 / 0.128546 (-0.080947)	0.013759 / 0.075646 (-0.061887)	0.104672 / 0.419271 (-0.314599)	0.061658 / 0.043533 (0.018125)	0.560956 / 0.255139 (0.305817)	0.585328 / 0.283200 (0.302128)	0.034137 / 0.141683 (-0.107546)	1.844528 / 1.452155 (0.392373)	1.971398 / 1.492716 (0.478682)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.278666 / 0.018006 (0.260660)	0.577342 / 0.000490 (0.576853)	0.005496 / 0.000200 (0.005296)	0.000131 / 0.000054 (0.000076)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029863 / 0.037411 (-0.007549)	0.161703 / 0.014526 (0.147177)	0.132279 / 0.176557 (-0.044277)	0.227345 / 0.737135 (-0.509791)	0.138047 / 0.296338 (-0.158291)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.651535 / 0.215209 (0.436326)	7.077949 / 2.077655 (5.000295)	2.926990 / 1.504120 (1.422871)	2.598872 / 1.541195 (1.057678)	2.614192 / 1.468490 (1.145702)	0.913845 / 4.584777 (-3.670932)	5.704301 / 3.745712 (1.958589)	2.796914 / 5.269862 (-2.472948)	1.836096 / 4.565676 (-2.729580)	0.106294 / 0.424275 (-0.317981)	0.012705 / 0.007607 (0.005098)	0.836336 / 0.226044 (0.610291)	8.234079 / 2.268929 (5.965150)	3.836410 / 55.444624 (-51.608215)	3.116752 / 6.876477 (-3.759724)	3.154258 / 2.142072 (1.012186)	1.195794 / 4.805227 (-3.609434)	0.240491 / 6.500664 (-6.260173)	0.087913 / 0.075469 (0.012444)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.724723 / 1.841788 (-0.117064)	19.492194 / 8.074308 (11.417885)	21.443341 / 10.191392 (11.251949)	0.245819 / 0.680424 (-0.434605)	0.027024 / 0.534201 (-0.507177)	0.481071 / 0.579283 (-0.098212)	0.596359 / 0.434364 (0.161995)	0.646462 / 0.540337 (0.106124)	0.706380 / 1.386936 (-0.680556)

mariosasko

Nice! One nit:

mariosasko · 2023-06-21T18:33:54Z

src/datasets/data_files.py

+        return DataFilesList(
+            [
+                data_file
+                for data_file in self
+                if pattern.match(data_file.name if isinstance(data_file, Path) else data_file)
+            ],
+            origin_metadata=self.origin_metadata,
+        )


Here we should also drop the origin metadata of the removed data files, no?

origin_metadata is the list of origin per pattern, not per data file. We don't know which pattern generated which data file, and a pattern may have generated multiple data files. So I don't think we can easily drop origin metadata during filtering.

Note that for packaged builders, patterns often look like "**" or "train/**", so there are only a few origin_metadata entries. Adding or removing unsupported files in the dataset repo won't change the origin metadata.
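To make the point concrete, here is a small sketch in which the metadata is stored once per pattern rather than once per file. FileList and filter_suffix are hypothetical stand-ins written for this example, not the library's DataFilesList API.

```python
class FileList(list):
    """Hypothetical list of data files that carries one metadata entry per pattern."""

    def __init__(self, data_files, origin_metadata):
        super().__init__(data_files)
        # One entry per resolved pattern (e.g. "train/**"), not one per file.
        self.origin_metadata = origin_metadata

    def filter_suffix(self, suffix):
        """Drop files without the given suffix; per-pattern metadata is kept as-is."""
        return FileList(
            [data_file for data_file in self if data_file.endswith(suffix)],
            origin_metadata=self.origin_metadata,
        )
```

Because a single pattern can resolve to many files, removing some of those files leaves no per-file entry to drop, so the filtered list reuses the same metadata.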

github-actions · 2023-06-22T14:23:29Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006634 / 0.011353 (-0.004719)	0.004003 / 0.011008 (-0.007005)	0.097874 / 0.038508 (0.059365)	0.043528 / 0.023109 (0.020419)	0.302293 / 0.275898 (0.026395)	0.357041 / 0.323480 (0.033561)	0.003761 / 0.007986 (-0.004225)	0.004312 / 0.004328 (-0.000016)	0.076253 / 0.004250 (0.072003)	0.062807 / 0.037052 (0.025755)	0.316737 / 0.258489 (0.058248)	0.356722 / 0.293841 (0.062881)	0.030816 / 0.128546 (-0.097730)	0.008691 / 0.075646 (-0.066955)	0.328366 / 0.419271 (-0.090906)	0.062299 / 0.043533 (0.018766)	0.293877 / 0.255139 (0.038738)	0.319832 / 0.283200 (0.036632)	0.024996 / 0.141683 (-0.116687)	1.473912 / 1.452155 (0.021758)	1.565439 / 1.492716 (0.072723)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208428 / 0.018006 (0.190422)	0.435618 / 0.000490 (0.435128)	0.000695 / 0.000200 (0.000495)	0.000056 / 0.000054 (0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026253 / 0.037411 (-0.011158)	0.106908 / 0.014526 (0.092382)	0.117075 / 0.176557 (-0.059482)	0.177969 / 0.737135 (-0.559166)	0.123400 / 0.296338 (-0.172938)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.424970 / 0.215209 (0.209761)	4.203233 / 2.077655 (2.125578)	2.009679 / 1.504120 (0.505559)	1.825691 / 1.541195 (0.284496)	1.870639 / 1.468490 (0.402149)	0.530758 / 4.584777 (-4.054019)	3.718791 / 3.745712 (-0.026921)	1.800206 / 5.269862 (-3.469656)	1.071651 / 4.565676 (-3.494025)	0.065126 / 0.424275 (-0.359149)	0.011312 / 0.007607 (0.003704)	0.532503 / 0.226044 (0.306458)	5.353950 / 2.268929 (3.085021)	2.463548 / 55.444624 (-52.981076)	2.139832 / 6.876477 (-4.736645)	2.238722 / 2.142072 (0.096650)	0.655736 / 4.805227 (-4.149492)	0.141689 / 6.500664 (-6.358975)	0.063282 / 0.075469 (-0.012187)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.183523 / 1.841788 (-0.658265)	14.146428 / 8.074308 (6.072120)	14.312883 / 10.191392 (4.121491)	0.169286 / 0.680424 (-0.511138)	0.017343 / 0.534201 (-0.516858)	0.397934 / 0.579283 (-0.181349)	0.417791 / 0.434364 (-0.016573)	0.463639 / 0.540337 (-0.076698)	0.562787 / 1.386936 (-0.824149)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006594 / 0.011353 (-0.004759)	0.004086 / 0.011008 (-0.006922)	0.075122 / 0.038508 (0.036614)	0.041849 / 0.023109 (0.018740)	0.362645 / 0.275898 (0.086747)	0.464350 / 0.323480 (0.140870)	0.003760 / 0.007986 (-0.004226)	0.003327 / 0.004328 (-0.001001)	0.076154 / 0.004250 (0.071904)	0.053232 / 0.037052 (0.016180)	0.407863 / 0.258489 (0.149374)	0.460787 / 0.293841 (0.166946)	0.031917 / 0.128546 (-0.096630)	0.008770 / 0.075646 (-0.066876)	0.082612 / 0.419271 (-0.336660)	0.051311 / 0.043533 (0.007779)	0.354508 / 0.255139 (0.099369)	0.419533 / 0.283200 (0.136334)	0.023980 / 0.141683 (-0.117703)	1.491255 / 1.452155 (0.039100)	1.536101 / 1.492716 (0.043384)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.178261 / 0.018006 (0.160255)	0.444680 / 0.000490 (0.444190)	0.013761 / 0.000200 (0.013561)	0.000117 / 0.000054 (0.000063)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027875 / 0.037411 (-0.009536)	0.111269 / 0.014526 (0.096744)	0.121096 / 0.176557 (-0.055461)	0.174387 / 0.737135 (-0.562749)	0.124714 / 0.296338 (-0.171624)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.445422 / 0.215209 (0.230213)	4.435877 / 2.077655 (2.358222)	2.221895 / 1.504120 (0.717775)	2.030571 / 1.541195 (0.489376)	2.074863 / 1.468490 (0.606373)	0.543331 / 4.584777 (-4.041446)	3.753615 / 3.745712 (0.007903)	3.317074 / 5.269862 (-1.952787)	1.630390 / 4.565676 (-2.935286)	0.066726 / 0.424275 (-0.357549)	0.011556 / 0.007607 (0.003949)	0.546985 / 0.226044 (0.320941)	5.460634 / 2.268929 (3.191705)	2.705945 / 55.444624 (-52.738679)	2.373425 / 6.876477 (-4.503052)	2.401472 / 2.142072 (0.259399)	0.663225 / 4.805227 (-4.142002)	0.143694 / 6.500664 (-6.356970)	0.065283 / 0.075469 (-0.010186)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.264804 / 1.841788 (-0.576983)	14.803228 / 8.074308 (6.728919)	14.178514 / 10.191392 (3.987122)	0.162651 / 0.680424 (-0.517772)	0.017586 / 0.534201 (-0.516615)	0.398740 / 0.579283 (-0.180543)	0.414478 / 0.434364 (-0.019886)	0.465442 / 0.540337 (-0.074895)	0.563450 / 1.386936 (-0.823486)

lhoestq and others added 2 commits June 21, 2023 17:38

add filter_extensions

7a29650

test

0fd5b74

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

keep zip archives for imagefolder and audiofolder

b424648

lhoestq marked this pull request as ready for review June 21, 2023 17:20

lhoestq requested review from albertvillanova and mariosasko June 21, 2023 17:20

minor

67ca664

mariosasko reviewed Jun 21, 2023

lhoestq merged commit 76f75a9 into main Jun 22, 2023

lhoestq deleted the filter-extensions branch June 22, 2023 14:16

albertvillanova mentioned this pull request Sep 4, 2023

No-script datasets with ZIP files do not load #6207

Closed

