Fix get_data_patterns for directories with the word data twice #6309

albertvillanova · 2023-10-17T09:00:39Z

Before the fix, get_data_patterns inferred the wrong split name for paths that contain the word "data" twice:

For the URL path: hf://datasets/piuba-bigdata/articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357/data/train-00001-of-00009.parquet (note the org name piuba-bigdata/ ending with data/)
The inferred split name was: articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357/data/train instead of train

This PR fixes this issue by passing the base_path (hf://datasets/piuba-bigdata/articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357) to _get_data_files_patterns and prepending it to the regex split pattern (data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].*\\..*).
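
The effect can be reproduced with plain `re`. This is only a minimal sketch, not the library's implementation: `pattern_to_regex` and `infer_split` are hypothetical helpers, and the split pattern is treated directly as a regex after replacing `{split}` with a named group (an assumption made for illustration).

```python
import re

# Split pattern quoted in the PR description, used here as a plain regex (assumption).
SPLIT_PATTERN = "data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].*\\..*"


def pattern_to_regex(pattern: str) -> str:
    # Hypothetical helper: capture the {split} placeholder as a named group.
    return pattern.replace("{split}", "(?P<split>.+)")


def infer_split(path: str, base_path: str = ""):
    # Hypothetical helper: search the path for the split pattern,
    # optionally anchored behind an escaped base_path prefix.
    regex = pattern_to_regex(SPLIT_PATTERN)
    if base_path:
        regex = re.escape(base_path) + "/" + regex
    match = re.search(regex, path)
    return match.group("split") if match else None


base_path = "hf://datasets/piuba-bigdata/articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357"
path = base_path + "/data/train-00001-of-00009.parquet"

# Without the prefix, the search starts matching at the "data/" that ends "piuba-bigdata/".
unanchored = infer_split(path)
# With the prefix, only the "data/" directory directly under base_path can match.
anchored = infer_split(path, base_path)
```

In this sketch `unanchored` reproduces the wrong split name reported above, while `anchored` yields `train`, because the escaped prefix leaves only one place where `data/` can match.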

Fix #6305.
Fix https://huggingface.co/datasets/piuba-bigdata/articles_and_comments/discussions/1

github-actions · 2023-10-17T09:01:35Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006461 / 0.011353 (-0.004891)	0.004035 / 0.011008 (-0.006973)	0.085037 / 0.038508 (0.046529)	0.072434 / 0.023109 (0.049325)	0.308565 / 0.275898 (0.032667)	0.330455 / 0.323480 (0.006975)	0.003782 / 0.007986 (-0.004204)	0.004363 / 0.004328 (0.000034)	0.065242 / 0.004250 (0.060991)	0.056111 / 0.037052 (0.019058)	0.318008 / 0.258489 (0.059519)	0.357904 / 0.293841 (0.064063)	0.030702 / 0.128546 (-0.097844)	0.008741 / 0.075646 (-0.066905)	0.287666 / 0.419271 (-0.131605)	0.052281 / 0.043533 (0.008748)	0.306894 / 0.255139 (0.051755)	0.335739 / 0.283200 (0.052540)	0.023712 / 0.141683 (-0.117971)	1.492304 / 1.452155 (0.040149)	1.544540 / 1.492716 (0.051823)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.299419 / 0.018006 (0.281413)	0.547195 / 0.000490 (0.546705)	0.011571 / 0.000200 (0.011371)	0.000223 / 0.000054 (0.000168)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028364 / 0.037411 (-0.009048)	0.081445 / 0.014526 (0.066919)	0.626670 / 0.176557 (0.450114)	0.159964 / 0.737135 (-0.577171)	0.100528 / 0.296338 (-0.195811)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.409915 / 0.215209 (0.194705)	4.108689 / 2.077655 (2.031034)	2.046247 / 1.504120 (0.542127)	1.851081 / 1.541195 (0.309887)	1.857857 / 1.468490 (0.389367)	0.493246 / 4.584777 (-4.091531)	3.581557 / 3.745712 (-0.164155)	3.456708 / 5.269862 (-1.813153)	2.051054 / 4.565676 (-2.514623)	0.057553 / 0.424275 (-0.366722)	0.007287 / 0.007607 (-0.000320)	0.493094 / 0.226044 (0.267050)	4.873051 / 2.268929 (2.604122)	2.515266 / 55.444624 (-52.929358)	2.144743 / 6.876477 (-4.731733)	2.159412 / 2.142072 (0.017340)	0.595627 / 4.805227 (-4.209601)	0.133773 / 6.500664 (-6.366891)	0.059965 / 0.075469 (-0.015504)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.259625 / 1.841788 (-0.582163)	19.030742 / 8.074308 (10.956434)	14.039246 / 10.191392 (3.847854)	0.168116 / 0.680424 (-0.512308)	0.018168 / 0.534201 (-0.516033)	0.391187 / 0.579283 (-0.188096)	0.420901 / 0.434364 (-0.013463)	0.465827 / 0.540337 (-0.074511)	0.718373 / 1.386936 (-0.668563)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006616 / 0.011353 (-0.004737)	0.004048 / 0.011008 (-0.006960)	0.064568 / 0.038508 (0.026060)	0.075933 / 0.023109 (0.052824)	0.396353 / 0.275898 (0.120455)	0.424159 / 0.323480 (0.100679)	0.005446 / 0.007986 (-0.002540)	0.003393 / 0.004328 (-0.000935)	0.064673 / 0.004250 (0.060422)	0.056983 / 0.037052 (0.019930)	0.402478 / 0.258489 (0.143989)	0.433240 / 0.293841 (0.139399)	0.032100 / 0.128546 (-0.096446)	0.008664 / 0.075646 (-0.066983)	0.070502 / 0.419271 (-0.348770)	0.047800 / 0.043533 (0.004267)	0.399506 / 0.255139 (0.144367)	0.418376 / 0.283200 (0.135176)	0.022654 / 0.141683 (-0.119029)	1.487280 / 1.452155 (0.035125)	1.543733 / 1.492716 (0.051017)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.317660 / 0.018006 (0.299654)	0.523922 / 0.000490 (0.523432)	0.007086 / 0.000200 (0.006886)	0.000109 / 0.000054 (0.000055)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032381 / 0.037411 (-0.005030)	0.091636 / 0.014526 (0.077110)	0.104743 / 0.176557 (-0.071814)	0.158793 / 0.737135 (-0.578342)	0.103164 / 0.296338 (-0.193175)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434081 / 0.215209 (0.218872)	4.329448 / 2.077655 (2.251794)	2.335855 / 1.504120 (0.831735)	2.177513 / 1.541195 (0.636319)	2.205406 / 1.468490 (0.736916)	0.500117 / 4.584777 (-4.084660)	3.693715 / 3.745712 (-0.051997)	3.305803 / 5.269862 (-1.964059)	2.048283 / 4.565676 (-2.517394)	0.058301 / 0.424275 (-0.365974)	0.007196 / 0.007607 (-0.000411)	0.512917 / 0.226044 (0.286873)	5.129283 / 2.268929 (2.860355)	2.836200 / 55.444624 (-52.608425)	2.499022 / 6.876477 (-4.377455)	2.652305 / 2.142072 (0.510232)	0.604219 / 4.805227 (-4.201008)	0.137310 / 6.500664 (-6.363354)	0.060880 / 0.075469 (-0.014589)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.346948 / 1.841788 (-0.494839)	19.499516 / 8.074308 (11.425208)	14.701500 / 10.191392 (4.510108)	0.168626 / 0.680424 (-0.511798)	0.020002 / 0.534201 (-0.514199)	0.394729 / 0.579283 (-0.184554)	0.428323 / 0.434364 (-0.006040)	0.481202 / 0.540337 (-0.059136)	0.684768 / 1.386936 (-0.702169)

HuggingFaceDocBuilderDev · 2023-10-17T09:06:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

github-actions · 2023-10-18T08:08:41Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007033 / 0.011353 (-0.004320)	0.004411 / 0.011008 (-0.006597)	0.086146 / 0.038508 (0.047638)	0.086669 / 0.023109 (0.063560)	0.329145 / 0.275898 (0.053247)	0.348728 / 0.323480 (0.025248)	0.004404 / 0.007986 (-0.003582)	0.003656 / 0.004328 (-0.000673)	0.066120 / 0.004250 (0.061869)	0.059157 / 0.037052 (0.022105)	0.316537 / 0.258489 (0.058048)	0.369065 / 0.293841 (0.075224)	0.031921 / 0.128546 (-0.096625)	0.008877 / 0.075646 (-0.066770)	0.290068 / 0.419271 (-0.129204)	0.054007 / 0.043533 (0.010475)	0.308823 / 0.255139 (0.053684)	0.331189 / 0.283200 (0.047989)	0.027313 / 0.141683 (-0.114370)	1.486772 / 1.452155 (0.034617)	1.570359 / 1.492716 (0.077643)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.315991 / 0.018006 (0.297985)	0.577876 / 0.000490 (0.577386)	0.011207 / 0.000200 (0.011007)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031753 / 0.037411 (-0.005658)	0.089270 / 0.014526 (0.074744)	0.102518 / 0.176557 (-0.074038)	0.160260 / 0.737135 (-0.576875)	0.103365 / 0.296338 (-0.192973)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.405789 / 0.215209 (0.190580)	4.052740 / 2.077655 (1.975085)	2.052076 / 1.504120 (0.547956)	1.873966 / 1.541195 (0.332771)	1.997156 / 1.468490 (0.528665)	0.494975 / 4.584777 (-4.089802)	3.600007 / 3.745712 (-0.145705)	3.626459 / 5.269862 (-1.643403)	2.176927 / 4.565676 (-2.388750)	0.057894 / 0.424275 (-0.366381)	0.007469 / 0.007607 (-0.000138)	0.487422 / 0.226044 (0.261377)	4.868744 / 2.268929 (2.599815)	2.528707 / 55.444624 (-52.915918)	2.149520 / 6.876477 (-4.726956)	2.275491 / 2.142072 (0.133419)	0.589112 / 4.805227 (-4.216115)	0.136644 / 6.500664 (-6.364020)	0.062144 / 0.075469 (-0.013325)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.286625 / 1.841788 (-0.555163)	20.528128 / 8.074308 (12.453819)	15.290866 / 10.191392 (5.099474)	0.168380 / 0.680424 (-0.512044)	0.018908 / 0.534201 (-0.515293)	0.397210 / 0.579283 (-0.182073)	0.426133 / 0.434364 (-0.008231)	0.471754 / 0.540337 (-0.068584)	0.653343 / 1.386936 (-0.733593)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007599 / 0.011353 (-0.003754)	0.004499 / 0.011008 (-0.006509)	0.066248 / 0.038508 (0.027740)	0.097704 / 0.023109 (0.074595)	0.414558 / 0.275898 (0.138660)	0.451088 / 0.323480 (0.127609)	0.005932 / 0.007986 (-0.002054)	0.003698 / 0.004328 (-0.000630)	0.065784 / 0.004250 (0.061534)	0.064777 / 0.037052 (0.027725)	0.443318 / 0.258489 (0.184829)	0.456896 / 0.293841 (0.163055)	0.033436 / 0.128546 (-0.095111)	0.008977 / 0.075646 (-0.066669)	0.072067 / 0.419271 (-0.347205)	0.049571 / 0.043533 (0.006038)	0.420325 / 0.255139 (0.165186)	0.443588 / 0.283200 (0.160388)	0.026723 / 0.141683 (-0.114960)	1.512566 / 1.452155 (0.060411)	1.647591 / 1.492716 (0.154875)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.326410 / 0.018006 (0.308404)	0.532878 / 0.000490 (0.532388)	0.006257 / 0.000200 (0.006057)	0.000104 / 0.000054 (0.000049)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037289 / 0.037411 (-0.000122)	0.104940 / 0.014526 (0.090414)	0.113597 / 0.176557 (-0.062960)	0.170562 / 0.737135 (-0.566573)	0.114583 / 0.296338 (-0.181755)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.435530 / 0.215209 (0.220321)	4.332659 / 2.077655 (2.255005)	2.343576 / 1.504120 (0.839456)	2.190517 / 1.541195 (0.649322)	2.323101 / 1.468490 (0.854611)	0.493019 / 4.584777 (-4.091758)	3.686726 / 3.745712 (-0.058986)	3.437143 / 5.269862 (-1.832719)	2.167193 / 4.565676 (-2.398483)	0.059636 / 0.424275 (-0.364639)	0.007696 / 0.007607 (0.000089)	0.511159 / 0.226044 (0.285115)	5.119358 / 2.268929 (2.850429)	2.814934 / 55.444624 (-52.629690)	2.477871 / 6.876477 (-4.398606)	2.774473 / 2.142072 (0.632401)	0.590258 / 4.805227 (-4.214969)	0.135923 / 6.500664 (-6.364741)	0.062793 / 0.075469 (-0.012676)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.350192 / 1.841788 (-0.491596)	21.382135 / 8.074308 (13.307827)	16.024198 / 10.191392 (5.832806)	0.163623 / 0.680424 (-0.516801)	0.020749 / 0.534201 (-0.513452)	0.402578 / 0.579283 (-0.176705)	0.436569 / 0.434364 (0.002205)	0.477217 / 0.540337 (-0.063121)	0.682929 / 1.386936 (-0.704007)

github-actions · 2023-10-18T08:50:56Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006671 / 0.011353 (-0.004681)	0.004176 / 0.011008 (-0.006832)	0.084095 / 0.038508 (0.045587)	0.076345 / 0.023109 (0.053236)	0.341201 / 0.275898 (0.065303)	0.381920 / 0.323480 (0.058440)	0.005578 / 0.007986 (-0.002408)	0.003535 / 0.004328 (-0.000794)	0.065227 / 0.004250 (0.060976)	0.054983 / 0.037052 (0.017931)	0.345938 / 0.258489 (0.087449)	0.398708 / 0.293841 (0.104867)	0.031029 / 0.128546 (-0.097518)	0.008643 / 0.075646 (-0.067004)	0.287286 / 0.419271 (-0.131985)	0.052424 / 0.043533 (0.008892)	0.342914 / 0.255139 (0.087775)	0.366982 / 0.283200 (0.083782)	0.024511 / 0.141683 (-0.117172)	1.510575 / 1.452155 (0.058421)	1.593214 / 1.492716 (0.100497)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.272703 / 0.018006 (0.254697)	0.583235 / 0.000490 (0.582746)	0.008467 / 0.000200 (0.008267)	0.000295 / 0.000054 (0.000240)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029654 / 0.037411 (-0.007757)	0.085078 / 0.014526 (0.070552)	0.106391 / 0.176557 (-0.070165)	0.155790 / 0.737135 (-0.581345)	0.104835 / 0.296338 (-0.191503)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.408584 / 0.215209 (0.193375)	4.082557 / 2.077655 (2.004902)	2.054001 / 1.504120 (0.549881)	1.868470 / 1.541195 (0.327275)	1.950600 / 1.468490 (0.482110)	0.492572 / 4.584777 (-4.092205)	3.497105 / 3.745712 (-0.248607)	3.464596 / 5.269862 (-1.805265)	2.106399 / 4.565676 (-2.459278)	0.057413 / 0.424275 (-0.366862)	0.007449 / 0.007607 (-0.000158)	0.482900 / 0.226044 (0.256856)	4.844152 / 2.268929 (2.575223)	2.499930 / 55.444624 (-52.944695)	2.180396 / 6.876477 (-4.696081)	2.282830 / 2.142072 (0.140758)	0.581371 / 4.805227 (-4.223857)	0.134641 / 6.500664 (-6.366023)	0.063137 / 0.075469 (-0.012332)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.274291 / 1.841788 (-0.567496)	19.426189 / 8.074308 (11.351881)	14.292833 / 10.191392 (4.101441)	0.166321 / 0.680424 (-0.514102)	0.018419 / 0.534201 (-0.515782)	0.392433 / 0.579283 (-0.186850)	0.415128 / 0.434364 (-0.019236)	0.459274 / 0.540337 (-0.081063)	0.714668 / 1.386936 (-0.672268)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006740 / 0.011353 (-0.004613)	0.004283 / 0.011008 (-0.006725)	0.063845 / 0.038508 (0.025337)	0.077037 / 0.023109 (0.053927)	0.425103 / 0.275898 (0.149205)	0.445525 / 0.323480 (0.122046)	0.005755 / 0.007986 (-0.002230)	0.003589 / 0.004328 (-0.000739)	0.064515 / 0.004250 (0.060265)	0.057398 / 0.037052 (0.020346)	0.424781 / 0.258489 (0.166292)	0.452162 / 0.293841 (0.158321)	0.032164 / 0.128546 (-0.096382)	0.008660 / 0.075646 (-0.066986)	0.069873 / 0.419271 (-0.349399)	0.048100 / 0.043533 (0.004567)	0.409097 / 0.255139 (0.153958)	0.441533 / 0.283200 (0.158333)	0.024122 / 0.141683 (-0.117560)	1.503431 / 1.452155 (0.051277)	1.577518 / 1.492716 (0.084802)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.264433 / 0.018006 (0.246426)	0.553631 / 0.000490 (0.553141)	0.006354 / 0.000200 (0.006154)	0.000106 / 0.000054 (0.000051)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033259 / 0.037411 (-0.004152)	0.094908 / 0.014526 (0.080382)	0.108238 / 0.176557 (-0.068318)	0.161354 / 0.737135 (-0.575781)	0.109073 / 0.296338 (-0.187265)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434450 / 0.215209 (0.219241)	4.347501 / 2.077655 (2.269847)	2.362225 / 1.504120 (0.858105)	2.189285 / 1.541195 (0.648090)	2.288797 / 1.468490 (0.820307)	0.487782 / 4.584777 (-4.096995)	3.598732 / 3.745712 (-0.146980)	3.343263 / 5.269862 (-1.926599)	2.086256 / 4.565676 (-2.479420)	0.057838 / 0.424275 (-0.366437)	0.007412 / 0.007607 (-0.000195)	0.510098 / 0.226044 (0.284054)	5.088743 / 2.268929 (2.819814)	2.809105 / 55.444624 (-52.635519)	2.476005 / 6.876477 (-4.400471)	2.753785 / 2.142072 (0.611712)	0.585045 / 4.805227 (-4.220182)	0.131162 / 6.500664 (-6.369502)	0.060431 / 0.075469 (-0.015038)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.342149 / 1.841788 (-0.499639)	20.602369 / 8.074308 (12.528061)	14.973301 / 10.191392 (4.781909)	0.151655 / 0.680424 (-0.528769)	0.020793 / 0.534201 (-0.513408)	0.401657 / 0.579283 (-0.177626)	0.419845 / 0.434364 (-0.014519)	0.467225 / 0.540337 (-0.073113)	0.672469 / 1.386936 (-0.714467)

github-actions · 2023-10-18T11:26:50Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007006 / 0.011353 (-0.004346)	0.004282 / 0.011008 (-0.006726)	0.085413 / 0.038508 (0.046905)	0.085148 / 0.023109 (0.062038)	0.336543 / 0.275898 (0.060645)	0.367959 / 0.323480 (0.044479)	0.004337 / 0.007986 (-0.003648)	0.004535 / 0.004328 (0.000207)	0.065379 / 0.004250 (0.061128)	0.059993 / 0.037052 (0.022941)	0.343162 / 0.258489 (0.084673)	0.383766 / 0.293841 (0.089925)	0.031520 / 0.128546 (-0.097026)	0.008605 / 0.075646 (-0.067042)	0.288620 / 0.419271 (-0.130651)	0.053617 / 0.043533 (0.010084)	0.339389 / 0.255139 (0.084250)	0.350842 / 0.283200 (0.067642)	0.027816 / 0.141683 (-0.113867)	1.505500 / 1.452155 (0.053346)	1.566511 / 1.492716 (0.073795)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.272203 / 0.018006 (0.254197)	0.569729 / 0.000490 (0.569240)	0.010061 / 0.000200 (0.009861)	0.000328 / 0.000054 (0.000273)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030015 / 0.037411 (-0.007396)	0.083991 / 0.014526 (0.069465)	0.099796 / 0.176557 (-0.076761)	0.159131 / 0.737135 (-0.578004)	0.099102 / 0.296338 (-0.197237)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.390076 / 0.215209 (0.174867)	3.897157 / 2.077655 (1.819502)	1.935912 / 1.504120 (0.431793)	1.815109 / 1.541195 (0.273915)	1.875041 / 1.468490 (0.406551)	0.482168 / 4.584777 (-4.102609)	3.556140 / 3.745712 (-0.189572)	3.528889 / 5.269862 (-1.740972)	2.132767 / 4.565676 (-2.432909)	0.057761 / 0.424275 (-0.366514)	0.007353 / 0.007607 (-0.000254)	0.464801 / 0.226044 (0.238757)	4.637301 / 2.268929 (2.368372)	2.362239 / 55.444624 (-53.082386)	2.049811 / 6.876477 (-4.826665)	2.143485 / 2.142072 (0.001412)	0.580929 / 4.805227 (-4.224299)	0.140252 / 6.500664 (-6.360412)	0.061352 / 0.075469 (-0.014117)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.257487 / 1.841788 (-0.584301)	19.453319 / 8.074308 (11.379011)	14.276332 / 10.191392 (4.084940)	0.166772 / 0.680424 (-0.513652)	0.018339 / 0.534201 (-0.515862)	0.393008 / 0.579283 (-0.186275)	0.420960 / 0.434364 (-0.013404)	0.464331 / 0.540337 (-0.076007)	0.717973 / 1.386936 (-0.668963)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007255 / 0.011353 (-0.004098)	0.004230 / 0.011008 (-0.006778)	0.065191 / 0.038508 (0.026683)	0.085765 / 0.023109 (0.062655)	0.412464 / 0.275898 (0.136566)	0.446067 / 0.323480 (0.122587)	0.005875 / 0.007986 (-0.002110)	0.003700 / 0.004328 (-0.000628)	0.065430 / 0.004250 (0.061179)	0.060284 / 0.037052 (0.023231)	0.419984 / 0.258489 (0.161495)	0.453779 / 0.293841 (0.159938)	0.032595 / 0.128546 (-0.095952)	0.008873 / 0.075646 (-0.066773)	0.072124 / 0.419271 (-0.347148)	0.048072 / 0.043533 (0.004539)	0.408725 / 0.255139 (0.153586)	0.432485 / 0.283200 (0.149285)	0.024662 / 0.141683 (-0.117021)	1.540434 / 1.452155 (0.088279)	1.624768 / 1.492716 (0.132051)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.253220 / 0.018006 (0.235214)	0.555469 / 0.000490 (0.554980)	0.007765 / 0.000200 (0.007565)	0.000101 / 0.000054 (0.000046)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032666 / 0.037411 (-0.004745)	0.094786 / 0.014526 (0.080260)	0.108219 / 0.176557 (-0.068337)	0.161546 / 0.737135 (-0.575589)	0.109828 / 0.296338 (-0.186510)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.437024 / 0.215209 (0.221815)	4.354065 / 2.077655 (2.276411)	2.336832 / 1.504120 (0.832713)	2.161959 / 1.541195 (0.620764)	2.257214 / 1.468490 (0.788724)	0.501576 / 4.584777 (-4.083201)	3.654292 / 3.745712 (-0.091420)	3.349504 / 5.269862 (-1.920357)	2.092998 / 4.565676 (-2.472679)	0.058740 / 0.424275 (-0.365535)	0.007420 / 0.007607 (-0.000187)	0.513443 / 0.226044 (0.287399)	5.151247 / 2.268929 (2.882319)	2.816036 / 55.444624 (-52.628589)	2.451863 / 6.876477 (-4.424613)	2.709908 / 2.142072 (0.567836)	0.597834 / 4.805227 (-4.207394)	0.136547 / 6.500664 (-6.364117)	0.062030 / 0.075469 (-0.013439)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.371412 / 1.841788 (-0.470375)	20.398981 / 8.074308 (12.324673)	14.932307 / 10.191392 (4.740915)	0.167796 / 0.680424 (-0.512628)	0.020740 / 0.534201 (-0.513461)	0.397162 / 0.579283 (-0.182121)	0.435493 / 0.434364 (0.001129)	0.477074 / 0.540337 (-0.063264)	0.697546 / 1.386936 (-0.689390)

github-actions · 2023-10-18T12:25:24Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007388 / 0.011353 (-0.003964)	0.004408 / 0.011008 (-0.006600)	0.098225 / 0.038508 (0.059717)	0.079368 / 0.023109 (0.056259)	0.381866 / 0.275898 (0.105968)	0.425942 / 0.323480 (0.102462)	0.005978 / 0.007986 (-0.002007)	0.003677 / 0.004328 (-0.000651)	0.075488 / 0.004250 (0.071238)	0.061725 / 0.037052 (0.024672)	0.389126 / 0.258489 (0.130637)	0.444099 / 0.293841 (0.150258)	0.036222 / 0.128546 (-0.092324)	0.009926 / 0.075646 (-0.065720)	0.336632 / 0.419271 (-0.082640)	0.060867 / 0.043533 (0.017335)	0.385437 / 0.255139 (0.130298)	0.416599 / 0.283200 (0.133399)	0.025118 / 0.141683 (-0.116565)	1.728073 / 1.452155 (0.275919)	1.847750 / 1.492716 (0.355033)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.263774 / 0.018006 (0.245768)	0.491242 / 0.000490 (0.490752)	0.013621 / 0.000200 (0.013421)	0.000333 / 0.000054 (0.000279)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032911 / 0.037411 (-0.004500)	0.095738 / 0.014526 (0.081212)	0.110482 / 0.176557 (-0.066075)	0.175533 / 0.737135 (-0.561603)	0.109240 / 0.296338 (-0.187098)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.453967 / 0.215209 (0.238758)	4.489384 / 2.077655 (2.411730)	2.185496 / 1.504120 (0.681376)	1.979126 / 1.541195 (0.437931)	2.016364 / 1.468490 (0.547874)	0.565539 / 4.584777 (-4.019238)	4.106561 / 3.745712 (0.360849)	3.906402 / 5.269862 (-1.363460)	2.342186 / 4.565676 (-2.223491)	0.067815 / 0.424275 (-0.356460)	0.008663 / 0.007607 (0.001056)	0.543841 / 0.226044 (0.317796)	5.433491 / 2.268929 (3.164563)	2.785723 / 55.444624 (-52.658901)	2.355716 / 6.876477 (-4.520760)	2.397563 / 2.142072 (0.255491)	0.682587 / 4.805227 (-4.122641)	0.156548 / 6.500664 (-6.344116)	0.070654 / 0.075469 (-0.004815)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.475183 / 1.841788 (-0.366605)	21.353030 / 8.074308 (13.278722)	15.938324 / 10.191392 (5.746932)	0.167010 / 0.680424 (-0.513413)	0.020931 / 0.534201 (-0.513270)	0.464376 / 0.579283 (-0.114907)	0.472546 / 0.434364 (0.038182)	0.544645 / 0.540337 (0.004308)	0.752940 / 1.386936 (-0.633996)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007359 / 0.011353 (-0.003994)	0.004276 / 0.011008 (-0.006732)	0.075345 / 0.038508 (0.036837)	0.080105 / 0.023109 (0.056995)	0.480456 / 0.275898 (0.204558)	0.514974 / 0.323480 (0.191494)	0.006087 / 0.007986 (-0.001899)	0.003717 / 0.004328 (-0.000611)	0.075067 / 0.004250 (0.070816)	0.063739 / 0.037052 (0.026686)	0.487569 / 0.258489 (0.229080)	0.530198 / 0.293841 (0.236357)	0.036056 / 0.128546 (-0.092491)	0.009606 / 0.075646 (-0.066041)	0.082343 / 0.419271 (-0.336929)	0.055488 / 0.043533 (0.011956)	0.484789 / 0.255139 (0.229650)	0.501918 / 0.283200 (0.218718)	0.025340 / 0.141683 (-0.116342)	1.784417 / 1.452155 (0.332262)	1.854202 / 1.492716 (0.361486)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.252476 / 0.018006 (0.234470)	0.484967 / 0.000490 (0.484478)	0.005471 / 0.000200 (0.005271)	0.000111 / 0.000054 (0.000057)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.037084 / 0.037411 (-0.000327)	0.106648 / 0.014526 (0.092122)	0.123393 / 0.176557 (-0.053164)	0.183088 / 0.737135 (-0.554047)	0.122572 / 0.296338 (-0.173767)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.516003 / 0.215209 (0.300793)	5.107748 / 2.077655 (3.030093)	2.778044 / 1.504120 (1.273924)	2.589944 / 1.541195 (1.048749)	2.649921 / 1.468490 (1.181431)	0.572783 / 4.584777 (-4.011994)	4.211331 / 3.745712 (0.465619)	3.738859 / 5.269862 (-1.531003)	2.331628 / 4.565676 (-2.234048)	0.067347 / 0.424275 (-0.356928)	0.008513 / 0.007607 (0.000905)	0.601056 / 0.226044 (0.375012)	5.990921 / 2.268929 (3.721992)	3.311544 / 55.444624 (-52.133081)	2.929850 / 6.876477 (-3.946627)	3.118741 / 2.142072 (0.976669)	0.685975 / 4.805227 (-4.119253)	0.155105 / 6.500664 (-6.345559)	0.069629 / 0.075469 (-0.005840)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.602367 / 1.841788 (-0.239421)	22.577072 / 8.074308 (14.502764)	17.049655 / 10.191392 (6.858263)	0.182412 / 0.680424 (-0.498011)	0.023137 / 0.534201 (-0.511064)	0.466988 / 0.579283 (-0.112295)	0.483887 / 0.434364 (0.049523)	0.556099 / 0.540337 (0.015761)	0.798332 / 1.386936 (-0.588604)

mariosasko

Thanks for the fix!

github-actions · 2023-10-18T14:01:52Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009086 / 0.011353 (-0.002267)	0.004755 / 0.011008 (-0.006253)	0.128866 / 0.038508 (0.090358)	0.086099 / 0.023109 (0.062990)	0.378079 / 0.275898 (0.102181)	0.487431 / 0.323480 (0.163951)	0.004712 / 0.007986 (-0.003274)	0.003622 / 0.004328 (-0.000706)	0.081214 / 0.004250 (0.076963)	0.057226 / 0.037052 (0.020174)	0.407655 / 0.258489 (0.149166)	0.448630 / 0.293841 (0.154789)	0.049051 / 0.128546 (-0.079495)	0.014537 / 0.075646 (-0.061110)	0.467343 / 0.419271 (0.048071)	0.070482 / 0.043533 (0.026949)	0.379664 / 0.255139 (0.124525)	0.464181 / 0.283200 (0.180981)	0.039973 / 0.141683 (-0.101710)	1.731164 / 1.452155 (0.279010)	1.886895 / 1.492716 (0.394178)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.251327 / 0.018006 (0.233321)	0.502670 / 0.000490 (0.502180)	0.012183 / 0.000200 (0.011984)	0.000111 / 0.000054 (0.000057)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028892 / 0.037411 (-0.008519)	0.093789 / 0.014526 (0.079263)	0.104255 / 0.176557 (-0.072301)	0.170257 / 0.737135 (-0.566879)	0.115430 / 0.296338 (-0.180909)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.573745 / 0.215209 (0.358536)	5.873732 / 2.077655 (3.796077)	2.485188 / 1.504120 (0.981068)	2.018476 / 1.541195 (0.477282)	2.062765 / 1.468490 (0.594275)	0.913816 / 4.584777 (-3.670961)	5.362338 / 3.745712 (1.616626)	4.698758 / 5.269862 (-0.571103)	3.132973 / 4.565676 (-1.432703)	0.093594 / 0.424275 (-0.330681)	0.008359 / 0.007607 (0.000751)	0.693997 / 0.226044 (0.467953)	7.042645 / 2.268929 (4.773717)	3.196180 / 55.444624 (-52.248445)	2.384585 / 6.876477 (-4.491892)	2.301256 / 2.142072 (0.159183)	1.048025 / 4.805227 (-3.757202)	0.206931 / 6.500664 (-6.293733)	0.069401 / 0.075469 (-0.006068)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.598898 / 1.841788 (-0.242889)	22.963667 / 8.074308 (14.889359)	20.373688 / 10.191392 (10.182296)	0.239716 / 0.680424 (-0.440707)	0.040213 / 0.534201 (-0.493988)	0.503268 / 0.579283 (-0.076015)	0.630750 / 0.434364 (0.196386)	0.578007 / 0.540337 (0.037669)	0.789564 / 1.386936 (-0.597372)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009129 / 0.011353 (-0.002224)	0.005453 / 0.011008 (-0.005555)	0.101040 / 0.038508 (0.062532)	0.099172 / 0.023109 (0.076062)	0.508453 / 0.275898 (0.232555)	0.570858 / 0.323480 (0.247378)	0.006584 / 0.007986 (-0.001401)	0.003800 / 0.004328 (-0.000528)	0.094349 / 0.004250 (0.090098)	0.064642 / 0.037052 (0.027590)	0.563008 / 0.258489 (0.304518)	0.625560 / 0.293841 (0.331719)	0.050121 / 0.128546 (-0.078426)	0.014183 / 0.075646 (-0.061463)	0.106564 / 0.419271 (-0.312707)	0.061030 / 0.043533 (0.017498)	0.522311 / 0.255139 (0.267172)	0.598356 / 0.283200 (0.315156)	0.042008 / 0.141683 (-0.099675)	1.879999 / 1.452155 (0.427844)	1.963879 / 1.492716 (0.471162)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.270573 / 0.018006 (0.252567)	0.554356 / 0.000490 (0.553866)	0.008145 / 0.000200 (0.007945)	0.000218 / 0.000054 (0.000163)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031089 / 0.037411 (-0.006322)	0.099568 / 0.014526 (0.085043)	0.118304 / 0.176557 (-0.058253)	0.182991 / 0.737135 (-0.554144)	0.115874 / 0.296338 (-0.180465)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.615020 / 0.215209 (0.399811)	6.279740 / 2.077655 (4.202085)	2.882094 / 1.504120 (1.377974)	2.559265 / 1.541195 (1.018070)	2.639259 / 1.468490 (1.170769)	0.903727 / 4.584777 (-3.681050)	5.248555 / 3.745712 (1.502843)	4.817340 / 5.269862 (-0.452522)	3.056880 / 4.565676 (-1.508797)	0.096602 / 0.424275 (-0.327673)	0.008660 / 0.007607 (0.001053)	0.794347 / 0.226044 (0.568303)	7.625127 / 2.268929 (5.356198)	3.766826 / 55.444624 (-51.677798)	2.968254 / 6.876477 (-3.908223)	3.260595 / 2.142072 (1.118523)	1.066228 / 4.805227 (-3.739000)	0.207158 / 6.500664 (-6.293506)	0.076920 / 0.075469 (0.001451)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.741442 / 1.841788 (-0.100345)	23.499552 / 8.074308 (15.425244)	22.064966 / 10.191392 (11.873574)	0.239173 / 0.680424 (-0.441251)	0.032105 / 0.534201 (-0.502096)	0.484709 / 0.579283 (-0.094574)	0.583632 / 0.434364 (0.149268)	0.569018 / 0.540337 (0.028681)	0.815764 / 1.386936 (-0.571172)

* Test get_data_patterns from directory with the word data twice
* Fix get_data_patterns
* Use glob_pattern_to_regex in entire xjoin
* Fix test by passing base_path as posix
* Use slash instead of xjoin for data files patterns
* Fix slash sep
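
For illustration, the wrong split name that these commits address can be reproduced with plain `re`. This is only a minimal sketch, not the code in `datasets`: `infer_split` and `SPLIT_REGEX` are hypothetical names, the `{split}` placeholder is assumed to become a greedy named group, and the regex and paths are the ones quoted in the PR description.

```python
import re

# Regex split pattern quoted in the PR description, with a "{split}" placeholder.
SPLIT_REGEX = r"data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9].*\..*"

BASE_PATH = (
    "hf://datasets/piuba-bigdata/"
    "articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357"
)
PATH = BASE_PATH + "/data/train-00001-of-00009.parquet"


def infer_split(path, base_path=None):
    """Hypothetical helper: extract the split name from a data file path."""
    # Assumption: the placeholder is turned into a greedy named capture group.
    regex = SPLIT_REGEX.replace("{split}", "(?P<split>.+)")
    if base_path is not None:
        # Prepending the escaped base path pins "data/" to the repository
        # root, so an earlier "...data/" in the org name cannot match first.
        regex = re.escape(base_path.rstrip("/")) + "/" + regex
    match = re.search(regex, path)
    return match.group("split") if match else None


# Without a base path, the search starts at the "data/" that ends "piuba-bigdata/".
print(infer_split(PATH))
# With the base path prepended, only the real "data/" directory can match.
print(infer_split(PATH, base_path=BASE_PATH))
```

The first call prints `articles_and_comments@f328d536425ae8fcac5d098c8408f437bffdd357/data/train`, the value reported in the PR description, and the second prints `train`.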

albertvillanova added 2 commits October 17, 2023 10:51

Test get_data_patterns from directory with the word data twice

30d2c2e

Fix get_data_patterns

fed9c07

albertvillanova mentioned this pull request Oct 17, 2023

Cannot load dataset with 2.14.5: FileNotFound error #6305

Closed

albertvillanova changed the title ~~Fix get_data_patterns for direcotries with the word data twice~~ Fix get_data_patterns for directories with the word data twice Oct 17, 2023

Use glob_pattern_to_regex in entire xjoin

fa36173

Fix test by passing base_path as posix

474beaf

Use slash instead of xjoin for data files patterns

017cefb

Fix slash sep

3e6d831

mariosasko approved these changes Oct 18, 2023

albertvillanova merged commit 3aeb078 into main Oct 18, 2023
13 checks passed

albertvillanova deleted the fix-6305 branch October 18, 2023 13:50

ZachNagengast mentioned this pull request Oct 19, 2023

Fix regex get_data_files formatting for base paths #6322

Merged
