Support streaming datasets with os.path.exists and Path.exists #5400

albertvillanova · 2023-01-03T07:42:37Z

Support streaming datasets with os.path.exists and pathlib.Path.exists.
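For readers unfamiliar with the idea, here is a minimal sketch of what a streaming-aware `exists` check could look like. It is not the code from this PR: the names `as_url_or_path` and `stream_aware_exists` and the injected `remote_exists` callback are hypothetical, chosen only for illustration. The sketch assumes a path is remote whenever it contains `://`, and it restores the `//` that `pathlib` collapses when a URL is wrapped in a path object.

```python
import os
import re
from pathlib import PurePath, PurePosixPath
from typing import Callable, Union

PathLike = Union[str, PurePath]


def as_url_or_path(path: PathLike) -> str:
    """Return the path as a string, restoring the '//' pathlib drops after a URL scheme."""
    text = path.as_posix() if isinstance(path, PurePath) else os.fspath(path)
    # pathlib turns "https://host/file" into "https:/host/file"; undo that.
    # Schemes need at least two characters so Windows drives ("C:/...") are left alone.
    return re.sub(r"^([A-Za-z][A-Za-z0-9+.-]+):/(?!/)", r"\1://", text)


def stream_aware_exists(path: PathLike, remote_exists: Callable[[str], bool]) -> bool:
    """Check local paths with os.path.exists and hand URLs to remote_exists.

    remote_exists is a hypothetical callback (for example a HEAD request or a
    filesystem-library lookup); it is injected so the sketch stays offline.
    """
    text = as_url_or_path(path)
    if "://" in text:  # assumption: any "scheme://" prefix means a remote path
        return remote_exists(text)
    return os.path.exists(text)


# Usage: a local directory is checked on disk, a URL goes to the callback.
local_ok = stream_aware_exists(os.getcwd(), remote_exists=lambda url: False)
remote_ok = stream_aware_exists(
    PurePosixPath("https://example.com/data.csv"),
    remote_exists=lambda url: url.endswith(".csv"),
)
```

Injecting the remote check keeps the dispatch logic testable without network access; a real implementation would replace the callback with an actual remote lookup.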

HuggingFaceDocBuilderDev · 2023-01-03T07:47:14Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Nice, thanks!

github-actions · 2023-01-06T10:42:43Z


PyArrow==6.0.0


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008638 / 0.011353 (-0.002715)	0.004565 / 0.011008 (-0.006444)	0.098984 / 0.038508 (0.060476)	0.030118 / 0.023109 (0.007009)	0.321779 / 0.275898 (0.045881)	0.366905 / 0.323480 (0.043426)	0.006931 / 0.007986 (-0.001055)	0.004728 / 0.004328 (0.000399)	0.078358 / 0.004250 (0.074108)	0.037755 / 0.037052 (0.000702)	0.312694 / 0.258489 (0.054205)	0.351781 / 0.293841 (0.057940)	0.033266 / 0.128546 (-0.095280)	0.011397 / 0.075646 (-0.064250)	0.323501 / 0.419271 (-0.095771)	0.040779 / 0.043533 (-0.002754)	0.303533 / 0.255139 (0.048394)	0.340940 / 0.283200 (0.057740)	0.088701 / 0.141683 (-0.052982)	1.472058 / 1.452155 (0.019904)	1.529535 / 1.492716 (0.036818)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.191803 / 0.018006 (0.173797)	0.409773 / 0.000490 (0.409283)	0.002704 / 0.000200 (0.002504)	0.000217 / 0.000054 (0.000163)
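Each cell in these tables has the form `new / old (diff)`. The figures are consistent with `diff` being `new` minus `old`, up to rounding in the last digit, although the report itself does not state this or give the unit. The sketch below parses one tab-separated table under that assumption; `parse_cell` and `parse_row` are hypothetical helper names, not part of the benchmark tooling.

```python
import re

# One cell looks like "0.191803 / 0.018006 (0.173797)".
CELL = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\(\s*(-?[\d.]+)\s*\)\s*$")


def parse_cell(cell: str) -> tuple:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(header_line: str, value_line: str) -> dict:
    """Map each metric name to its (new, old, diff) triple, skipping the label column."""
    names = header_line.split("\t")[1:]
    cells = value_line.split("\t")[1:]
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


header = "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows\tget_first_row\tget_last_row"
values = (
    "new / old (diff)\t0.191803 / 0.018006 (0.173797)\t0.409773 / 0.000490 (0.409283)"
    "\t0.002704 / 0.000200 (0.002504)\t0.000217 / 0.000054 (0.000163)"
)
row = parse_row(header, values)

# Assumption being checked: diff == new - old, allowing for rounding to six decimals.
consistent = all(abs((new - old) - diff) <= 2e-6 for new, old, diff in row.values())
```

A positive `diff` therefore means the `new` figure is larger than the `old` one for that metric.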

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023520 / 0.037411 (-0.013891)	0.096967 / 0.014526 (0.082441)	0.107911 / 0.176557 (-0.068646)	0.146425 / 0.737135 (-0.590710)	0.109025 / 0.296338 (-0.187314)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.418565 / 0.215209 (0.203356)	4.183429 / 2.077655 (2.105774)	1.886534 / 1.504120 (0.382414)	1.689015 / 1.541195 (0.147820)	1.710757 / 1.468490 (0.242267)	0.693211 / 4.584777 (-3.891566)	3.380062 / 3.745712 (-0.365650)	2.619910 / 5.269862 (-2.649952)	1.457512 / 4.565676 (-3.108164)	0.082421 / 0.424275 (-0.341854)	0.012126 / 0.007607 (0.004519)	0.525249 / 0.226044 (0.299205)	5.244541 / 2.268929 (2.975613)	2.305908 / 55.444624 (-53.138717)	1.945298 / 6.876477 (-4.931178)	2.015618 / 2.142072 (-0.126455)	0.816746 / 4.805227 (-3.988481)	0.148325 / 6.500664 (-6.352339)	0.063939 / 0.075469 (-0.011530)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.255790 / 1.841788 (-0.585998)	13.433219 / 8.074308 (5.358911)	13.916957 / 10.191392 (3.725565)	0.153468 / 0.680424 (-0.526956)	0.028722 / 0.534201 (-0.505479)	0.398245 / 0.579283 (-0.181038)	0.399067 / 0.434364 (-0.035296)	0.457525 / 0.540337 (-0.082812)	0.542391 / 1.386936 (-0.844545)

PyArrow==latest


Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006411 / 0.011353 (-0.004942)	0.004552 / 0.011008 (-0.006456)	0.098036 / 0.038508 (0.059527)	0.026532 / 0.023109 (0.003422)	0.412270 / 0.275898 (0.136372)	0.442771 / 0.323480 (0.119291)	0.004891 / 0.007986 (-0.003094)	0.003488 / 0.004328 (-0.000841)	0.075437 / 0.004250 (0.071186)	0.036228 / 0.037052 (-0.000824)	0.413246 / 0.258489 (0.154757)	0.453546 / 0.293841 (0.159705)	0.031054 / 0.128546 (-0.097492)	0.011589 / 0.075646 (-0.064058)	0.318477 / 0.419271 (-0.100794)	0.041075 / 0.043533 (-0.002457)	0.411182 / 0.255139 (0.156043)	0.436991 / 0.283200 (0.153792)	0.086563 / 0.141683 (-0.055120)	1.511948 / 1.452155 (0.059793)	1.570925 / 1.492716 (0.078208)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.200510 / 0.018006 (0.182504)	0.403450 / 0.000490 (0.402960)	0.000397 / 0.000200 (0.000197)	0.000058 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023950 / 0.037411 (-0.013461)	0.097334 / 0.014526 (0.082808)	0.105228 / 0.176557 (-0.071328)	0.137699 / 0.737135 (-0.599436)	0.107063 / 0.296338 (-0.189275)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.474420 / 0.215209 (0.259211)	4.748212 / 2.077655 (2.670557)	2.407318 / 1.504120 (0.903198)	2.198949 / 1.541195 (0.657755)	2.220377 / 1.468490 (0.751887)	0.704022 / 4.584777 (-3.880755)	3.366128 / 3.745712 (-0.379584)	1.839454 / 5.269862 (-3.430408)	1.151183 / 4.565676 (-3.414493)	0.082818 / 0.424275 (-0.341457)	0.012765 / 0.007607 (0.005158)	0.571913 / 0.226044 (0.345868)	5.722544 / 2.268929 (3.453615)	2.858279 / 55.444624 (-52.586346)	2.513479 / 6.876477 (-4.362998)	2.574227 / 2.142072 (0.432154)	0.803282 / 4.805227 (-4.001945)	0.150603 / 6.500664 (-6.350061)	0.066594 / 0.075469 (-0.008875)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.301161 / 1.841788 (-0.540627)	13.580745 / 8.074308 (5.506436)	13.301551 / 10.191392 (3.110159)	0.141424 / 0.680424 (-0.539000)	0.016579 / 0.534201 (-0.517622)	0.380726 / 0.579283 (-0.198557)	0.383011 / 0.434364 (-0.051353)	0.438717 / 0.540337 (-0.101620)	0.527085 / 1.386936 (-0.859851)

albertvillanova added 2 commits January 3, 2023 08:34

Test streaming path.exists

1e251dd

Implement streaming path.exists

0c630f3

lhoestq approved these changes Jan 5, 2023


albertvillanova merged commit 7c61e55 into huggingface:main Jan 6, 2023

albertvillanova deleted the stream-path-exists branch January 6, 2023 10:35
