Preserve `stopping_strategy` of shuffled interleaved dataset (random cycling case) #5816

mariosasko · 2023-05-03T18:34:18Z

Preserve the `stopping_strategy` in `RandomlyCyclingMultiSourcesExamplesIterable.shard_data_sources` to fix shuffling a dataset that was interleaved from multiple sources with probabilities.

Fix #5812
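To illustrate the kind of bug described above, here is a minimal, self-contained sketch. It is not the `datasets` implementation: `RandomCyclingSketch`, `rebuild_buggy`, and `rebuild_fixed` are hypothetical names, and the sampling loop is a simplified assumption about how random cycling with a stopping strategy behaves. The point is only that a method which returns a new instance has to forward `stopping_strategy`, otherwise the copy silently falls back to the default.

```python
import random


class RandomCyclingSketch:
    """Toy stand-in (hypothetical) for an iterable that samples sources by probability."""

    def __init__(self, sources, probabilities, seed, stopping_strategy="first_exhausted"):
        self.sources = [list(s) for s in sources]  # each source is assumed non-empty
        self.probabilities = list(probabilities)
        self.seed = seed
        self.stopping_strategy = stopping_strategy

    def __iter__(self):
        rng = random.Random(self.seed)
        iterators = [iter(s) for s in self.sources]
        exhausted = [False] * len(self.sources)
        indices = list(range(len(self.sources)))
        while True:
            # Pick the next source according to the sampling probabilities.
            i = rng.choices(indices, weights=self.probabilities)[0]
            try:
                yield next(iterators[i])
            except StopIteration:
                exhausted[i] = True
                if self.stopping_strategy == "first_exhausted" or all(exhausted):
                    return
                # "all_exhausted": restart the finished source and keep going.
                iterators[i] = iter(self.sources[i])
                yield next(iterators[i])

    def rebuild_buggy(self, seed):
        # Bug pattern: the new instance drops stopping_strategy and gets the default.
        return RandomCyclingSketch(self.sources, self.probabilities, seed)

    def rebuild_fixed(self, seed):
        # Fix pattern: forward stopping_strategy to the new instance.
        return RandomCyclingSketch(
            self.sources, self.probabilities, seed, self.stopping_strategy
        )


original = RandomCyclingSketch(
    [[1, 2, 3, 4, 5, 6], [10]], [0.5, 0.5], seed=0, stopping_strategy="all_exhausted"
)
buggy = original.rebuild_buggy(seed=1)
fixed = original.rebuild_fixed(seed=1)
print(buggy.stopping_strategy, fixed.stopping_strategy)
```

In this sketch the rebuilt "buggy" copy stops as soon as any source runs out, while the "fixed" copy keeps sampling until every source has been seen in full, which is what the caller asked for.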

github-actions · 2023-05-03T18:38:32Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007862 / 0.011353 (-0.003491)	0.005747 / 0.011008 (-0.005261)	0.106818 / 0.038508 (0.068310)	0.036630 / 0.023109 (0.013521)	0.344218 / 0.275898 (0.068320)	0.398803 / 0.323480 (0.075324)	0.006187 / 0.007986 (-0.001799)	0.005686 / 0.004328 (0.001358)	0.078568 / 0.004250 (0.074318)	0.051786 / 0.037052 (0.014734)	0.361736 / 0.258489 (0.103247)	0.396323 / 0.293841 (0.102482)	0.037943 / 0.128546 (-0.090603)	0.013957 / 0.075646 (-0.061689)	0.366782 / 0.419271 (-0.052490)	0.054700 / 0.043533 (0.011167)	0.349692 / 0.255139 (0.094553)	0.366481 / 0.283200 (0.083281)	0.117394 / 0.141683 (-0.024289)	1.593156 / 1.452155 (0.141001)	1.708864 / 1.492716 (0.216148)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229529 / 0.018006 (0.211523)	0.490531 / 0.000490 (0.490042)	0.002934 / 0.000200 (0.002734)	0.000094 / 0.000054 (0.000040)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028074 / 0.037411 (-0.009337)	0.122321 / 0.014526 (0.107795)	0.129120 / 0.176557 (-0.047436)	0.188413 / 0.737135 (-0.548722)	0.138983 / 0.296338 (-0.157355)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.479350 / 0.215209 (0.264141)	4.926201 / 2.077655 (2.848546)	2.265557 / 1.504120 (0.761437)	2.014580 / 1.541195 (0.473386)	2.120517 / 1.468490 (0.652027)	0.795334 / 4.584777 (-3.789443)	4.509754 / 3.745712 (0.764042)	4.328313 / 5.269862 (-0.941548)	2.153304 / 4.565676 (-2.412373)	0.102942 / 0.424275 (-0.321333)	0.053504 / 0.007607 (0.045896)	0.609392 / 0.226044 (0.383347)	6.114048 / 2.268929 (3.845119)	2.773306 / 55.444624 (-52.671318)	2.443434 / 6.876477 (-4.433042)	2.612005 / 2.142072 (0.469932)	0.950435 / 4.805227 (-3.854792)	0.194081 / 6.500664 (-6.306583)	0.074513 / 0.075469 (-0.000956)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.402897 / 1.841788 (-0.438891)	18.263033 / 8.074308 (10.188724)	16.579809 / 10.191392 (6.388417)	0.212319 / 0.680424 (-0.468104)	0.020468 / 0.534201 (-0.513733)	0.494850 / 0.579283 (-0.084433)	0.483790 / 0.434364 (0.049426)	0.572073 / 0.540337 (0.031735)	0.684353 / 1.386936 (-0.702583)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009732 / 0.011353 (-0.001621)	0.005901 / 0.011008 (-0.005107)	0.084568 / 0.038508 (0.046060)	0.038743 / 0.023109 (0.015634)	0.431323 / 0.275898 (0.155425)	0.472124 / 0.323480 (0.148644)	0.006255 / 0.007986 (-0.001731)	0.005892 / 0.004328 (0.001563)	0.081913 / 0.004250 (0.077662)	0.055560 / 0.037052 (0.018507)	0.442857 / 0.258489 (0.184368)	0.481887 / 0.293841 (0.188046)	0.040730 / 0.128546 (-0.087816)	0.014339 / 0.075646 (-0.061307)	0.099258 / 0.419271 (-0.320013)	0.054692 / 0.043533 (0.011159)	0.436323 / 0.255139 (0.181184)	0.461046 / 0.283200 (0.177846)	0.125972 / 0.141683 (-0.015710)	1.673173 / 1.452155 (0.221018)	1.781364 / 1.492716 (0.288648)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.271450 / 0.018006 (0.253444)	0.514484 / 0.000490 (0.513994)	0.000455 / 0.000200 (0.000255)	0.000061 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036104 / 0.037411 (-0.001308)	0.143306 / 0.014526 (0.128780)	0.151105 / 0.176557 (-0.025451)	0.210737 / 0.737135 (-0.526399)	0.151404 / 0.296338 (-0.144934)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.573613 / 0.215209 (0.358404)	5.828222 / 2.077655 (3.750567)	2.993028 / 1.504120 (1.488908)	2.617900 / 1.541195 (1.076706)	2.754673 / 1.468490 (1.286183)	1.010624 / 4.584777 (-3.574152)	4.971261 / 3.745712 (1.225549)	4.382017 / 5.269862 (-0.887845)	1.971894 / 4.565676 (-2.593782)	0.104404 / 0.424275 (-0.319871)	0.014595 / 0.007607 (0.006988)	0.657684 / 0.226044 (0.431639)	6.566151 / 2.268929 (4.297222)	3.221378 / 55.444624 (-52.223246)	2.809402 / 6.876477 (-4.067075)	2.882426 / 2.142072 (0.740354)	1.006134 / 4.805227 (-3.799093)	0.204469 / 6.500664 (-6.296196)	0.078147 / 0.075469 (0.002678)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.574768 / 1.841788 (-0.267020)	18.193335 / 8.074308 (10.119027)	17.275353 / 10.191392 (7.083961)	0.166890 / 0.680424 (-0.513534)	0.020612 / 0.534201 (-0.513589)	0.496179 / 0.579283 (-0.083104)	0.507824 / 0.434364 (0.073460)	0.620984 / 0.540337 (0.080647)	0.749727 / 1.386936 (-0.637209)

HuggingFaceDocBuilderDev · 2023-05-03T18:38:38Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

good catch :)

github-actions · 2023-05-04T14:31:55Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006534 / 0.011353 (-0.004819)	0.004456 / 0.011008 (-0.006553)	0.097978 / 0.038508 (0.059470)	0.027614 / 0.023109 (0.004505)	0.309833 / 0.275898 (0.033935)	0.337006 / 0.323480 (0.013526)	0.004986 / 0.007986 (-0.002999)	0.004521 / 0.004328 (0.000193)	0.075053 / 0.004250 (0.070803)	0.037095 / 0.037052 (0.000043)	0.305430 / 0.258489 (0.046941)	0.345298 / 0.293841 (0.051457)	0.029784 / 0.128546 (-0.098762)	0.011449 / 0.075646 (-0.064197)	0.323346 / 0.419271 (-0.095925)	0.042188 / 0.043533 (-0.001345)	0.318653 / 0.255139 (0.063514)	0.333799 / 0.283200 (0.050599)	0.088194 / 0.141683 (-0.053488)	1.511012 / 1.452155 (0.058857)	1.578205 / 1.492716 (0.085489)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.229695 / 0.018006 (0.211689)	0.413276 / 0.000490 (0.412786)	0.009142 / 0.000200 (0.008942)	0.000537 / 0.000054 (0.000482)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024327 / 0.037411 (-0.013084)	0.097953 / 0.014526 (0.083427)	0.105551 / 0.176557 (-0.071005)	0.169397 / 0.737135 (-0.567738)	0.109784 / 0.296338 (-0.186554)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.417713 / 0.215209 (0.202504)	4.190703 / 2.077655 (2.113048)	1.873504 / 1.504120 (0.369384)	1.664540 / 1.541195 (0.123346)	1.704539 / 1.468490 (0.236049)	0.699840 / 4.584777 (-3.884937)	3.480605 / 3.745712 (-0.265107)	1.844229 / 5.269862 (-3.425633)	1.155793 / 4.565676 (-3.409883)	0.083013 / 0.424275 (-0.341262)	0.012414 / 0.007607 (0.004807)	0.518357 / 0.226044 (0.292313)	5.186136 / 2.268929 (2.917207)	2.329263 / 55.444624 (-53.115361)	1.991395 / 6.876477 (-4.885081)	2.074563 / 2.142072 (-0.067509)	0.801388 / 4.805227 (-4.003839)	0.152236 / 6.500664 (-6.348428)	0.067414 / 0.075469 (-0.008055)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.197290 / 1.841788 (-0.644497)	13.666537 / 8.074308 (5.592229)	13.017190 / 10.191392 (2.825798)	0.142109 / 0.680424 (-0.538314)	0.016321 / 0.534201 (-0.517880)	0.378434 / 0.579283 (-0.200849)	0.381101 / 0.434364 (-0.053263)	0.444113 / 0.540337 (-0.096225)	0.521448 / 1.386936 (-0.865488)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006273 / 0.011353 (-0.005080)	0.004408 / 0.011008 (-0.006600)	0.077100 / 0.038508 (0.038592)	0.027361 / 0.023109 (0.004251)	0.358170 / 0.275898 (0.082272)	0.390125 / 0.323480 (0.066646)	0.004736 / 0.007986 (-0.003250)	0.004663 / 0.004328 (0.000334)	0.077626 / 0.004250 (0.073376)	0.037103 / 0.037052 (0.000051)	0.360044 / 0.258489 (0.101555)	0.411539 / 0.293841 (0.117698)	0.030173 / 0.128546 (-0.098373)	0.011618 / 0.075646 (-0.064028)	0.086036 / 0.419271 (-0.333235)	0.039077 / 0.043533 (-0.004456)	0.382223 / 0.255139 (0.127084)	0.384817 / 0.283200 (0.101618)	0.094591 / 0.141683 (-0.047092)	1.494961 / 1.452155 (0.042807)	1.583769 / 1.492716 (0.091053)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227467 / 0.018006 (0.209460)	0.396648 / 0.000490 (0.396159)	0.000382 / 0.000200 (0.000182)	0.000057 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025346 / 0.037411 (-0.012065)	0.102086 / 0.014526 (0.087560)	0.108570 / 0.176557 (-0.067986)	0.158777 / 0.737135 (-0.578359)	0.112885 / 0.296338 (-0.183453)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.460731 / 0.215209 (0.245522)	4.556450 / 2.077655 (2.478795)	2.258185 / 1.504120 (0.754065)	2.122584 / 1.541195 (0.581389)	2.224638 / 1.468490 (0.756148)	0.691909 / 4.584777 (-3.892868)	3.482634 / 3.745712 (-0.263078)	2.772837 / 5.269862 (-2.497024)	1.533897 / 4.565676 (-3.031780)	0.083025 / 0.424275 (-0.341250)	0.012629 / 0.007607 (0.005022)	0.548397 / 0.226044 (0.322352)	5.492005 / 2.268929 (3.223077)	2.669841 / 55.444624 (-52.774784)	2.366947 / 6.876477 (-4.509529)	2.496795 / 2.142072 (0.354722)	0.804868 / 4.805227 (-4.000359)	0.151686 / 6.500664 (-6.348978)	0.068333 / 0.075469 (-0.007136)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.320414 / 1.841788 (-0.521374)	14.367567 / 8.074308 (6.293258)	14.047702 / 10.191392 (3.856310)	0.129087 / 0.680424 (-0.551337)	0.016658 / 0.534201 (-0.517543)	0.381949 / 0.579283 (-0.197335)	0.390105 / 0.434364 (-0.044258)	0.445947 / 0.540337 (-0.094390)	0.531074 / 1.386936 (-0.855862)

mariosasko added 2 commits May 3, 2023 19:59

Preserve strategy when shuffling interleaved dataset (random cycling case)

21f0e39

Annotate stopping_strategy with Literal

06988d3
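As a rough illustration of what annotating `stopping_strategy` with `Literal` can look like, the sketch below uses a hypothetical alias `StoppingStrategy` and a hypothetical helper `check_stopping_strategy`; neither name comes from this PR. The two allowed values are the ones documented for `interleave_datasets` at the time, which is an assumption about what the commit annotates.

```python
from typing import Literal, get_args

# Hypothetical alias; the actual annotation in the library may be written inline.
StoppingStrategy = Literal["first_exhausted", "all_exhausted"]


def check_stopping_strategy(value: str) -> str:
    """Return value unchanged if it is an allowed strategy, else raise ValueError."""
    allowed = get_args(StoppingStrategy)
    if value not in allowed:
        raise ValueError(f"{value!r} is not a valid stopping strategy, expected one of {allowed}")
    return value
```

A `Literal` annotation lets type checkers flag misspelled strategy names statically; the runtime check is an optional extra shown here only for completeness.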

mariosasko requested a review from lhoestq May 3, 2023 19:09

lhoestq approved these changes May 4, 2023

mariosasko merged commit c67c9f3 into main May 4, 2023

mariosasko deleted the fix-5812 branch May 4, 2023 14:24
