Make prepare_split more robust if errors in metadata dataset_info splits #5901

albertvillanova · 2023-05-26T08:48:22Z

This PR uses split_generator.split_info as the default value for split_info if any exception is raised while looking up split_generator.name in self.info.splits (this may happen if the splits metadata in dataset_info contains errors).

Please note that split_info is only used by the logger.
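The fallback described above can be sketched as follows. This is a minimal, self-contained illustration and not the actual diff: SplitInfo and SplitGenerator here are simplified stand-ins, and resolve_split_info is a hypothetical helper name introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SplitInfo:
    """Simplified stand-in for a split's metadata record (assumption)."""
    name: str
    num_examples: int = 0


@dataclass
class SplitGenerator:
    """Simplified stand-in: each generator carries its own SplitInfo."""
    name: str
    split_info: SplitInfo = field(init=False)

    def __post_init__(self):
        self.split_info = SplitInfo(name=self.name)


def resolve_split_info(metadata_splits, split_generator):
    """Hypothetical helper: prefer the metadata entry, fall back on any error."""
    try:
        # Normal path: the split is described in the dataset_info metadata.
        return metadata_splits[split_generator.name]
    except Exception:
        # Metadata is missing or malformed: use the generator's own split_info.
        return split_generator.split_info


gen = SplitGenerator("train")
print(resolve_split_info({"train": SplitInfo("train", 10)}, gen))
print(resolve_split_info({}, gen))
```

Catching a broad Exception is a deliberate choice in this sketch: the value only feeds a log message, so any metadata problem can safely degrade to the generator's own information.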

Fix #5895 when verification_mode="no_checks" is passed:

from datasets import load_dataset

ds = load_dataset(
    "ArmelR/stack-exchange-instruction",
    data_dir="data/finetune",
    split="train",
    verification_mode="no_checks",
    revision="c609f1caade5cfbf3b9fe9cfa17d7cb000b457bd",
)

HuggingFaceDocBuilderDev · 2023-05-26T08:53:00Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-06-01T13:49:11Z


PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008809 / 0.011353 (-0.002544)	0.005641 / 0.011008 (-0.005367)	0.124986 / 0.038508 (0.086477)	0.037311 / 0.023109 (0.014202)	0.388915 / 0.275898 (0.113017)	0.430123 / 0.323480 (0.106643)	0.007447 / 0.007986 (-0.000538)	0.009593 / 0.004328 (0.005264)	0.099148 / 0.004250 (0.094898)	0.052393 / 0.037052 (0.015341)	0.399779 / 0.258489 (0.141290)	0.439109 / 0.293841 (0.145268)	0.043409 / 0.128546 (-0.085137)	0.016286 / 0.075646 (-0.059360)	0.431198 / 0.419271 (0.011927)	0.064932 / 0.043533 (0.021400)	0.390650 / 0.255139 (0.135511)	0.432883 / 0.283200 (0.149684)	0.110978 / 0.141683 (-0.030705)	1.796121 / 1.452155 (0.343967)	1.960097 / 1.492716 (0.467381)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.286292 / 0.018006 (0.268286)	0.659495 / 0.000490 (0.659005)	0.008294 / 0.000200 (0.008094)	0.000485 / 0.000054 (0.000431)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029325 / 0.037411 (-0.008086)	0.125454 / 0.014526 (0.110928)	0.136459 / 0.176557 (-0.040097)	0.221075 / 0.737135 (-0.516060)	0.140281 / 0.296338 (-0.156058)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.602401 / 0.215209 (0.387192)	6.124553 / 2.077655 (4.046898)	2.453141 / 1.504120 (0.949021)	2.038611 / 1.541195 (0.497416)	2.073611 / 1.468490 (0.605121)	0.938040 / 4.584777 (-3.646737)	5.755972 / 3.745712 (2.010260)	4.450935 / 5.269862 (-0.818926)	2.337219 / 4.565676 (-2.228457)	0.107118 / 0.424275 (-0.317157)	0.015201 / 0.007607 (0.007594)	0.785833 / 0.226044 (0.559788)	7.732984 / 2.268929 (5.464055)	3.236892 / 55.444624 (-52.207733)	2.696402 / 6.876477 (-4.180074)	2.805036 / 2.142072 (0.662964)	1.108612 / 4.805227 (-3.696616)	0.221067 / 6.500664 (-6.279597)	0.085538 / 0.075469 (0.010068)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.600311 / 1.841788 (-0.241476)	18.528118 / 8.074308 (10.453810)	21.107199 / 10.191392 (10.915807)	0.219489 / 0.680424 (-0.460934)	0.028927 / 0.534201 (-0.505274)	0.503446 / 0.579283 (-0.075837)	0.619833 / 0.434364 (0.185469)	0.582454 / 0.540337 (0.042117)	0.709154 / 1.386936 (-0.677782)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008516 / 0.011353 (-0.002837)	0.006090 / 0.011008 (-0.004918)	0.104574 / 0.038508 (0.066066)	0.042676 / 0.023109 (0.019566)	0.458623 / 0.275898 (0.182725)	0.568479 / 0.323480 (0.244999)	0.008374 / 0.007986 (0.000389)	0.004677 / 0.004328 (0.000349)	0.105946 / 0.004250 (0.101695)	0.055256 / 0.037052 (0.018204)	0.511036 / 0.258489 (0.252547)	0.598383 / 0.293841 (0.304542)	0.043612 / 0.128546 (-0.084934)	0.014707 / 0.075646 (-0.060940)	0.116350 / 0.419271 (-0.302921)	0.061413 / 0.043533 (0.017880)	0.477785 / 0.255139 (0.222646)	0.542643 / 0.283200 (0.259443)	0.120431 / 0.141683 (-0.021252)	1.994083 / 1.452155 (0.541928)	2.100600 / 1.492716 (0.607883)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.298480 / 0.018006 (0.280474)	0.601921 / 0.000490 (0.601432)	0.000445 / 0.000200 (0.000245)	0.000086 / 0.000054 (0.000032)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034784 / 0.037411 (-0.002627)	0.133555 / 0.014526 (0.119029)	0.138541 / 0.176557 (-0.038015)	0.203114 / 0.737135 (-0.534021)	0.153477 / 0.296338 (-0.142861)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.780484 / 0.215209 (0.565275)	7.150876 / 2.077655 (5.073222)	3.168590 / 1.504120 (1.664470)	2.698746 / 1.541195 (1.157552)	2.695678 / 1.468490 (1.227188)	1.037706 / 4.584777 (-3.547071)	5.672631 / 3.745712 (1.926918)	2.798137 / 5.269862 (-2.471725)	1.738588 / 4.565676 (-2.827088)	0.111160 / 0.424275 (-0.313115)	0.013878 / 0.007607 (0.006271)	0.800191 / 0.226044 (0.574146)	8.546676 / 2.268929 (6.277748)	4.116852 / 55.444624 (-51.327773)	3.331271 / 6.876477 (-3.545206)	3.307410 / 2.142072 (1.165337)	1.191019 / 4.805227 (-3.614208)	0.248953 / 6.500664 (-6.251711)	0.086632 / 0.075469 (0.011162)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.795057 / 1.841788 (-0.046730)	18.038785 / 8.074308 (9.964476)	21.865566 / 10.191392 (11.674174)	0.211058 / 0.680424 (-0.469366)	0.026956 / 0.534201 (-0.507245)	0.518855 / 0.579283 (-0.060428)	0.618105 / 0.434364 (0.183741)	0.569227 / 0.540337 (0.028889)	0.705431 / 1.386936 (-0.681505)

github-actions · 2023-06-02T06:06:37Z


PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008900 / 0.011353 (-0.002453)	0.005726 / 0.011008 (-0.005283)	0.131747 / 0.038508 (0.093239)	0.040585 / 0.023109 (0.017476)	0.420531 / 0.275898 (0.144633)	0.459430 / 0.323480 (0.135950)	0.007642 / 0.007986 (-0.000344)	0.006750 / 0.004328 (0.002421)	0.099147 / 0.004250 (0.094897)	0.055852 / 0.037052 (0.018799)	0.423653 / 0.258489 (0.165164)	0.453304 / 0.293841 (0.159463)	0.045247 / 0.128546 (-0.083300)	0.016034 / 0.075646 (-0.059612)	0.443115 / 0.419271 (0.023843)	0.078853 / 0.043533 (0.035320)	0.417508 / 0.255139 (0.162369)	0.440936 / 0.283200 (0.157736)	0.115603 / 0.141683 (-0.026080)	1.844610 / 1.452155 (0.392456)	1.998497 / 1.492716 (0.505781)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.272622 / 0.018006 (0.254616)	0.598045 / 0.000490 (0.597556)	0.007088 / 0.000200 (0.006888)	0.000159 / 0.000054 (0.000105)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032976 / 0.037411 (-0.004436)	0.143970 / 0.014526 (0.129444)	0.142172 / 0.176557 (-0.034384)	0.216747 / 0.737135 (-0.520389)	0.146004 / 0.296338 (-0.150334)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.687507 / 0.215209 (0.472298)	6.549524 / 2.077655 (4.471870)	2.924142 / 1.504120 (1.420022)	2.504471 / 1.541195 (0.963277)	2.496280 / 1.468490 (1.027790)	0.959054 / 4.584777 (-3.625723)	5.851742 / 3.745712 (2.106030)	4.983357 / 5.269862 (-0.286504)	2.627403 / 4.565676 (-1.938274)	0.112955 / 0.424275 (-0.311320)	0.016206 / 0.007607 (0.008599)	0.819158 / 0.226044 (0.593114)	8.416949 / 2.268929 (6.148020)	3.776765 / 55.444624 (-51.667859)	3.002397 / 6.876477 (-3.874080)	3.158852 / 2.142072 (1.016779)	1.197099 / 4.805227 (-3.608129)	0.280654 / 6.500664 (-6.220010)	0.099471 / 0.075469 (0.024002)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.687007 / 1.841788 (-0.154781)	19.411976 / 8.074308 (11.337668)	22.053482 / 10.191392 (11.862090)	0.228038 / 0.680424 (-0.452386)	0.028226 / 0.534201 (-0.505975)	0.527695 / 0.579283 (-0.051588)	0.635911 / 0.434364 (0.201547)	0.618205 / 0.540337 (0.077868)	0.735164 / 1.386936 (-0.651772)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009450 / 0.011353 (-0.001903)	0.006566 / 0.011008 (-0.004442)	0.108919 / 0.038508 (0.070411)	0.050010 / 0.023109 (0.026900)	0.505168 / 0.275898 (0.229270)	0.552190 / 0.323480 (0.228710)	0.007569 / 0.007986 (-0.000417)	0.006807 / 0.004328 (0.002478)	0.116621 / 0.004250 (0.112371)	0.060374 / 0.037052 (0.023321)	0.515165 / 0.258489 (0.256676)	0.572125 / 0.293841 (0.278284)	0.046561 / 0.128546 (-0.081986)	0.016159 / 0.075646 (-0.059487)	0.114568 / 0.419271 (-0.304704)	0.064689 / 0.043533 (0.021157)	0.497870 / 0.255139 (0.242731)	0.567332 / 0.283200 (0.284132)	0.126254 / 0.141683 (-0.015429)	1.954074 / 1.452155 (0.501919)	2.057682 / 1.492716 (0.564966)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.013857 / 0.018006 (-0.004149)	0.601561 / 0.000490 (0.601071)	0.002897 / 0.000200 (0.002697)	0.000108 / 0.000054 (0.000053)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038480 / 0.037411 (0.001069)	0.142480 / 0.014526 (0.127954)	0.160479 / 0.176557 (-0.016077)	0.217942 / 0.737135 (-0.519194)	0.159908 / 0.296338 (-0.136431)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.697926 / 0.215209 (0.482717)	6.869754 / 2.077655 (4.792100)	3.125463 / 1.504120 (1.621343)	2.729123 / 1.541195 (1.187928)	2.855747 / 1.468490 (1.387257)	1.015345 / 4.584777 (-3.569432)	5.839176 / 3.745712 (2.093463)	5.019678 / 5.269862 (-0.250184)	2.080489 / 4.565676 (-2.485187)	0.118884 / 0.424275 (-0.305391)	0.021381 / 0.007607 (0.013774)	0.877847 / 0.226044 (0.651803)	8.714561 / 2.268929 (6.445633)	3.933399 / 55.444624 (-51.511226)	3.281809 / 6.876477 (-3.594668)	3.330342 / 2.142072 (1.188269)	1.235005 / 4.805227 (-3.570222)	0.239686 / 6.500664 (-6.260978)	0.093546 / 0.075469 (0.018077)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.787916 / 1.841788 (-0.053872)	20.094828 / 8.074308 (12.020520)	22.902101 / 10.191392 (12.710709)	0.249315 / 0.680424 (-0.431109)	0.028058 / 0.534201 (-0.506143)	0.524960 / 0.579283 (-0.054323)	0.643881 / 0.434364 (0.209517)	0.621203 / 0.540337 (0.080866)	0.723337 / 1.386936 (-0.663599)

Use split_generator.split_info if exception with self.info.splits

28c624e

albertvillanova changed the title ~~Make prepare_split more robust if errors in metada dataset_info splits~~ Make prepare_split more robust if errors in metadata dataset_info splits May 26, 2023

albertvillanova merged commit 074925b into huggingface:main Jun 1, 2023

albertvillanova deleted the fix-5895 branch June 1, 2023 13:39
