Deterministic set hash #6318

lhoestq · 2023-10-19T12:19:13Z

Sort the items in a set by their datasets.fingerprint.Hasher.hash value to get a deterministic hash for sets.

This is useful for getting deterministic hashes of tokenizers that use a trie built on Python sets.

Reported in #3847.
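For illustration, the idea can be sketched as follows. This is a minimal sketch and not the code merged in this PR: `item_hash` is a hypothetical stand-in for datasets.fingerprint.Hasher.hash, built here from the standard library's `hashlib` and `pickle`, and `deterministic_set_hash` is an assumed name. The iteration order of a Python set of strings can differ between interpreter runs because string hashing is randomized per process, so serializing a set directly may yield different bytes each time. Sorting by a content-based hash removes that dependency.

```python
import hashlib
import pickle


def item_hash(value) -> str:
    """Hypothetical stand-in for datasets.fingerprint.Hasher.hash.

    Hashes the pickled bytes of a value, so the result depends on the
    value's content rather than on Python's per-process hash seed.
    """
    return hashlib.sha256(pickle.dumps(value)).hexdigest()


def deterministic_set_hash(items) -> str:
    """Hash a set independently of its iteration order.

    The elements are sorted by their own content hash first, so the
    sequence that gets hashed is the same on every run. Sorting by hash
    also works for elements that cannot be compared with each other,
    such as a mix of ints and strings.
    """
    ordered = sorted(items, key=item_hash)
    return item_hash(ordered)


# Two sets with the same elements, built in different orders,
# produce the same hash.
first = deterministic_set_hash({"hello", "world", "!"})
second = deterministic_set_hash(set(["!", "world", "hello"]))
assert first == second
```

The sketch only handles flat sets whose elements pickle deterministically. Nested sets would need the same treatment applied recursively.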

HuggingFaceDocBuilderDev · 2023-10-19T12:25:36Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-10-19T12:28:24Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006827 / 0.011353 (-0.004526)	0.004468 / 0.011008 (-0.006540)	0.088687 / 0.038508 (0.050179)	0.072560 / 0.023109 (0.049451)	0.333421 / 0.275898 (0.057523)	0.374977 / 0.323480 (0.051497)	0.005829 / 0.007986 (-0.002156)	0.003284 / 0.004328 (-0.001045)	0.068929 / 0.004250 (0.064678)	0.057212 / 0.037052 (0.020160)	0.328911 / 0.258489 (0.070422)	0.389107 / 0.293841 (0.095266)	0.033518 / 0.128546 (-0.095029)	0.009919 / 0.075646 (-0.065728)	0.308100 / 0.419271 (-0.111171)	0.059380 / 0.043533 (0.015847)	0.345587 / 0.255139 (0.090448)	0.353703 / 0.283200 (0.070503)	0.026454 / 0.141683 (-0.115229)	1.573309 / 1.452155 (0.121155)	1.663812 / 1.492716 (0.171095)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.255081 / 0.018006 (0.237075)	0.472613 / 0.000490 (0.472123)	0.016120 / 0.000200 (0.015920)	0.000383 / 0.000054 (0.000328)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028219 / 0.037411 (-0.009192)	0.086600 / 0.014526 (0.072074)	0.099484 / 0.176557 (-0.077073)	0.154604 / 0.737135 (-0.582531)	0.099168 / 0.296338 (-0.197171)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.421703 / 0.215209 (0.206494)	4.188600 / 2.077655 (2.110945)	2.037575 / 1.504120 (0.533456)	1.843389 / 1.541195 (0.302194)	1.912554 / 1.468490 (0.444064)	0.517452 / 4.584777 (-4.067325)	3.838002 / 3.745712 (0.092290)	3.698899 / 5.269862 (-1.570963)	2.175393 / 4.565676 (-2.390283)	0.066059 / 0.424275 (-0.358216)	0.008455 / 0.007607 (0.000848)	0.506813 / 0.226044 (0.280768)	4.826994 / 2.268929 (2.558066)	2.544437 / 55.444624 (-52.900187)	2.164938 / 6.876477 (-4.711539)	2.171725 / 2.142072 (0.029652)	0.603757 / 4.805227 (-4.201470)	0.149113 / 6.500664 (-6.351551)	0.065093 / 0.075469 (-0.010376)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.366887 / 1.841788 (-0.474901)	20.508089 / 8.074308 (12.433780)	14.836531 / 10.191392 (4.645139)	0.167418 / 0.680424 (-0.513006)	0.019707 / 0.534201 (-0.514494)	0.409897 / 0.579283 (-0.169387)	0.439412 / 0.434364 (0.005048)	0.495784 / 0.540337 (-0.044553)	0.685367 / 1.386936 (-0.701569)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007604 / 0.011353 (-0.003749)	0.004368 / 0.011008 (-0.006640)	0.072628 / 0.038508 (0.034120)	0.084187 / 0.023109 (0.061077)	0.461396 / 0.275898 (0.185498)	0.481429 / 0.323480 (0.157949)	0.005894 / 0.007986 (-0.002092)	0.003472 / 0.004328 (-0.000857)	0.068717 / 0.004250 (0.064466)	0.061066 / 0.037052 (0.024014)	0.464217 / 0.258489 (0.205728)	0.498061 / 0.293841 (0.204220)	0.035458 / 0.128546 (-0.093089)	0.009474 / 0.075646 (-0.066173)	0.079633 / 0.419271 (-0.339639)	0.053966 / 0.043533 (0.010433)	0.454911 / 0.255139 (0.199772)	0.470837 / 0.283200 (0.187637)	0.026358 / 0.141683 (-0.115325)	1.665131 / 1.452155 (0.212976)	1.730365 / 1.492716 (0.237648)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.234810 / 0.018006 (0.216804)	0.453672 / 0.000490 (0.453183)	0.004620 / 0.000200 (0.004420)	0.000119 / 0.000054 (0.000064)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.035310 / 0.037411 (-0.002101)	0.100379 / 0.014526 (0.085853)	0.118802 / 0.176557 (-0.057754)	0.173853 / 0.737135 (-0.563282)	0.115714 / 0.296338 (-0.180624)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.466797 / 0.215209 (0.251588)	4.698324 / 2.077655 (2.620670)	2.446897 / 1.504120 (0.942777)	2.277346 / 1.541195 (0.736151)	2.347211 / 1.468490 (0.878721)	0.514377 / 4.584777 (-4.070400)	3.931269 / 3.745712 (0.185557)	3.573575 / 5.269862 (-1.696286)	2.208122 / 4.565676 (-2.357554)	0.061081 / 0.424275 (-0.363194)	0.007803 / 0.007607 (0.000196)	0.544376 / 0.226044 (0.318332)	5.440003 / 2.268929 (3.171074)	3.012559 / 55.444624 (-52.432065)	2.617286 / 6.876477 (-4.259191)	2.863978 / 2.142072 (0.721906)	0.610024 / 4.805227 (-4.195203)	0.133643 / 6.500664 (-6.367021)	0.064766 / 0.075469 (-0.010703)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.465225 / 1.841788 (-0.376563)	21.308351 / 8.074308 (13.234043)	15.176634 / 10.191392 (4.985242)	0.172701 / 0.680424 (-0.507723)	0.020345 / 0.534201 (-0.513855)	0.433923 / 0.579283 (-0.145360)	0.450183 / 0.434364 (0.015819)	0.514048 / 0.540337 (-0.026289)	0.736302 / 1.386936 (-0.650634)

mariosasko

Thanks, LGTM!

github-actions · 2023-10-19T16:27:20Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008305 / 0.011353 (-0.003048)	0.006007 / 0.011008 (-0.005001)	0.103521 / 0.038508 (0.065013)	0.075776 / 0.023109 (0.052666)	0.378888 / 0.275898 (0.102990)	0.405245 / 0.323480 (0.081765)	0.004596 / 0.007986 (-0.003390)	0.003687 / 0.004328 (-0.000641)	0.079043 / 0.004250 (0.074792)	0.055895 / 0.037052 (0.018843)	0.406565 / 0.258489 (0.148076)	0.433869 / 0.293841 (0.140028)	0.045321 / 0.128546 (-0.083226)	0.014317 / 0.075646 (-0.061329)	0.345312 / 0.419271 (-0.073960)	0.064485 / 0.043533 (0.020953)	0.381744 / 0.255139 (0.126605)	0.401162 / 0.283200 (0.117962)	0.035973 / 0.141683 (-0.105709)	1.829616 / 1.452155 (0.377461)	1.868487 / 1.492716 (0.375771)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.245432 / 0.018006 (0.227426)	0.494249 / 0.000490 (0.493759)	0.010878 / 0.000200 (0.010678)	0.000492 / 0.000054 (0.000437)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032778 / 0.037411 (-0.004633)	0.103418 / 0.014526 (0.088892)	0.108010 / 0.176557 (-0.068547)	0.176477 / 0.737135 (-0.560658)	0.107732 / 0.296338 (-0.188606)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.572471 / 0.215209 (0.357262)	5.647039 / 2.077655 (3.569384)	2.385069 / 1.504120 (0.880949)	2.048928 / 1.541195 (0.507733)	2.108538 / 1.468490 (0.640048)	0.861436 / 4.584777 (-3.723341)	4.933452 / 3.745712 (1.187739)	4.735219 / 5.269862 (-0.534642)	2.926971 / 4.565676 (-1.638705)	0.097687 / 0.424275 (-0.326588)	0.008346 / 0.007607 (0.000739)	0.677754 / 0.226044 (0.451709)	6.798433 / 2.268929 (4.529504)	3.129862 / 55.444624 (-52.314762)	2.454033 / 6.876477 (-4.422444)	2.464590 / 2.142072 (0.322517)	1.034497 / 4.805227 (-3.770730)	0.205753 / 6.500664 (-6.294911)	0.076618 / 0.075469 (0.001149)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.617569 / 1.841788 (-0.224219)	22.091489 / 8.074308 (14.017181)	20.406312 / 10.191392 (10.214920)	0.222012 / 0.680424 (-0.458411)	0.027787 / 0.534201 (-0.506414)	0.441669 / 0.579283 (-0.137615)	0.564773 / 0.434364 (0.130409)	0.510389 / 0.540337 (-0.029948)	0.753672 / 1.386936 (-0.633264)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.011107 / 0.011353 (-0.000246)	0.004973 / 0.011008 (-0.006035)	0.078331 / 0.038508 (0.039823)	0.083964 / 0.023109 (0.060855)	0.518980 / 0.275898 (0.243082)	0.528264 / 0.323480 (0.204784)	0.007452 / 0.007986 (-0.000534)	0.003931 / 0.004328 (-0.000397)	0.079724 / 0.004250 (0.075474)	0.061739 / 0.037052 (0.024686)	0.517804 / 0.258489 (0.259315)	0.582764 / 0.293841 (0.288923)	0.049674 / 0.128546 (-0.078873)	0.014540 / 0.075646 (-0.061106)	0.093130 / 0.419271 (-0.326141)	0.060647 / 0.043533 (0.017114)	0.492628 / 0.255139 (0.237489)	0.549761 / 0.283200 (0.266562)	0.034313 / 0.141683 (-0.107369)	1.824574 / 1.452155 (0.372419)	2.013664 / 1.492716 (0.520947)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.231335 / 0.018006 (0.213329)	0.521477 / 0.000490 (0.520987)	0.011314 / 0.000200 (0.011114)	0.000397 / 0.000054 (0.000343)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033303 / 0.037411 (-0.004108)	0.098238 / 0.014526 (0.083712)	0.119527 / 0.176557 (-0.057030)	0.169163 / 0.737135 (-0.567972)	0.114536 / 0.296338 (-0.181803)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.578401 / 0.215209 (0.363191)	5.966438 / 2.077655 (3.888783)	2.646370 / 1.504120 (1.142250)	2.361833 / 1.541195 (0.820638)	2.476573 / 1.468490 (1.008083)	0.777411 / 4.584777 (-3.807366)	4.811070 / 3.745712 (1.065357)	4.314221 / 5.269862 (-0.955641)	2.743317 / 4.565676 (-1.822359)	0.110394 / 0.424275 (-0.313881)	0.008333 / 0.007607 (0.000726)	0.729588 / 0.226044 (0.503543)	7.743226 / 2.268929 (5.474298)	3.606294 / 55.444624 (-51.838330)	2.838069 / 6.876477 (-4.038408)	3.087494 / 2.142072 (0.945421)	1.053341 / 4.805227 (-3.751886)	0.205105 / 6.500664 (-6.295559)	0.075204 / 0.075469 (-0.000265)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.561959 / 1.841788 (-0.279829)	21.407849 / 8.074308 (13.333541)	19.084263 / 10.191392 (8.892871)	0.226129 / 0.680424 (-0.454295)	0.029695 / 0.534201 (-0.504506)	0.427035 / 0.579283 (-0.152248)	0.565353 / 0.434364 (0.130989)	0.526789 / 0.540337 (-0.013548)	0.734820 / 1.386936 (-0.652116)

lhoestq added 2 commits October 19, 2023 14:18

deterministic set hash

0b13f87

tests

7f1a7d6

lhoestq mentioned this pull request Oct 19, 2023

Datasets' cache not re-used #3847

Open

lhoestq marked this pull request as ready for review October 19, 2023 15:51

lhoestq requested a review from mariosasko October 19, 2023 15:53

mariosasko approved these changes Oct 19, 2023

lhoestq merged commit 5b52536 into main Oct 19, 2023
13 checks passed

lhoestq deleted the deterministic-set-hash branch October 19, 2023 16:16

albertvillanova linked an issue Oct 20, 2023 that may be closed by this pull request

Datasets' cache not re-used #3847

Open

enze5088 mentioned this pull request Nov 1, 2023

Multi process map did not load cache file correctly #6369

Closed
