Add pre-commit config yaml file to enable automatic code formatting #5561

polinaeterna · 2023-02-21T17:35:07Z

@huggingface/datasets do you think this would be useful? Motivation: sometimes around 30% of the commits in a PR are "fix: style" commits :)

If so, I still need to double-check the config, but it works as expected for me locally.
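
For readers unfamiliar with pre-commit, a `.pre-commit-config.yaml` generally has the shape sketched below. This is not the file added in this PR: the choice of a single black hook and the pinned `rev` are assumptions for illustration (the thread only confirms that black is part of the project's formatting).

```yaml
# Illustrative sketch only, not the config merged in this PR.
repos:
  - repo: https://github.com/psf/black
    rev: 23.1.0          # assumed version pin; use whatever the project pins
    hooks:
      - id: black        # reformats staged Python files before each commit
```

With a file like this at the repository root, `pip install pre-commit` followed by `pre-commit install` registers the git hook, and `pre-commit run --all-files` applies the hooks to the whole repository once. Contributors who skip `pre-commit install` are unaffected, which is what keeps the setup optional.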

HuggingFaceDocBuilderDev · 2023-02-21T17:39:22Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Sounds good to me. Maybe @albertvillanova is also familiar with this?

nateraw

LGTM! 🤗

My only gripe with pre-commit is when you're forced to use it. With this being optional, it's a super nice addition for folks who prefer it to the Makefile :). Thank you for adding it!

Skylion007 · 2023-02-21T19:35:47Z

Better yet, have someone enable pre-commit CI (https://pre-commit.ci/); it will apply the pre-commit fixes to the PR automatically as an additional commit.
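
As a rough sketch of what that suggestion would involve: once the pre-commit.ci app is enabled on the repository, its behaviour can be tuned through a top-level `ci:` section in the same `.pre-commit-config.yaml`. The values below are assumptions for illustration, and this was not set up in this PR.

```yaml
# Hypothetical pre-commit.ci settings; not part of this PR.
ci:
  autofix_prs: true                                   # push fixes to the PR as an extra commit
  autofix_commit_msg: "style: apply pre-commit fixes" # assumed commit message
  autoupdate_schedule: monthly                        # how often hook versions are bumped
```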

.pre-commit-config.yaml

polinaeterna · 2023-02-23T13:43:31Z

@Skylion007 hi! I agree with @nateraw here: I'd rather not force anyone to use pre-commit, so I'm not setting it up in the CI for now. As for end-of-file fixing, that is currently handled by black.

github-actions · 2023-02-23T18:30:21Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008704 / 0.011353 (-0.002649)	0.004448 / 0.011008 (-0.006560)	0.099530 / 0.038508 (0.061022)	0.029739 / 0.023109 (0.006629)	0.329267 / 0.275898 (0.053369)	0.368805 / 0.323480 (0.045325)	0.006852 / 0.007986 (-0.001133)	0.004575 / 0.004328 (0.000246)	0.076838 / 0.004250 (0.072588)	0.033885 / 0.037052 (-0.003167)	0.336340 / 0.258489 (0.077851)	0.384880 / 0.293841 (0.091039)	0.034051 / 0.128546 (-0.094495)	0.011638 / 0.075646 (-0.064009)	0.321650 / 0.419271 (-0.097622)	0.041202 / 0.043533 (-0.002330)	0.330841 / 0.255139 (0.075702)	0.361329 / 0.283200 (0.078130)	0.084864 / 0.141683 (-0.056819)	1.454005 / 1.452155 (0.001850)	1.542167 / 1.492716 (0.049451)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.196207 / 0.018006 (0.178200)	0.400675 / 0.000490 (0.400185)	0.000403 / 0.000200 (0.000203)	0.000059 / 0.000054 (0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022694 / 0.037411 (-0.014717)	0.095139 / 0.014526 (0.080613)	0.104129 / 0.176557 (-0.072427)	0.168688 / 0.737135 (-0.568447)	0.109243 / 0.296338 (-0.187096)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.427520 / 0.215209 (0.212311)	4.237726 / 2.077655 (2.160071)	2.191887 / 1.504120 (0.687767)	1.987750 / 1.541195 (0.446555)	1.996540 / 1.468490 (0.528050)	0.696416 / 4.584777 (-3.888361)	3.454536 / 3.745712 (-0.291176)	2.023600 / 5.269862 (-3.246261)	1.336394 / 4.565676 (-3.229282)	0.082933 / 0.424275 (-0.341342)	0.012572 / 0.007607 (0.004965)	0.534330 / 0.226044 (0.308285)	5.347588 / 2.268929 (3.078659)	2.640397 / 55.444624 (-52.804228)	2.338266 / 6.876477 (-4.538211)	2.431969 / 2.142072 (0.289897)	0.821335 / 4.805227 (-3.983893)	0.151905 / 6.500664 (-6.348759)	0.067983 / 0.075469 (-0.007486)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.228841 / 1.841788 (-0.612947)	13.660437 / 8.074308 (5.586128)	13.729442 / 10.191392 (3.538050)	0.165835 / 0.680424 (-0.514589)	0.028753 / 0.534201 (-0.505448)	0.400143 / 0.579283 (-0.179140)	0.403714 / 0.434364 (-0.030650)	0.492168 / 0.540337 (-0.048170)	0.581151 / 1.386936 (-0.805785)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006289 / 0.011353 (-0.005064)	0.004419 / 0.011008 (-0.006589)	0.077220 / 0.038508 (0.038712)	0.027170 / 0.023109 (0.004060)	0.344988 / 0.275898 (0.069090)	0.374150 / 0.323480 (0.050670)	0.004842 / 0.007986 (-0.003144)	0.003289 / 0.004328 (-0.001039)	0.076200 / 0.004250 (0.071950)	0.036287 / 0.037052 (-0.000766)	0.345764 / 0.258489 (0.087275)	0.387439 / 0.293841 (0.093599)	0.031547 / 0.128546 (-0.096999)	0.011586 / 0.075646 (-0.064060)	0.086599 / 0.419271 (-0.332672)	0.042338 / 0.043533 (-0.001195)	0.355384 / 0.255139 (0.100246)	0.369474 / 0.283200 (0.086275)	0.090945 / 0.141683 (-0.050738)	1.488632 / 1.452155 (0.036477)	1.554606 / 1.492716 (0.061890)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.212962 / 0.018006 (0.194956)	0.399647 / 0.000490 (0.399157)	0.003055 / 0.000200 (0.002856)	0.000083 / 0.000054 (0.000029)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024349 / 0.037411 (-0.013062)	0.100342 / 0.014526 (0.085817)	0.105657 / 0.176557 (-0.070899)	0.175139 / 0.737135 (-0.561997)	0.110014 / 0.296338 (-0.186324)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434785 / 0.215209 (0.219575)	4.346950 / 2.077655 (2.269295)	2.045411 / 1.504120 (0.541291)	1.844258 / 1.541195 (0.303064)	1.889503 / 1.468490 (0.421013)	0.704530 / 4.584777 (-3.880247)	3.362435 / 3.745712 (-0.383277)	2.797205 / 5.269862 (-2.472656)	1.504431 / 4.565676 (-3.061245)	0.083331 / 0.424275 (-0.340945)	0.012274 / 0.007607 (0.004666)	0.531123 / 0.226044 (0.305078)	5.322588 / 2.268929 (3.053660)	2.483875 / 55.444624 (-52.960750)	2.147218 / 6.876477 (-4.729258)	2.164024 / 2.142072 (0.021952)	0.807191 / 4.805227 (-3.998036)	0.151189 / 6.500664 (-6.349475)	0.068027 / 0.075469 (-0.007442)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.316001 / 1.841788 (-0.525787)	13.892785 / 8.074308 (5.818477)	13.485982 / 10.191392 (3.294590)	0.138904 / 0.680424 (-0.541520)	0.016748 / 0.534201 (-0.517453)	0.379840 / 0.579283 (-0.199443)	0.384854 / 0.434364 (-0.049510)	0.464275 / 0.540337 (-0.076063)	0.553622 / 1.386936 (-0.833314)

github-actions · 2023-02-27T17:07:27Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009179 / 0.011353 (-0.002174)	0.005080 / 0.011008 (-0.005929)	0.099061 / 0.038508 (0.060553)	0.035252 / 0.023109 (0.012143)	0.293496 / 0.275898 (0.017598)	0.360365 / 0.323480 (0.036886)	0.007757 / 0.007986 (-0.000229)	0.003985 / 0.004328 (-0.000343)	0.076021 / 0.004250 (0.071771)	0.042286 / 0.037052 (0.005233)	0.316542 / 0.258489 (0.058053)	0.341711 / 0.293841 (0.047870)	0.037970 / 0.128546 (-0.090576)	0.011977 / 0.075646 (-0.063670)	0.333341 / 0.419271 (-0.085931)	0.049211 / 0.043533 (0.005678)	0.297401 / 0.255139 (0.042262)	0.313424 / 0.283200 (0.030224)	0.105719 / 0.141683 (-0.035964)	1.487879 / 1.452155 (0.035724)	1.529785 / 1.492716 (0.037068)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.201062 / 0.018006 (0.183056)	0.438024 / 0.000490 (0.437534)	0.002129 / 0.000200 (0.001929)	0.000083 / 0.000054 (0.000028)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026422 / 0.037411 (-0.010989)	0.104863 / 0.014526 (0.090337)	0.114934 / 0.176557 (-0.061623)	0.179173 / 0.737135 (-0.557962)	0.119734 / 0.296338 (-0.176604)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.397195 / 0.215209 (0.181986)	3.959945 / 2.077655 (1.882290)	1.794059 / 1.504120 (0.289939)	1.606814 / 1.541195 (0.065619)	1.674681 / 1.468490 (0.206191)	0.680130 / 4.584777 (-3.904646)	3.742730 / 3.745712 (-0.002982)	2.021793 / 5.269862 (-3.248069)	1.322726 / 4.565676 (-3.242950)	0.084519 / 0.424275 (-0.339756)	0.012012 / 0.007607 (0.004405)	0.510076 / 0.226044 (0.284032)	5.084163 / 2.268929 (2.815234)	2.241032 / 55.444624 (-53.203592)	1.911936 / 6.876477 (-4.964540)	1.947992 / 2.142072 (-0.194080)	0.838779 / 4.805227 (-3.966448)	0.165103 / 6.500664 (-6.335561)	0.060722 / 0.075469 (-0.014747)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.180274 / 1.841788 (-0.661514)	14.285364 / 8.074308 (6.211056)	12.941205 / 10.191392 (2.749813)	0.153815 / 0.680424 (-0.526609)	0.028554 / 0.534201 (-0.505647)	0.441551 / 0.579283 (-0.137732)	0.434906 / 0.434364 (0.000542)	0.516120 / 0.540337 (-0.024217)	0.603062 / 1.386936 (-0.783874)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007287 / 0.011353 (-0.004066)	0.004998 / 0.011008 (-0.006010)	0.074997 / 0.038508 (0.036489)	0.033209 / 0.023109 (0.010100)	0.336836 / 0.275898 (0.060938)	0.365562 / 0.323480 (0.042082)	0.005739 / 0.007986 (-0.002246)	0.003942 / 0.004328 (-0.000387)	0.074681 / 0.004250 (0.070430)	0.049530 / 0.037052 (0.012478)	0.335642 / 0.258489 (0.077153)	0.388874 / 0.293841 (0.095033)	0.037198 / 0.128546 (-0.091349)	0.011983 / 0.075646 (-0.063664)	0.087601 / 0.419271 (-0.331671)	0.053761 / 0.043533 (0.010228)	0.334142 / 0.255139 (0.079003)	0.351348 / 0.283200 (0.068148)	0.107462 / 0.141683 (-0.034221)	1.497015 / 1.452155 (0.044860)	1.608287 / 1.492716 (0.115571)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.255395 / 0.018006 (0.237389)	0.439141 / 0.000490 (0.438651)	0.021391 / 0.000200 (0.021191)	0.000230 / 0.000054 (0.000176)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028331 / 0.037411 (-0.009080)	0.108744 / 0.014526 (0.094218)	0.118201 / 0.176557 (-0.058355)	0.189556 / 0.737135 (-0.547579)	0.123112 / 0.296338 (-0.173226)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.431394 / 0.215209 (0.216185)	4.296121 / 2.077655 (2.218466)	2.126371 / 1.504120 (0.622251)	1.978178 / 1.541195 (0.436983)	2.082674 / 1.468490 (0.614184)	0.701789 / 4.584777 (-3.882988)	3.791495 / 3.745712 (0.045783)	2.115267 / 5.269862 (-3.154594)	1.342159 / 4.565676 (-3.223517)	0.088132 / 0.424275 (-0.336143)	0.011903 / 0.007607 (0.004295)	0.528398 / 0.226044 (0.302354)	5.270077 / 2.268929 (3.001148)	2.498860 / 55.444624 (-52.945765)	2.155515 / 6.876477 (-4.720962)	2.192866 / 2.142072 (0.050793)	0.859596 / 4.805227 (-3.945631)	0.170544 / 6.500664 (-6.330120)	0.063883 / 0.075469 (-0.011587)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.240679 / 1.841788 (-0.601109)	14.497379 / 8.074308 (6.423071)	12.881417 / 10.191392 (2.690025)	0.147295 / 0.680424 (-0.533129)	0.017465 / 0.534201 (-0.516736)	0.424695 / 0.579283 (-0.154588)	0.414929 / 0.434364 (-0.019435)	0.536079 / 0.540337 (-0.004259)	0.638245 / 1.386936 (-0.748691)

github-actions · 2023-02-28T15:37:22Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008806 / 0.011353 (-0.002547)	0.004712 / 0.011008 (-0.006297)	0.102383 / 0.038508 (0.063875)	0.030260 / 0.023109 (0.007151)	0.330175 / 0.275898 (0.054277)	0.376816 / 0.323480 (0.053337)	0.008065 / 0.007986 (0.000079)	0.003534 / 0.004328 (-0.000794)	0.078824 / 0.004250 (0.074573)	0.036704 / 0.037052 (-0.000349)	0.331848 / 0.258489 (0.073359)	0.351031 / 0.293841 (0.057190)	0.033406 / 0.128546 (-0.095140)	0.011543 / 0.075646 (-0.064103)	0.322114 / 0.419271 (-0.097157)	0.041249 / 0.043533 (-0.002284)	0.309413 / 0.255139 (0.054274)	0.329156 / 0.283200 (0.045956)	0.088636 / 0.141683 (-0.053047)	1.508226 / 1.452155 (0.056071)	1.557203 / 1.492716 (0.064487)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.196696 / 0.018006 (0.178690)	0.426360 / 0.000490 (0.425870)	0.001263 / 0.000200 (0.001064)	0.000079 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023747 / 0.037411 (-0.013664)	0.100756 / 0.014526 (0.086230)	0.105817 / 0.176557 (-0.070739)	0.172573 / 0.737135 (-0.564562)	0.110705 / 0.296338 (-0.185634)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.436913 / 0.215209 (0.221704)	4.365753 / 2.077655 (2.288099)	2.201346 / 1.504120 (0.697226)	1.978800 / 1.541195 (0.437605)	1.951585 / 1.468490 (0.483094)	0.699208 / 4.584777 (-3.885569)	3.381492 / 3.745712 (-0.364220)	2.966174 / 5.269862 (-2.303687)	1.487521 / 4.565676 (-3.078156)	0.082673 / 0.424275 (-0.341602)	0.012436 / 0.007607 (0.004829)	0.553276 / 0.226044 (0.327232)	5.554081 / 2.268929 (3.285153)	2.653286 / 55.444624 (-52.791339)	2.404788 / 6.876477 (-4.471689)	2.484610 / 2.142072 (0.342537)	0.817073 / 4.805227 (-3.988154)	0.151619 / 6.500664 (-6.349045)	0.068259 / 0.075469 (-0.007210)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.273481 / 1.841788 (-0.568306)	13.908825 / 8.074308 (5.834517)	13.106695 / 10.191392 (2.915303)	0.139609 / 0.680424 (-0.540815)	0.028425 / 0.534201 (-0.505776)	0.395626 / 0.579283 (-0.183657)	0.405526 / 0.434364 (-0.028838)	0.465628 / 0.540337 (-0.074709)	0.542824 / 1.386936 (-0.844112)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006821 / 0.011353 (-0.004532)	0.004570 / 0.011008 (-0.006438)	0.076568 / 0.038508 (0.038060)	0.028109 / 0.023109 (0.004999)	0.342768 / 0.275898 (0.066870)	0.390680 / 0.323480 (0.067200)	0.005056 / 0.007986 (-0.002930)	0.003359 / 0.004328 (-0.000970)	0.075835 / 0.004250 (0.071584)	0.038888 / 0.037052 (0.001836)	0.343489 / 0.258489 (0.085000)	0.400766 / 0.293841 (0.106925)	0.031816 / 0.128546 (-0.096730)	0.011637 / 0.075646 (-0.064009)	0.085474 / 0.419271 (-0.333797)	0.041740 / 0.043533 (-0.001793)	0.342501 / 0.255139 (0.087362)	0.377467 / 0.283200 (0.094267)	0.091532 / 0.141683 (-0.050151)	1.457368 / 1.452155 (0.005213)	1.537187 / 1.492716 (0.044471)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.187507 / 0.018006 (0.169501)	0.415706 / 0.000490 (0.415217)	0.001816 / 0.000200 (0.001616)	0.000072 / 0.000054 (0.000018)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026251 / 0.037411 (-0.011161)	0.106609 / 0.014526 (0.092083)	0.109822 / 0.176557 (-0.066735)	0.180462 / 0.737135 (-0.556674)	0.114647 / 0.296338 (-0.181691)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.438804 / 0.215209 (0.223595)	4.387960 / 2.077655 (2.310306)	2.056804 / 1.504120 (0.552684)	1.848584 / 1.541195 (0.307389)	1.939470 / 1.468490 (0.470980)	0.702539 / 4.584777 (-3.882238)	3.419535 / 3.745712 (-0.326177)	1.933889 / 5.269862 (-3.335973)	1.189631 / 4.565676 (-3.376045)	0.084105 / 0.424275 (-0.340170)	0.012520 / 0.007607 (0.004913)	0.538125 / 0.226044 (0.312081)	5.370000 / 2.268929 (3.101072)	2.497487 / 55.444624 (-52.947137)	2.156054 / 6.876477 (-4.720423)	2.225909 / 2.142072 (0.083837)	0.811456 / 4.805227 (-3.993771)	0.151461 / 6.500664 (-6.349203)	0.066940 / 0.075469 (-0.008530)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.301246 / 1.841788 (-0.540542)	14.459755 / 8.074308 (6.385447)	13.147151 / 10.191392 (2.955759)	0.129236 / 0.680424 (-0.551188)	0.016427 / 0.534201 (-0.517774)	0.380047 / 0.579283 (-0.199236)	0.392217 / 0.434364 (-0.042147)	0.470338 / 0.540337 (-0.069999)	0.559800 / 1.386936 (-0.827136)
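
Each cell in the benchmark tables above has the form `new / old (diff)`, where `diff` is `new - old`. The snippet below is a small sketch for checking that arithmetic and deriving a relative change; the helper names `parse_cell` and `relative_change` are hypothetical and not part of the benchmark tooling, and the sample cells are copied from the first `benchmark_array_xd.json` row.

```python
import re

# Matches one cell such as "0.008704 / 0.011353 (-0.002649)".
CELL = re.compile(r"^\s*([\d.]+)\s*/\s*([\d.]+)\s*\(\s*(-?[\d.]+)\s*\)\s*$")


def parse_cell(cell):
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognised benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def relative_change(new, old):
    """Fractional change of new versus old (negative means new is smaller)."""
    return (new - old) / old


# Two tab-separated cells copied from the first report in this thread.
row = "0.008704 / 0.011353 (-0.002649)\t0.004448 / 0.011008 (-0.006560)"

for cell in row.split("\t"):
    new, old, diff = parse_cell(cell)
    # The reported diff should equal new - old up to rounding in the report.
    assert abs((new - old) - diff) < 2e-6
    print(f"new={new:.6f} old={old:.6f} relative={relative_change(new, old):+.1%}")
```

Since this PR only adds a configuration file, differences between runs are more plausibly run-to-run noise than an effect of the change, though the thread itself does not say so.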

add precommit config

c2f8baf

lhoestq approved these changes Feb 21, 2023

lhoestq requested a review from albertvillanova February 21, 2023 17:54

nateraw approved these changes Feb 21, 2023

alvarobartt reviewed Feb 22, 2023

polinaeterna added 3 commits February 23, 2023 13:44

Merge branch 'huggingface:main' into add-precommit-config

cf9a151

skip running hooks if no matched files found

0891c39

fix instructions for using pre-commit

73ae01d

polinaeterna merged commit a940972 into huggingface:main Feb 23, 2023

polinaeterna deleted the add-precommit-config branch February 23, 2023 18:49
