Do not write index by default when exporting a dataset #5583

mariosasko · 2023-02-27T17:04:46Z

Ensures that all writers that use pandas for conversion (JSON, CSV, SQL) do not export the index by default (#5490 did this only for CSV).
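For context, here is a minimal sketch of the pandas behavior this PR works around. It uses plain pandas and the standard-library `sqlite3` module, not the `datasets` writers themselves; the DataFrame contents and the table names `with_index` and `without_index` are made up for illustration, and how the PR passes the default through to pandas is not shown here.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row frame standing in for one batch of a dataset.
df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})

# By default pandas writes the row index as an extra, unnamed leading column.
csv_header_with_index = df.to_csv().splitlines()[0]
# With index=False only the data columns are written.
csv_header_without_index = df.to_csv(index=False).splitlines()[0]

# The same default applies to SQL: an unnamed index becomes a column called "index".
con = sqlite3.connect(":memory:")
df.to_sql("with_index", con)
df.to_sql("without_index", con, index=False)

# PRAGMA table_info yields one row per column; the column name is at position 1.
sql_columns_with_index = [
    row[1] for row in con.execute("PRAGMA table_info(with_index)")
]
sql_columns_without_index = [
    row[1] for row in con.execute("PRAGMA table_info(without_index)")
]
con.close()

print(csv_header_with_index)      # ,text,label
print(csv_header_without_index)   # text,label
print(sql_columns_with_index)     # ['index', 'text', 'label']
print(sql_columns_without_index)  # ['text', 'label']
```

The extra column is rarely meaningful for an exported dataset, which is presumably why skipping it is the more convenient default.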

HuggingFaceDocBuilderDev · 2023-02-27T17:09:09Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-02-27T17:09:46Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009044 / 0.011353 (-0.002309)	0.004244 / 0.011008 (-0.006765)	0.106705 / 0.038508 (0.068197)	0.029779 / 0.023109 (0.006670)	0.289684 / 0.275898 (0.013786)	0.347100 / 0.323480 (0.023620)	0.007071 / 0.007986 (-0.000915)	0.003734 / 0.004328 (-0.000595)	0.077971 / 0.004250 (0.073720)	0.035323 / 0.037052 (-0.001730)	0.334520 / 0.258489 (0.076031)	0.375804 / 0.293841 (0.081964)	0.049211 / 0.128546 (-0.079335)	0.016992 / 0.075646 (-0.058654)	0.337208 / 0.419271 (-0.082064)	0.053700 / 0.043533 (0.010167)	0.295750 / 0.255139 (0.040611)	0.330157 / 0.283200 (0.046958)	0.097017 / 0.141683 (-0.044666)	1.379353 / 1.452155 (-0.072802)	1.402670 / 1.492716 (-0.090047)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.012685 / 0.018006 (-0.005321)	0.474541 / 0.000490 (0.474051)	0.006752 / 0.000200 (0.006552)	0.000097 / 0.000054 (0.000042)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025735 / 0.037411 (-0.011676)	0.092507 / 0.014526 (0.077982)	0.100275 / 0.176557 (-0.076281)	0.180359 / 0.737135 (-0.556777)	0.104312 / 0.296338 (-0.192026)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.456558 / 0.215209 (0.241349)	4.786667 / 2.077655 (2.709012)	1.873169 / 1.504120 (0.369050)	1.640935 / 1.541195 (0.099741)	1.614543 / 1.468490 (0.146053)	0.936144 / 4.584777 (-3.648633)	4.699886 / 3.745712 (0.954174)	2.398545 / 5.269862 (-2.871317)	1.642808 / 4.565676 (-2.922868)	0.124803 / 0.424275 (-0.299472)	0.011848 / 0.007607 (0.004241)	0.631684 / 0.226044 (0.405639)	6.096052 / 2.268929 (3.827124)	2.463052 / 55.444624 (-52.981572)	1.928551 / 6.876477 (-4.947926)	1.927790 / 2.142072 (-0.214283)	1.098912 / 4.805227 (-3.706315)	0.196343 / 6.500664 (-6.304321)	0.063296 / 0.075469 (-0.012173)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.255032 / 1.841788 (-0.586755)	13.853623 / 8.074308 (5.779315)	16.303280 / 10.191392 (6.111888)	0.227287 / 0.680424 (-0.453137)	0.037527 / 0.534201 (-0.496674)	0.449345 / 0.579283 (-0.129938)	0.522054 / 0.434364 (0.087690)	0.552848 / 0.540337 (0.012511)	0.642994 / 1.386936 (-0.743942)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008470 / 0.011353 (-0.002883)	0.005167 / 0.011008 (-0.005841)	0.077794 / 0.038508 (0.039286)	0.029228 / 0.023109 (0.006119)	0.340828 / 0.275898 (0.064930)	0.400170 / 0.323480 (0.076691)	0.005485 / 0.007986 (-0.002500)	0.003854 / 0.004328 (-0.000475)	0.077597 / 0.004250 (0.073346)	0.036519 / 0.037052 (-0.000533)	0.335522 / 0.258489 (0.077033)	0.412622 / 0.293841 (0.118781)	0.044587 / 0.128546 (-0.083959)	0.016024 / 0.075646 (-0.059623)	0.092312 / 0.419271 (-0.326960)	0.055660 / 0.043533 (0.012127)	0.343140 / 0.255139 (0.088001)	0.386403 / 0.283200 (0.103203)	0.098634 / 0.141683 (-0.043049)	1.326126 / 1.452155 (-0.126029)	1.430316 / 1.492716 (-0.062400)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.222807 / 0.018006 (0.204801)	0.473622 / 0.000490 (0.473132)	0.000376 / 0.000200 (0.000176)	0.000066 / 0.000054 (0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024599 / 0.037411 (-0.012813)	0.100743 / 0.014526 (0.086217)	0.112086 / 0.176557 (-0.064471)	0.198294 / 0.737135 (-0.538842)	0.111210 / 0.296338 (-0.185129)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.494120 / 0.215209 (0.278911)	5.117958 / 2.077655 (3.040303)	2.305131 / 1.504120 (0.801011)	2.015591 / 1.541195 (0.474396)	2.027284 / 1.468490 (0.558794)	1.014241 / 4.584777 (-3.570536)	4.738836 / 3.745712 (0.993124)	2.519718 / 5.269862 (-2.750143)	1.706379 / 4.565676 (-2.859298)	0.122452 / 0.424275 (-0.301824)	0.011500 / 0.007607 (0.003893)	0.632864 / 0.226044 (0.406820)	6.295457 / 2.268929 (4.026529)	2.824897 / 55.444624 (-52.619727)	2.324359 / 6.876477 (-4.552117)	2.281046 / 2.142072 (0.138974)	1.173570 / 4.805227 (-3.631657)	0.197195 / 6.500664 (-6.303469)	0.064845 / 0.075469 (-0.010624)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.273224 / 1.841788 (-0.568563)	14.531155 / 8.074308 (6.456847)	15.892176 / 10.191392 (5.700784)	0.208051 / 0.680424 (-0.472373)	0.023119 / 0.534201 (-0.511082)	0.422317 / 0.579283 (-0.156966)	0.519946 / 0.434364 (0.085582)	0.544517 / 0.540337 (0.004179)	0.605955 / 1.386936 (-0.780981)

lhoestq

nice !

github-actions · 2023-02-28T13:52:15Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010806 / 0.011353 (-0.000547)	0.005631 / 0.011008 (-0.005378)	0.113166 / 0.038508 (0.074657)	0.042980 / 0.023109 (0.019871)	0.344856 / 0.275898 (0.068958)	0.404417 / 0.323480 (0.080938)	0.012222 / 0.007986 (0.004236)	0.004470 / 0.004328 (0.000141)	0.088072 / 0.004250 (0.083822)	0.049815 / 0.037052 (0.012763)	0.366532 / 0.258489 (0.108043)	0.392558 / 0.293841 (0.098717)	0.045411 / 0.128546 (-0.083135)	0.014118 / 0.075646 (-0.061529)	0.392894 / 0.419271 (-0.026378)	0.067713 / 0.043533 (0.024181)	0.353013 / 0.255139 (0.097874)	0.378375 / 0.283200 (0.095175)	0.123686 / 0.141683 (-0.017996)	1.665272 / 1.452155 (0.213118)	1.748383 / 1.492716 (0.255667)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.011672 / 0.018006 (-0.006335)	0.481667 / 0.000490 (0.481178)	0.003644 / 0.000200 (0.003444)	0.000092 / 0.000054 (0.000037)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030436 / 0.037411 (-0.006976)	0.122577 / 0.014526 (0.108052)	0.135409 / 0.176557 (-0.041148)	0.220385 / 0.737135 (-0.516750)	0.143140 / 0.296338 (-0.153199)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.471146 / 0.215209 (0.255937)	4.645023 / 2.077655 (2.567368)	2.126783 / 1.504120 (0.622663)	1.907905 / 1.541195 (0.366710)	1.969561 / 1.468490 (0.501071)	0.798670 / 4.584777 (-3.786107)	4.394787 / 3.745712 (0.649075)	2.353535 / 5.269862 (-2.916327)	1.501013 / 4.565676 (-3.064664)	0.097472 / 0.424275 (-0.326803)	0.014015 / 0.007607 (0.006408)	0.589365 / 0.226044 (0.363320)	5.897331 / 2.268929 (3.628402)	2.656198 / 55.444624 (-52.788427)	2.256082 / 6.876477 (-4.620395)	2.271122 / 2.142072 (0.129050)	0.961566 / 4.805227 (-3.843661)	0.188303 / 6.500664 (-6.312361)	0.073258 / 0.075469 (-0.002211)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.445266 / 1.841788 (-0.396522)	16.876710 / 8.074308 (8.802402)	16.004287 / 10.191392 (5.812895)	0.212252 / 0.680424 (-0.468172)	0.033186 / 0.534201 (-0.501015)	0.520564 / 0.579283 (-0.058719)	0.516865 / 0.434364 (0.082501)	0.638482 / 0.540337 (0.098144)	0.761959 / 1.386936 (-0.624977)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008101 / 0.011353 (-0.003252)	0.005512 / 0.011008 (-0.005497)	0.086138 / 0.038508 (0.047630)	0.038605 / 0.023109 (0.015496)	0.413082 / 0.275898 (0.137184)	0.444016 / 0.323480 (0.120536)	0.006196 / 0.007986 (-0.001790)	0.005736 / 0.004328 (0.001408)	0.086938 / 0.004250 (0.082688)	0.052307 / 0.037052 (0.015255)	0.415206 / 0.258489 (0.156717)	0.481510 / 0.293841 (0.187669)	0.041469 / 0.128546 (-0.087077)	0.013481 / 0.075646 (-0.062165)	0.101528 / 0.419271 (-0.317744)	0.056507 / 0.043533 (0.012974)	0.418166 / 0.255139 (0.163027)	0.443834 / 0.283200 (0.160634)	0.116434 / 0.141683 (-0.025249)	1.651223 / 1.452155 (0.199068)	1.746429 / 1.492716 (0.253713)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242381 / 0.018006 (0.224375)	0.478826 / 0.000490 (0.478337)	0.000463 / 0.000200 (0.000264)	0.000067 / 0.000054 (0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031743 / 0.037411 (-0.005668)	0.126141 / 0.014526 (0.111616)	0.134539 / 0.176557 (-0.042018)	0.216546 / 0.737135 (-0.520590)	0.143513 / 0.296338 (-0.152825)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.486915 / 0.215209 (0.271706)	4.833812 / 2.077655 (2.756158)	2.317785 / 1.504120 (0.813666)	2.114181 / 1.541195 (0.572986)	2.153896 / 1.468490 (0.685406)	0.797490 / 4.584777 (-3.787287)	4.369950 / 3.745712 (0.624238)	2.305492 / 5.269862 (-2.964370)	1.488860 / 4.565676 (-3.076816)	0.098071 / 0.424275 (-0.326204)	0.014129 / 0.007607 (0.006522)	0.611311 / 0.226044 (0.385266)	6.087482 / 2.268929 (3.818554)	2.837676 / 55.444624 (-52.606948)	2.451819 / 6.876477 (-4.424657)	2.456763 / 2.142072 (0.314690)	0.957637 / 4.805227 (-3.847590)	0.190974 / 6.500664 (-6.309690)	0.074497 / 0.075469 (-0.000972)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.466214 / 1.841788 (-0.375574)	17.063925 / 8.074308 (8.989617)	14.630326 / 10.191392 (4.438934)	0.170570 / 0.680424 (-0.509854)	0.023794 / 0.534201 (-0.510407)	0.509175 / 0.579283 (-0.070108)	0.506485 / 0.434364 (0.072121)	0.616965 / 0.540337 (0.076628)	0.718176 / 1.386936 (-0.668760)

Do not write index by default when exporting a dataset

337a4a9

mariosasko requested a review from lhoestq February 27, 2023 17:20

lhoestq approved these changes Feb 28, 2023

mariosasko merged commit c4f14de into main Feb 28, 2023

mariosasko deleted the export-no-index branch February 28, 2023 13:44

albertvillanova mentioned this pull request Mar 20, 2023

The index column created with .to_sql() is dependent on the batch_size when writing #5649

Closed
