feat: support non streamable arrow file binary format #7025

kmehant · 2024-07-04T10:11:12Z

Add support for Arrow files (.arrow) stored in the non-streamable binary file format, in addition to the streaming format.

kmehant · 2024-07-04T17:44:13Z

requesting review - @albertvillanova @lhoestq

src/datasets/packaged_modules/arrow/arrow.py

lhoestq

Awesome, thank you! This will be pretty useful :)

Before we merge, could you also add a test in tests/packaged_modules/test_arrow.py?

I noticed it's pretty empty right now compared to test_json.py or test_csv.py, though, so maybe I can take care of it next week if needed.

HuggingFaceDocBuilderDev · 2024-07-09T11:25:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kmehant · 2024-07-17T17:15:45Z

@lhoestq I've rebased the PR. It would be really helpful to have this feature in datasets; please let me know if anything is still pending on this PR. Thanks.

kmehant · 2024-07-25T13:35:17Z

@lhoestq

I have added a unit test that generates tables from both Arrow formats: file and streaming.

Let me know if any docs changes are needed as well. Thanks.

kmehant · 2024-07-29T13:03:29Z

@lhoestq any update on this thread? Thanks

prince14322 · 2024-07-31T04:45:27Z

Timely PR!
Can we please look into this?

albertvillanova

Thank you for the useful enhancement and the test!

github-actions · 2024-07-31T06:15:49Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005737 / 0.011353 (-0.005615)	0.003894 / 0.011008 (-0.007114)	0.067510 / 0.038508 (0.029002)	0.033431 / 0.023109 (0.010321)	0.262766 / 0.275898 (-0.013132)	0.283776 / 0.323480 (-0.039704)	0.003296 / 0.007986 (-0.004689)	0.003577 / 0.004328 (-0.000752)	0.052165 / 0.004250 (0.047915)	0.047815 / 0.037052 (0.010763)	0.263528 / 0.258489 (0.005039)	0.292980 / 0.293841 (-0.000861)	0.031535 / 0.128546 (-0.097011)	0.012966 / 0.075646 (-0.062680)	0.218827 / 0.419271 (-0.200444)	0.039181 / 0.043533 (-0.004352)	0.263768 / 0.255139 (0.008629)	0.288012 / 0.283200 (0.004813)	0.020562 / 0.141683 (-0.121121)	1.180547 / 1.452155 (-0.271608)	1.269283 / 1.492716 (-0.223433)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.098951 / 0.018006 (0.080944)	0.318922 / 0.000490 (0.318433)	0.000214 / 0.000200 (0.000014)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021315 / 0.037411 (-0.016097)	0.067728 / 0.014526 (0.053202)	0.079428 / 0.176557 (-0.097129)	0.127472 / 0.737135 (-0.609663)	0.080455 / 0.296338 (-0.215883)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.308725 / 0.215209 (0.093516)	3.043555 / 2.077655 (0.965900)	1.587419 / 1.504120 (0.083299)	1.444421 / 1.541195 (-0.096774)	1.470703 / 1.468490 (0.002213)	0.784005 / 4.584777 (-3.800772)	2.582064 / 3.745712 (-1.163648)	3.140269 / 5.269862 (-2.129592)	2.031099 / 4.565676 (-2.534577)	0.086999 / 0.424275 (-0.337277)	0.005923 / 0.007607 (-0.001684)	0.361333 / 0.226044 (0.135289)	3.587173 / 2.268929 (1.318244)	1.961448 / 55.444624 (-53.483177)	1.649868 / 6.876477 (-5.226609)	1.698595 / 2.142072 (-0.443478)	0.858552 / 4.805227 (-3.946676)	0.146001 / 6.500664 (-6.354663)	0.046049 / 0.075469 (-0.029421)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.022644 / 1.841788 (-0.819144)	12.655994 / 8.074308 (4.581686)	10.205832 / 10.191392 (0.014440)	0.156073 / 0.680424 (-0.524351)	0.015550 / 0.534201 (-0.518651)	0.327762 / 0.579283 (-0.251521)	0.299212 / 0.434364 (-0.135152)	0.367549 / 0.540337 (-0.172788)	0.474499 / 1.386936 (-0.912437)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005904 / 0.011353 (-0.005448)	0.004245 / 0.011008 (-0.006763)	0.054309 / 0.038508 (0.015801)	0.037490 / 0.023109 (0.014381)	0.293540 / 0.275898 (0.017642)	0.324068 / 0.323480 (0.000588)	0.004675 / 0.007986 (-0.003311)	0.003091 / 0.004328 (-0.001238)	0.052972 / 0.004250 (0.048721)	0.045545 / 0.037052 (0.008493)	0.301465 / 0.258489 (0.042976)	0.342822 / 0.293841 (0.048981)	0.033958 / 0.128546 (-0.094588)	0.013311 / 0.075646 (-0.062336)	0.064050 / 0.419271 (-0.355222)	0.038127 / 0.043533 (-0.005406)	0.297383 / 0.255139 (0.042244)	0.312244 / 0.283200 (0.029044)	0.019395 / 0.141683 (-0.122288)	1.244335 / 1.452155 (-0.207820)	1.305547 / 1.492716 (-0.187169)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.101847 / 0.018006 (0.083840)	0.330827 / 0.000490 (0.330337)	0.000211 / 0.000200 (0.000011)	0.000047 / 0.000054 (-0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025734 / 0.037411 (-0.011677)	0.085020 / 0.014526 (0.070494)	0.096724 / 0.176557 (-0.079833)	0.141276 / 0.737135 (-0.595859)	0.099150 / 0.296338 (-0.197189)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.316058 / 0.215209 (0.100849)	3.059459 / 2.077655 (0.981804)	1.638394 / 1.504120 (0.134274)	1.505313 / 1.541195 (-0.035881)	1.526635 / 1.468490 (0.058145)	0.777259 / 4.584777 (-3.807518)	1.059575 / 3.745712 (-2.686137)	2.952334 / 5.269862 (-2.317528)	2.003894 / 4.565676 (-2.561782)	0.084464 / 0.424275 (-0.339811)	0.007343 / 0.007607 (-0.000265)	0.366218 / 0.226044 (0.140174)	3.705588 / 2.268929 (1.436660)	2.047029 / 55.444624 (-53.397595)	1.766970 / 6.876477 (-5.109507)	1.883804 / 2.142072 (-0.258268)	0.865780 / 4.805227 (-3.939447)	0.143180 / 6.500664 (-6.357485)	0.044943 / 0.075469 (-0.030527)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.141391 / 1.841788 (-0.700397)	13.244917 / 8.074308 (5.170609)	10.907863 / 10.191392 (0.716471)	0.156087 / 0.680424 (-0.524337)	0.016487 / 0.534201 (-0.517714)	0.331377 / 0.579283 (-0.247906)	0.148863 / 0.434364 (-0.285501)	0.370443 / 0.540337 (-0.169895)	0.499647 / 1.386936 (-0.887289)

* feat: support non streamable arrow file binary format

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

* use generator

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* feat: add unit test to load data in both arrow formats

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

---------

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

kmehant force-pushed the support-non-streamable-arrow-files branch from 8be0e3f to c75c4c3 on July 4, 2024 at 17:44

lhoestq reviewed Jul 8, 2024

src/datasets/packaged_modules/arrow/arrow.py (outdated, resolved)

kmehant force-pushed the support-non-streamable-arrow-files branch 2 times, most recently from 2e3af68 to a3412c5 on July 9, 2024 at 02:21

lhoestq reviewed Jul 9, 2024

src/datasets/packaged_modules/arrow/arrow.py (outdated, resolved)

kmehant requested a review from lhoestq July 9, 2024 17:08

kmehant force-pushed the support-non-streamable-arrow-files branch from b497b7d to c257792 on July 17, 2024 at 17:14

kmehant force-pushed the support-non-streamable-arrow-files branch from c257792 to bd6546c on July 25, 2024 at 13:33

kmehant and others added 3 commits July 29, 2024 18:33

feat: support non streamable arrow file binary format

b3d3707

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

use generator

734883a

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

feat: add unit test to load data in both arrow formats

a30a66a

Signed-off-by: Mehant Kammakomati <mehant.kammakomati2@ibm.com>

kmehant force-pushed the support-non-streamable-arrow-files branch from bd6546c to a30a66a on July 29, 2024 at 13:03

albertvillanova approved these changes Jul 31, 2024

albertvillanova merged commit ce4a0c5 into huggingface:main Jul 31, 2024
14 checks passed
