Avoid saving sparse ChunkedArrays in pyarrow tables #5542

marioga · 2023-02-17T01:52:38Z

Fixes #5541

HuggingFaceDocBuilderDev · 2023-02-17T01:58:40Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Good catch! Thanks a lot for the fix :)

This fix is pretty important, so we'll do a new release soon.

github-actions · 2023-02-17T11:19:25Z

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008452 / 0.011353 (-0.002901)	0.004500 / 0.011008 (-0.006508)	0.100103 / 0.038508 (0.061595)	0.029395 / 0.023109 (0.006286)	0.297740 / 0.275898 (0.021842)	0.359132 / 0.323480 (0.035652)	0.007045 / 0.007986 (-0.000941)	0.003415 / 0.004328 (-0.000913)	0.076389 / 0.004250 (0.072138)	0.036612 / 0.037052 (-0.000440)	0.308773 / 0.258489 (0.050284)	0.345701 / 0.293841 (0.051860)	0.033230 / 0.128546 (-0.095317)	0.011463 / 0.075646 (-0.064183)	0.322382 / 0.419271 (-0.096890)	0.041194 / 0.043533 (-0.002339)	0.300685 / 0.255139 (0.045546)	0.323076 / 0.283200 (0.039876)	0.087330 / 0.141683 (-0.054353)	1.508661 / 1.452155 (0.056506)	1.531776 / 1.492716 (0.039059)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.188391 / 0.018006 (0.170385)	0.400102 / 0.000490 (0.399612)	0.002006 / 0.000200 (0.001806)	0.000075 / 0.000054 (0.000021)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023232 / 0.037411 (-0.014179)	0.097313 / 0.014526 (0.082787)	0.106244 / 0.176557 (-0.070313)	0.141180 / 0.737135 (-0.595955)	0.107871 / 0.296338 (-0.188468)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.418610 / 0.215209 (0.203400)	4.162243 / 2.077655 (2.084588)	1.884300 / 1.504120 (0.380180)	1.694197 / 1.541195 (0.153002)	1.727740 / 1.468490 (0.259250)	0.692129 / 4.584777 (-3.892648)	3.364230 / 3.745712 (-0.381482)	1.871507 / 5.269862 (-3.398355)	1.261520 / 4.565676 (-3.304156)	0.083258 / 0.424275 (-0.341017)	0.012479 / 0.007607 (0.004872)	0.528802 / 0.226044 (0.302757)	5.281029 / 2.268929 (3.012100)	2.402222 / 55.444624 (-53.042403)	2.064954 / 6.876477 (-4.811522)	2.027044 / 2.142072 (-0.115029)	0.813124 / 4.805227 (-3.992103)	0.149397 / 6.500664 (-6.351267)	0.065032 / 0.075469 (-0.010437)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.239192 / 1.841788 (-0.602595)	13.529913 / 8.074308 (5.455605)	14.253251 / 10.191392 (4.061859)	0.165145 / 0.680424 (-0.515278)	0.028367 / 0.534201 (-0.505834)	0.395121 / 0.579283 (-0.184162)	0.405372 / 0.434364 (-0.028992)	0.472201 / 0.540337 (-0.068137)	0.560620 / 1.386936 (-0.826316)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006368 / 0.011353 (-0.004985)	0.004542 / 0.011008 (-0.006466)	0.076361 / 0.038508 (0.037853)	0.026893 / 0.023109 (0.003784)	0.341210 / 0.275898 (0.065312)	0.378377 / 0.323480 (0.054898)	0.004833 / 0.007986 (-0.003153)	0.003358 / 0.004328 (-0.000970)	0.075516 / 0.004250 (0.071265)	0.038841 / 0.037052 (0.001788)	0.342230 / 0.258489 (0.083741)	0.384317 / 0.293841 (0.090476)	0.031874 / 0.128546 (-0.096672)	0.011651 / 0.075646 (-0.063995)	0.085816 / 0.419271 (-0.333455)	0.042389 / 0.043533 (-0.001144)	0.340678 / 0.255139 (0.085539)	0.367441 / 0.283200 (0.084241)	0.089748 / 0.141683 (-0.051935)	1.487358 / 1.452155 (0.035203)	1.615049 / 1.492716 (0.122333)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.220933 / 0.018006 (0.202926)	0.397162 / 0.000490 (0.396673)	0.002336 / 0.000200 (0.002136)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025004 / 0.037411 (-0.012407)	0.100877 / 0.014526 (0.086351)	0.110624 / 0.176557 (-0.065932)	0.152042 / 0.737135 (-0.585094)	0.112951 / 0.296338 (-0.183388)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.441071 / 0.215209 (0.225862)	4.419471 / 2.077655 (2.341817)	2.082976 / 1.504120 (0.578856)	1.884023 / 1.541195 (0.342828)	1.950590 / 1.468490 (0.482100)	0.706104 / 4.584777 (-3.878673)	3.329825 / 3.745712 (-0.415887)	1.868850 / 5.269862 (-3.401011)	1.178785 / 4.565676 (-3.386892)	0.083910 / 0.424275 (-0.340365)	0.012296 / 0.007607 (0.004689)	0.542998 / 0.226044 (0.316953)	5.429944 / 2.268929 (3.161015)	2.502285 / 55.444624 (-52.942339)	2.150507 / 6.876477 (-4.725970)	2.170492 / 2.142072 (0.028420)	0.813410 / 4.805227 (-3.991817)	0.152310 / 6.500664 (-6.348354)	0.066999 / 0.075469 (-0.008470)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.290839 / 1.841788 (-0.550949)	14.089491 / 8.074308 (6.015183)	13.704922 / 10.191392 (3.513530)	0.130089 / 0.680424 (-0.550335)	0.017000 / 0.534201 (-0.517201)	0.381173 / 0.579283 (-0.198110)	0.389271 / 0.434364 (-0.045093)	0.461700 / 0.540337 (-0.078637)	0.556428 / 1.386936 (-0.830508)
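A note on reading the tables above: each cell has the form `new / old (diff)`, and the reported diff appears to be `new - old` (for example, 0.008452 - 0.011353 = -0.002901). The sketch below parses one cell and checks that relationship. The helper names `parse_cell` and `is_consistent` are hypothetical and not part of the benchmark tooling, and the tolerance is an assumption that allows for the last digit differing by one, since some cells seem to have been rounded after the subtraction.

```python
import re

# One benchmark cell looks like "0.008452 / 0.011353 (-0.002901)".
CELL = re.compile(r"^\s*(-?\d+\.\d+)\s*/\s*(-?\d+\.\d+)\s*\((-?\d+\.\d+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats from a 'new / old (diff)' cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=1.5e-6):
    """Check that the reported diff equals new - old within a small tolerance.

    The tolerance is an assumption: the tables print six decimals, and a few
    cells differ from new - old by one unit in the last printed digit.
    """
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


if __name__ == "__main__":
    for cell in (
        "0.008452 / 0.011353 (-0.002901)",
        "0.418610 / 0.215209 (0.203400)",
    ):
        print(cell, "->", parse_cell(cell), is_consistent(cell))
```

A negative diff means the new value is smaller than the old one. The tables do not state units, so whether smaller means faster depends on what each benchmark measures.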

Avoid saving sparse ChunkedArrays in pyarrow tables

9c27eb9

marioga mentioned this pull request Feb 17, 2023

Flattening indices in selected datasets is extremely inefficient #5541

Closed

lhoestq approved these changes Feb 17, 2023

lhoestq merged commit 2cfa9be into huggingface:main Feb 17, 2023

marioga deleted the optimize_flatten_with_indices branch February 17, 2023 19:20

AJDERS pushed a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Avoid saving sparse ChunkedArrays in pyarrow tables (huggingface#5542)

b142d20

AJDERS added a commit to AJDERS/datasets that referenced this pull request Feb 21, 2023

Revert "Avoid saving sparse ChunkedArrays in pyarrow tables (huggingf…

c14d283

…ace#5542)" This reverts commit b142d20.

westonpace mentioned this pull request Apr 26, 2023

[Python] ArrowNotImplementedError: concatenation of extension<arrow.py_extension_type<Array2DExtensionType>> apache/arrow#34455

Closed
