Add polars compatibility #6531

psmyth94 · 2023-12-24T20:03:23Z

Hey there,

I've just finished adding support for converting datasets to and from polars.DataFrame and for formatting them as Polars. This is in response to the open issue about integrating Polars, #3334. A dataset can be switched to the Polars format via Dataset.set_format("polars"). I've also included to_polars and from_polars. All Polars functions are guarded by config.POLARS_AVAILABLE.
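For readers who want to try the API described in this comment, here is a minimal sketch of a round trip. It assumes an installed polars and a datasets release that already contains this PR. The POLARS_AVAILABLE flag is computed locally with importlib only to illustrate how a check like config.POLARS_AVAILABLE might work, and polars_round_trip is a hypothetical helper name, not part of the library.

```python
import importlib.util

# Assumption: an availability flag can be derived from whether the package is
# importable. This mirrors the idea of config.POLARS_AVAILABLE, not its exact code.
POLARS_AVAILABLE = importlib.util.find_spec("polars") is not None
DATASETS_AVAILABLE = importlib.util.find_spec("datasets") is not None


def polars_round_trip():
    """Hypothetical helper: polars.DataFrame -> Dataset -> polars.DataFrame.

    Returns None when the optional dependencies (or the PR's methods) are missing.
    """
    if not (POLARS_AVAILABLE and DATASETS_AVAILABLE):
        return None

    import polars as pl
    from datasets import Dataset

    if not hasattr(Dataset, "from_polars"):
        return None  # the installed datasets release predates this PR

    df = pl.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})

    ds = Dataset.from_polars(df)   # polars.DataFrame -> Dataset
    ds.set_format("polars")        # indexing should now yield polars objects
    batch = ds[:2]                 # a slice, formatted as a polars.DataFrame
    ds.reset_format()              # back to the default Python format

    df_back = ds.to_polars()       # Dataset -> polars.DataFrame
    return {
        "columns": list(df_back.columns),
        "num_rows": df_back.height,
        "batch_rows": batch.height,
    }


if __name__ == "__main__":
    print(polars_round_trip())
```

The function returns None instead of raising when polars is absent, which is the same defensive idea as guarding every Polars code path behind an availability flag.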

A few notes:
This only supports DataFrames, not LazyFrames. LazyFrame support could probably be added fairly easily via an is_lazy argument in set_format and to_polars.
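Until something like the is_lazy argument exists, one possible workaround is to materialize a LazyFrame before conversion. The sketch below assumes polars and a datasets release containing this PR are installed. dataset_from_lazy and demo are hypothetical names used for illustration and are not part of either library.

```python
import importlib.util


def dataset_from_lazy(lazy_frame):
    """Hypothetical helper (not in the PR): collect a LazyFrame, then convert."""
    from datasets import Dataset

    df = lazy_frame.collect()       # LazyFrame -> eager polars.DataFrame
    return Dataset.from_polars(df)  # eager DataFrame -> Dataset


def demo():
    """Run the helper on a tiny lazy query; return None if dependencies are missing."""
    if importlib.util.find_spec("polars") is None:
        return None
    if importlib.util.find_spec("datasets") is None:
        return None

    import polars as pl
    from datasets import Dataset

    if not hasattr(Dataset, "from_polars"):
        return None  # the installed datasets release predates this PR

    # Build a lazy query: nothing is computed until collect() is called.
    lazy = pl.DataFrame({"x": [1, 2, 3, 4]}).lazy().filter(pl.col("x") > 2)
    return dataset_from_lazy(lazy).num_rows


if __name__ == "__main__":
    print(demo())
```

Collecting eagerly gives up the lazy query's streaming benefits, so this is a stopgap rather than a substitute for native LazyFrame support.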

Let me know if you have any feedback.

lhoestq · 2024-03-01T16:49:36Z

Hi ! thanks for adding polars support :)

You added from_polars in arrow_dataset.py but not to_polars, is this on purpose ?

Also no need to touch table.py imo, which is for arrow-only logic (tables are just wrappers of pyarrow.Table with the exact same methods + optimization to existing methods + separation between in-memory and memory-mapped)

HuggingFaceDocBuilderDev · 2024-03-06T22:51:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

psmyth94 · 2024-03-07T04:56:49Z

Hi @lhoestq, thanks for pointing out the missing to_polars method.

I see your point about table.py, so I removed my changes there.

I also added tests in test_arrow_dataset.py, test_dataset_dict.py, and test_formatting.py. Let me know if I am missing any.

lhoestq

Thanks ! Can you add polars to the test dependencies in setup.py ? This way your tests will run in the CI
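To illustrate why the dependency matters: if tests are guarded on an optional package, they are silently skipped wherever that package is not installed. The following is a minimal sketch of that pattern using only unittest. The require_polars decorator and the test class are hypothetical illustrations and are not claimed to match the repository's own test utilities.

```python
import importlib.util
import unittest

# Assumption: availability is detected by checking whether polars is importable.
POLARS_AVAILABLE = importlib.util.find_spec("polars") is not None


def require_polars(test_case):
    """Hypothetical decorator: skip the test unless polars is installed."""
    return unittest.skipUnless(POLARS_AVAILABLE, "test requires polars")(test_case)


class PolarsSmokeTest(unittest.TestCase):
    @require_polars
    def test_dataframe_shape(self):
        import polars as pl

        df = pl.DataFrame({"a": [1, 2]})
        self.assertEqual(df.shape, (2, 1))  # (rows, columns)


def run_smoke_test():
    """Run the suite and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PolarsSmokeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    outcome = run_smoke_test()
    print("skipped:", len(outcome.skipped))
```

With polars installed the test runs; without it the run still succeeds but reports one skipped test, which is why a CI environment lacking the dependency would never exercise the new code.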

I also added a few more comments:

src/datasets/arrow_dataset.py

lhoestq

This should fix the CI :)

src/datasets/arrow_dataset.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

setup.py

lhoestq

Ah, our beloved Windows doesn't seem to be properly handled. I added suggestions to try to fix the Windows CI:

tests/test_arrow_dataset.py

psmyth94 · 2024-03-08T14:52:13Z

The duckdb index files were deleted yesterday in dataset_with_script@ref/convert/parquet, so I changed the hash to reflect the new SHA.

lhoestq

Great ! Merging now, congrats ! 🚀

github-actions · 2024-03-08T15:29:16Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004993 / 0.011353 (-0.006360)	0.003658 / 0.011008 (-0.007350)	0.063868 / 0.038508 (0.025360)	0.030022 / 0.023109 (0.006912)	0.246359 / 0.275898 (-0.029539)	0.273409 / 0.323480 (-0.050070)	0.003091 / 0.007986 (-0.004894)	0.003383 / 0.004328 (-0.000945)	0.050666 / 0.004250 (0.046415)	0.040609 / 0.037052 (0.003557)	0.267250 / 0.258489 (0.008761)	0.289823 / 0.293841 (-0.004018)	0.027635 / 0.128546 (-0.100911)	0.010786 / 0.075646 (-0.064860)	0.208442 / 0.419271 (-0.210830)	0.036627 / 0.043533 (-0.006906)	0.254116 / 0.255139 (-0.001023)	0.274368 / 0.283200 (-0.008832)	0.018222 / 0.141683 (-0.123460)	1.184472 / 1.452155 (-0.267683)	1.194309 / 1.492716 (-0.298407)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092861 / 0.018006 (0.074855)	0.304736 / 0.000490 (0.304246)	0.000219 / 0.000200 (0.000019)	0.000175 / 0.000054 (0.000121)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019378 / 0.037411 (-0.018034)	0.062342 / 0.014526 (0.047817)	0.074107 / 0.176557 (-0.102450)	0.121746 / 0.737135 (-0.615390)	0.075657 / 0.296338 (-0.220681)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.286474 / 0.215209 (0.071265)	2.832043 / 2.077655 (0.754389)	1.453520 / 1.504120 (-0.050600)	1.324714 / 1.541195 (-0.216480)	1.335439 / 1.468490 (-0.133051)	0.571753 / 4.584777 (-4.013024)	2.427361 / 3.745712 (-1.318352)	2.899838 / 5.269862 (-2.370024)	1.775754 / 4.565676 (-2.789922)	0.064177 / 0.424275 (-0.360098)	0.004978 / 0.007607 (-0.002629)	0.343585 / 0.226044 (0.117541)	3.368494 / 2.268929 (1.099565)	1.819825 / 55.444624 (-53.624800)	1.502633 / 6.876477 (-5.373844)	1.549182 / 2.142072 (-0.592891)	0.658245 / 4.805227 (-4.146983)	0.120052 / 6.500664 (-6.380612)	0.043051 / 0.075469 (-0.032419)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.977055 / 1.841788 (-0.864733)	11.595567 / 8.074308 (3.521259)	9.450951 / 10.191392 (-0.740441)	0.141060 / 0.680424 (-0.539364)	0.014359 / 0.534201 (-0.519842)	0.289938 / 0.579283 (-0.289345)	0.266035 / 0.434364 (-0.168329)	0.326802 / 0.540337 (-0.213536)	0.431913 / 1.386936 (-0.955023)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005391 / 0.011353 (-0.005961)	0.003724 / 0.011008 (-0.007284)	0.050432 / 0.038508 (0.011924)	0.029904 / 0.023109 (0.006794)	0.270870 / 0.275898 (-0.005028)	0.296773 / 0.323480 (-0.026706)	0.004265 / 0.007986 (-0.003721)	0.002751 / 0.004328 (-0.001577)	0.050366 / 0.004250 (0.046116)	0.046415 / 0.037052 (0.009363)	0.283272 / 0.258489 (0.024783)	0.320188 / 0.293841 (0.026347)	0.029827 / 0.128546 (-0.098719)	0.010736 / 0.075646 (-0.064910)	0.059541 / 0.419271 (-0.359731)	0.057080 / 0.043533 (0.013548)	0.270653 / 0.255139 (0.015514)	0.291235 / 0.283200 (0.008035)	0.018590 / 0.141683 (-0.123093)	1.129402 / 1.452155 (-0.322752)	1.194499 / 1.492716 (-0.298217)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.102220 / 0.018006 (0.084214)	0.302176 / 0.000490 (0.301686)	0.000229 / 0.000200 (0.000029)	0.000056 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022809 / 0.037411 (-0.014602)	0.076054 / 0.014526 (0.061528)	0.087466 / 0.176557 (-0.089091)	0.128495 / 0.737135 (-0.608640)	0.089933 / 0.296338 (-0.206406)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.296546 / 0.215209 (0.081337)	2.898693 / 2.077655 (0.821039)	1.605002 / 1.504120 (0.100883)	1.468370 / 1.541195 (-0.072825)	1.503541 / 1.468490 (0.035051)	0.577233 / 4.584777 (-4.007544)	2.460154 / 3.745712 (-1.285558)	2.755651 / 5.269862 (-2.514211)	1.777711 / 4.565676 (-2.787966)	0.063137 / 0.424275 (-0.361138)	0.005056 / 0.007607 (-0.002551)	0.350189 / 0.226044 (0.124145)	3.485473 / 2.268929 (1.216545)	1.952553 / 55.444624 (-53.492072)	1.669108 / 6.876477 (-5.207369)	1.788504 / 2.142072 (-0.353569)	0.672869 / 4.805227 (-4.132359)	0.117717 / 6.500664 (-6.382948)	0.040499 / 0.075469 (-0.034970)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.048187 / 1.841788 (-0.793601)	12.663229 / 8.074308 (4.588921)	10.316487 / 10.191392 (0.125095)	0.142537 / 0.680424 (-0.537887)	0.016024 / 0.534201 (-0.518177)	0.292735 / 0.579283 (-0.286548)	0.273294 / 0.434364 (-0.161069)	0.327636 / 0.540337 (-0.212701)	0.443062 / 1.386936 (-0.943874)

lhoestq · 2024-03-08T16:18:26Z

I'm so excited I tweeted about it: https://x.com/qlhoest/status/1766135995513082086?s=20 I hope it's fine !

psmyth94 · 2024-03-08T19:29:23Z

Thanks @lhoestq for the support and totally fine with the share! Happy to see people excited for this 😃

Patrick Smyth added 5 commits December 24, 2023 13:22

Add Polars support for data formatting and conversion

83d1fbb

Update Polars availability check in config.py

fec2bb3

Merge branch 'add-polars-compatibility' of github.com:psmyth94/datasets into add-polars-compatibility

622e54e

Merge branch 'add-polars-compatibility' of github.com:psmyth94/datasets into add-polars-compatibility

82b9b7c

Merge branch 'add-polars-compatibility' of github.com:psmyth94/datasets into add-polars-compatibility

aae3f5a

Patrick Smyth and others added 5 commits March 6, 2024 16:23

added to_polars

7374e99

changed the logic of importing polars if not already called

408b9d6

Remove to and from_polars from table.py in order to maintain pa.table logic-only

39a5c56

Merge branch 'main' into add-polars-compatibility

3aa7081

fix unused import

2f10384

Merge branch 'add-polars-compatibility' of github.com:psmyth94/datasets into add-polars-compatibility

a623f51

psmyth94 closed this Mar 6, 2024

psmyth94 reopened this Mar 6, 2024

Patrick Smyth added 7 commits March 6, 2024 17:01

fixed code formatting with ruff

12fef57

fix formatting issues with ruff

a57fcbe

fix formatting issues using ruff

912c437

add tests for polars formatting

ce7c3c5

removed using InMemoryTable classmethod to convert polars to Table

1b28d85

added test for polars conversion

eb4d7ce

added missing ruff fixes

417f9ad

lhoestq reviewed Mar 7, 2024

src/datasets/arrow_dataset.py (outdated, resolved)

src/datasets/arrow_dataset.py (outdated, resolved)

src/datasets/arrow_dataset.py (outdated, resolved)

Patrick Smyth and others added 3 commits March 7, 2024 07:55

add polars in test dependencies

19e5d80

Fixed not executing default write method due to nested polars check.

7c835a4

Merge branch 'main' into add-polars-compatibility

d0582f9

lhoestq reviewed Mar 7, 2024

src/datasets/arrow_dataset.py (outdated, resolved)

src/datasets/arrow_dataset.py (outdated, resolved)

psmyth94 and others added 2 commits March 7, 2024 10:31

Update src/datasets/arrow_dataset.py

fa51fd2

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update src/datasets/arrow_dataset.py

d09839c

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Patrick Smyth added 6 commits March 7, 2024 10:35

Fix Polars DataFrame conversion bug

1c6d2a5

Merge branch 'add-polars-compatibility' of github.com:psmyth94/datasets into add-polars-compatibility

40614ee

Fix DataFrame conversion in arrow_dataset.py

7d6224b

Fix variable name in arrow_dataset.py

a301eb3

Fix write_table to write_row in Dataset class

d062b57

fix formatting with ruff

1dbdc80

lhoestq reviewed Mar 7, 2024

setup.py (outdated, resolved)

Patrick Smyth added 2 commits March 7, 2024 11:39

Update polars dependency to include timezone support

1b9e450

Remove polars in EXTRAS_REQUIRE

23329d5

lhoestq reviewed Mar 7, 2024

tests/test_arrow_dataset.py (resolved)

tests/test_arrow_dataset.py (resolved)

Patrick Smyth and others added 11 commits March 7, 2024 12:57

Replace deprecated method

f4361cc

perform cleanup after use

53f471a

Merge branch 'main' into add-polars-compatibility

52fd448

remove unused import

b898b57

Add garbage collection to test_to_polars method

4bacf3b

Remove unused import and unnecessary code in test_to_polars method

358b2cb

Add additional args for to_polars method

c00efad

Fixed unclosed links to dataset file

ddaab5b

ruff cleanup

a87998c

even ruffier cleanup

d1acc92

changed hash to reflect new SHA for ref/convert/parquet

7ee3fdf

lhoestq approved these changes Mar 8, 2024

lhoestq merged commit 90b8961 into huggingface:main Mar 8, 2024
12 checks passed

psmyth94 deleted the add-polars-compatibility branch March 8, 2024 15:59

loicmagne mentioned this pull request Apr 23, 2024

Slow loading for datasets with a high number of language pairs embeddings-benchmark/mteb#530

Closed

albertvillanova linked an issue Aug 31, 2024 that may be closed by this pull request

Integrate Polars library #3334

Closed
