support delta lake reader #221

Open
dseynaev opened this issue May 15, 2023 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed reader Anything related to reading data upstream

Comments

@dseynaev

polars seems to support it but it's implemented on the python side: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_delta.html

the underlying delta lake interface lib is written in Rust though: https://docs.rs/deltalake/latest/deltalake/

@sorhawell
Collaborator

sorhawell commented May 15, 2023

The current py-polars implementation:

deltalake.read_table() ->
deltalake_tbl.to_pyarrow() ->
polars.from_arrow() ->
polars_table
(diagram labels: outer py-function; to_pyarrow py)
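
The flow above can be sketched in Python as follows. This is a minimal illustration, not the actual py-polars source: it assumes the shorthand in the flow maps to `DeltaTable(...)` and `to_pyarrow_table()` from the `deltalake` package, and `read_delta_eager` is a hypothetical helper name.

```python
def read_delta_eager(table_uri):
    """Load a whole Delta table into a polars DataFrame via PyArrow (sketch)."""
    # Third-party imports are local so the sketch can be defined without them installed.
    from deltalake import DeltaTable
    import polars as pl

    delta_table = DeltaTable(table_uri)            # open the table at table_uri
    arrow_table = delta_table.to_pyarrow_table()   # PyArrow reads the Parquet files
    return pl.from_arrow(arrow_table)              # hand the Arrow data to polars
```

Every step here runs on the Python side, which is why the same path is not directly available to r-polars.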

delta-rs has first-class Python support.

A potential r-polars pathway via rust api could be:

Read the table with delta-rs (I'm not sure whether this works out of the box with any cloud storage URI): https://docs.rs/deltalake/0.11.0/deltalake/delta/fn.open_table.html

Make a record batch reader with delta-rs: https://docs.rs/deltalake/0.11.0/deltalake/table_state/struct.DeltaTableState.html#method.add_actions_table

Import from the record batch reader into r-polars via arrow2-rs...

@sorhawell
Collaborator

sorhawell commented May 15, 2023

Hi @wjones127, may I ask: do you think it is realistic to build a minimal Delta Lake reader for r-polars via the delta-rs Rust API and arrow2? Or is some filesystem magic from the Python side also needed?

@wjones127

I don't think filesystems are a blocker there; you can use the object stores that come with delta-rs.

But, especially if you are using arrow2, there's no ready-to-use scan function in delta-rs that you could plug into, so there's quite a bit of code you would have to read. Currently, in the Python package, delta-rs provides the file list and the files' statistics, and the Python package then provides the actual file scanners through PyArrow. Eventually, we'll have the scanner available in delta-rs, and then it will be a lot easier to implement the R package, but that will take time.
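
To make "the file list and their statistics" concrete, here is a stdlib-only sketch that replays the JSON commit files in a table's `_delta_log` directory. It is not how delta-rs is implemented: it assumes a local table, ignores checkpoints and all actions other than `add` and `remove`, and `active_files` is a hypothetical name.

```python
import json
from pathlib import Path


def active_files(table_dir):
    """Replay JSON commits in _delta_log; return {data file path: stats or None}."""
    log_dir = Path(table_dir) / "_delta_log"
    files = {}
    # Commit files are named by zero-padded version, so name order is commit order.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one action per line
            if "add" in action:
                add = action["add"]
                # Per-file statistics, when present, are stored as a JSON string.
                stats = add.get("stats")
                files[add["path"]] = json.loads(stats) if stats else None
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files
```

A scanner would then open only the returned Parquet paths, and could use the statistics to skip files that cannot match a filter.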

@dseynaev
Author

dseynaev commented May 17, 2023

@sorhawell @wjones127 @Ploppz and I might have some capacity to investigate/contribute, but we will need some pointers/guidance

would it be helpful to connect over Discord?

@sorhawell
Collaborator

@dseynaev sure :) what discord channel do you prefer? it could be the r-polars subchannel of polars discord

One stepping stone would be an interface to r-arrow datasets; r-polars would then need a scanner adaptor for it. I think it will take me a week or two to write, but it is very parallel to the py-polars/py-arrow interface. Then there would be two good reasons to go ahead with #165
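
For reference, the Python-side pairing of an Arrow dataset with a polars scanner can be sketched like this. It assumes the `deltalake` and `polars` packages; `scan_delta_lazy` is a hypothetical name, and nothing here implies that r-polars already offers an equivalent.

```python
def scan_delta_lazy(table_uri):
    """Return a polars LazyFrame that scans a Delta table through a PyArrow dataset."""
    # Third-party imports are local so the sketch can be defined without them installed.
    from deltalake import DeltaTable
    import polars as pl

    dataset = DeltaTable(table_uri).to_pyarrow_dataset()  # PyArrow dataset over the table's files
    return pl.scan_pyarrow_dataset(dataset)               # polars scans the dataset lazily
```

Unlike an eager read, nothing is loaded until the LazyFrame is collected.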

@etiennebacher etiennebacher added enhancement New feature or request reader Anything related to reading data labels Jun 28, 2023
@eitsupi eitsupi added the help wanted Extra attention is needed label Mar 10, 2024
@eitsupi
Collaborator

eitsupi commented Jun 27, 2024

Waiting for pola-rs/polars#17244
