Are there plans to support delta reader/writer? #2858

Closed
francisco-ltech opened this issue Mar 8, 2022 · 21 comments · Fixed by #7616

Comments

@francisco-ltech

francisco-ltech commented Mar 8, 2022

Is any integration with Delta Lake on the horizon, by any chance?
https://delta.io/

There is a native Delta Lake implementation in Rust:
https://github.com/delta-io/delta-rs/tree/main/rust

@ghuls
Collaborator

ghuls commented Mar 8, 2022

I think that is very unlikely. As far as I can see, Delta Lake does not use the Arrow format, but requires Spark.

You can use a Python package to read the data into a pyarrow table, which you can then convert to a Polars DataFrame.

There seem to be two Python packages that do this (which seem to have the same name):

https://databricks.com/blog/2020/12/22/natively-query-your-delta-lake-with-scala-java-and-python.html
https://github.com/delta-io/delta-rs/tree/main/python

import polars as pl

# Import Delta Table
from deltalake import DeltaTable

# Read the Delta Table using the Rust API
dt = DeltaTable("../rust/tests/data/simple_table")

# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table
df = pl.DataFrame(dt.to_pyarrow_table())

https://pypi.org/project/delta-lake-reader/

import polars as pl

from deltalake import DeltaTable

# native file path. Can be relative or absolute
table_path = "somepath/mytable"

# Create a Polars Dataframe by initially converting the Delta Lake
# table into a PyArrow table.
df = pl.DataFrame(DeltaTable(table_path).to_table())

@ritchie46
Member

There were plans to do so, but delta-rs is based on arrow-rs and Polars uses arrow2, so there are some difficulties.

@jorgecarleitao
Collaborator

cc @houqp

@houqp

houqp commented Mar 9, 2022

There is ongoing work to migrate delta-rs to arrow2 and parquet2; see delta-io/delta-rs#465. The current branch is mostly complete except for map and list type support. We also need to update to the latest arrow2/parquet2 version :D Once the port is completed, plugging it into Polars should be pretty trivial.

@andrei-ionescu
Contributor

I'm also looking forward to Delta Lake support!

@MrPowers

MrPowers commented Sep 2, 2022

@ritchie46 - a new version of delta-rs was recently released with parquet2 support, see here. Thanks for adding this @houqp! Will you be able to add delta-rs support now?

@esadler-hbo

I also want to say this would be great. I bet your implementation will be close to as fast as the Photon compute engine that Databricks charges way too much for.

@MrPowers

Is anyone willing to take on this work? There are a lot of delta-rs developers that are willing to help with code reviews and any issues you might come across. Feel free to ping me directly or here if you're interested.

winding-lines added a commit to winding-lines/polars that referenced this issue Nov 26, 2022
Integrate the delta-rs library for interacting with the Delta reader.

pola-rs#2858
@winding-lines
Contributor

I didn't realize that my playpen pushes would link back here 😂 @MrPowers I am not really sure what my next steps are, let me know if you are still available for some guidance on how to implement this feature.

@chitralverma
Contributor

I'm working on this feature and will raise a PR soon.

@MrPowers

MrPowers commented Dec 9, 2022

@chitralverma - that's awesome. Let me know if you need any help. We can jump on a call with the core delta-rs devs anytime. Really excited about this feature. I'll blog / promote it as soon as it is live 🚀

@chitralverma

This comment was marked as outdated.

@chitralverma
Contributor

chitralverma commented Dec 11, 2022

Update: The read_delta and scan_delta functionalities have been merged via #5761! 🎊

https://pola-rs.github.io/polars/py-polars/html/reference/io.html#delta-lake
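
A minimal sketch of how the merged readers might be called, assuming a Polars release that includes `pl.read_delta` and `pl.scan_delta` and that the `deltalake` package is installed. The helper `load_delta` and its parameter names are hypothetical, chosen for illustration only:

```python
from typing import Optional


def load_delta(table_uri: str, version: Optional[int] = None, lazy: bool = False):
    """Load a Delta table into Polars, eagerly or lazily.

    `load_delta` is a hypothetical wrapper, not part of Polars itself.
    """
    # Imported inside the function so the optional dependencies
    # (polars, deltalake) are only needed when the helper is called.
    import polars as pl

    if lazy:
        # LazyFrame: filters and projections can be added before .collect()
        return pl.scan_delta(table_uri, version=version)
    # DataFrame: the whole table version is materialised in memory
    return pl.read_delta(table_uri, version=version)
```

With `lazy=True` the result is a LazyFrame, so a filter or column selection can be chained before calling `.collect()`. Passing `version=0` would request the table as of its first commit.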

@dridk

dridk commented Dec 18, 2022

Hi!
Thanks for this amazing feature.
What about the writer function?
I would love to avoid Spark and only use Rust.

@chitralverma
Contributor

> Hi!
> Thanks for this amazing feature.
> What about the writer function?
> I would love to avoid Spark and only use Rust.

I was working on it, but the plan is to implement it on the Python side.

For now, though, it's blocked by delta-io/delta-rs#1024: delta-rs doesn't support the large_string type.
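
To illustrate the large_string limitation: Polars exports string columns to Arrow as `large_string`, so one possible interim workaround is to cast those columns to plain `string` before handing the table to the `deltalake` Python writer. This is a sketch under assumptions, not something proposed in this thread. The helper names `downcast_large_strings` and `write_polars_frame` are hypothetical, and nested types such as lists of large strings are not handled:

```python
def downcast_large_strings(table):
    """Return a pyarrow Table whose top-level large_string columns are cast to string.

    Hypothetical helper. The cast can fail if a single chunk holds more
    string data than 32-bit offsets can address.
    """
    import pyarrow as pa

    fields = [
        field.with_type(pa.string()) if pa.types.is_large_string(field.type) else field
        for field in table.schema
    ]
    return table.cast(pa.schema(fields, metadata=table.schema.metadata))


def write_polars_frame(df, table_uri, mode="append"):
    """Hypothetical helper: write a Polars DataFrame via deltalake's Python writer."""
    from deltalake import write_deltalake

    # df.to_arrow() yields a pyarrow Table; downcast before writing
    write_deltalake(table_uri, downcast_large_strings(df.to_arrow()), mode=mode)
```

Columns of other types pass through unchanged, so the helper can be applied to any table without checking first whether it contains large strings.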

@lordirah

Any update on this? I'm waiting for this feature.

@chitralverma
Contributor

> Any update on this? I'm waiting for this feature.

The lazy/eager reader is already in place.

For the writer, #7574 is now open.

@stinodego
Contributor

I'll close this in favor of #7574, as it's more specific.

Read/scan functionality has been implemented; write functionality is being worked on.
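
For readers arriving later: subsequent Polars releases expose a `DataFrame.write_delta` method alongside the readers. A minimal round-trip sketch, assuming such a release with the `deltalake` package installed. The function name `roundtrip_delta` and the sample data are illustrative only:

```python
def roundtrip_delta(target_dir: str):
    """Write a small frame as a Delta table, append to it, and read it back."""
    # Imported here so polars/deltalake are only required when called
    import polars as pl

    df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
    # The first write creates (or replaces) the table at target_dir
    df.write_delta(target_dir, mode="overwrite")
    # The second write adds the same rows as a new table version
    df.write_delta(target_dir, mode="append")
    # Reading without a version returns the latest table state
    return pl.read_delta(target_dir)
```

Each `write_delta` call produces a new table version, which is what makes the `version` argument of the readers useful for time travel.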

@abiratsis

@stinodego do you know if there is any plan to support streaming?

@stinodego
Contributor

Yes, it's planned but won't be there anytime soon:
#11039
