CONVERT TO DELTA as a pure delta-rs API #1682

Closed
ericsun2 opened this issue Sep 29, 2023 · 3 comments · Fixed by #1686
Labels
enhancement New feature or request

Comments

@ericsun2

ericsun2 commented Sep 29, 2023

Description

Provide the equivalent of the Spark SQL statement ``CONVERT TO DELTA parquet.`s3://my-bucket/parquet-data`;`` as a pure delta-rs API, callable directly from Python.

That is, the equivalent of https://docs.delta.io/latest/api/python/index.html#delta.tables.DeltaTable.convertToDelta, but implemented in delta-rs without Spark.

Use Case
Let's say we have a number of Parquet directories that could be quickly and effectively converted to Delta.

A second request: generating a Uniform 3.0 manifest via delta-rs would be great as well.

Related Issue(s)
#1041

@ericsun2 ericsun2 added the enhancement New feature or request label Sep 29, 2023
@rtyler
Member

rtyler commented Sep 29, 2023

@ericsun2 I wrote and use oxbow for these purposes. I'm not sure what an API like that would look like in Python. What you reference is a Spark SQL command, for which we don't have an exact equivalent interface in the Python connector. How would you envision that?

@MrPowers
Collaborator

Here are the PySpark APIs for that functionality:

```python
# Convert unpartitioned parquet table at path 'path/to/table'
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`path/to/table`")

# Convert partitioned parquet table at path 'path/to/table',
# partitioned by an integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`path/to/table`", "part int")
```

Perhaps we could expose something similar in delta-rs?
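One detail such an API would have to handle is the partition spec from the second example: in a Hive-partitioned Parquet directory, the partition values live in the file paths rather than in the files. A stdlib-only sketch of extracting them (`parse_partitions` is a hypothetical helper name, not a delta-rs API):

```python
import re

# Hive-style partition segments look like "part=1" inside the file path.
_PARTITION_RE = re.compile(r"([^/=]+)=([^/=]+)")

def parse_partitions(path: str) -> dict[str, str]:
    """Extract Hive-style partition key/value pairs from a Parquet file path."""
    # Skip the file name itself; only directory segments carry partition values.
    segments = path.split("/")[:-1]
    values = {}
    for seg in segments:
        m = _PARTITION_RE.fullmatch(seg)
        if m:
            values[m.group(1)] = m.group(2)
    return values

print(parse_partitions("path/to/table/part=1/data-0001.parquet"))
# {'part': '1'}
```

Note the values come back as strings; casting them (e.g. `part` to int) is exactly why the PySpark API takes an explicit partition schema like `"part int"`.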

@ericsun2
Author

ericsun2 commented Oct 19, 2023

Both Spark SQL and PySpark already have a similar statement or API, but it makes sense to offer the same capability in delta-rs so that we can invoke "convert to delta" in a lightweight fashion from a non-Spark context (such as a microservice, a Lambda, or a Python script) without a running SQL warehouse or spinning up a PySpark process.

Especially when we have services that generate a lot of Parquet files rather than CSV/JSON files, it would be quite useful to generate the Delta + Iceberg + Hudi manifest metadata directly and swiftly before converting and compacting those Parquet directories into a Delta table. I envision this "convert to delta" function supporting an additional option to also generate Iceberg and Hudi metadata if the user chooses to opt in, which would align well with the Uniform 3.0 standard.
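For intuition, the core of such an in-place conversion is small: it does not rewrite any data, it only writes an initial `_delta_log` commit that registers the existing Parquet files as `add` actions. A rough stdlib-only sketch (illustrative only, not the delta-rs implementation; schema inference from Parquet footers and partition handling are omitted):

```python
import json
import time
import uuid
from pathlib import Path

def convert_parquet_dir_to_delta(table_dir: str) -> Path:
    """Write an initial Delta commit registering every existing Parquet
    file as an `add` action; the data files themselves are untouched."""
    root = Path(table_dir)
    actions = [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": str(uuid.uuid4()),
            "format": {"provider": "parquet", "options": {}},
            # A real converter infers this schema from the Parquet footers.
            "schemaString": json.dumps({"type": "struct", "fields": []}),
            "partitionColumns": [],
            "configuration": {},
        }},
    ]
    now_ms = int(time.time() * 1000)
    for f in sorted(root.rglob("*.parquet")):
        actions.append({"add": {
            "path": str(f.relative_to(root)),
            "size": f.stat().st_size,
            "partitionValues": {},
            "modificationTime": now_ms,
            "dataChange": True,
        }})
    log_dir = root / "_delta_log"
    log_dir.mkdir(exist_ok=True)
    # First commit is version 0, zero-padded to 20 digits per the protocol.
    commit = log_dir / f"{0:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions))
    return commit
```

Because the log only references the files in place, this is cheap even for large directories, which is exactly what makes it attractive from a Lambda or microservice.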

roeap pushed a commit that referenced this issue Nov 12, 2023
# Description
Add a convert_to_delta operation for converting a Parquet table to a
Delta Table in place.

# Related Issue(s)
- closes #1041
- closes #1682